Practical Data Engineering Substack

Practical Data Engineering Substack focuses on data integration, storage engines, distributed systems, and infrastructure. Topics include open-source tools, table formats, data extraction, and pipeline design. The emphasis is on keeping up with current trends and tools while offering practical techniques for efficient data management.

data integration, storage engines, distributed systems, infrastructure, data tools, data lakes, data pipelines

The hottest Substack posts of Practical Data Engineering Substack

And their main takeaways
79 implied HN points 18 Aug 24
  1. The evolution of open table formats has improved data management by introducing log-oriented designs, which record every change to a table and make it easier to track and reason about data over time.
  2. Modern open table formats like Apache Hudi and Delta Lake offer database-like features on data lakes, such as ACID transactions, ensuring data integrity and allowing for easier updates and querying (a minimal sketch follows this list).
  3. New projects are working on creating a unified table format that can work with different technologies. This means that in the future, switching between data formats could be simpler and more streamlined.
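As a concrete illustration of the log-backed, database-like behaviour summarised above, here is a minimal sketch using the open-source `deltalake` Python package (delta-rs); the table path and column names are illustrative assumptions, not taken from the post.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

table_path = "/tmp/events_delta"  # illustrative local table location

# The first write creates the table together with its transaction log (_delta_log/).
write_deltalake(table_path, pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}))

# Appending adds a new commit to the log instead of rewriting existing files.
write_deltalake(table_path, pd.DataFrame({"id": [3], "value": ["c"]}), mode="append")

dt = DeltaTable(table_path)
print(dt.version())    # latest committed version, e.g. 1
print(dt.history())    # per-commit metadata reconstructed from the transaction log
print(dt.to_pandas())  # current snapshot of the table
```

Every write lands as a new commit in the table's log, which is what makes features like history and snapshot reads possible.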
299 implied HN points 28 Jan 24
  1. The open-source data engineering landscape is growing fast, with many new tools and frameworks emerging. Staying updated on these tools is important for data engineers to pick the best options for their needs.
  2. There are different categories of open-source tools like storage systems, data integration, and workflow management. Each category has established players and new contenders, helping businesses solve specific data challenges.
  3. Emerging trends include decoupling storage and compute resources and the rise of unified data lakehouse layers. These advancements make data storage and processing more efficient and flexible.
59 implied HN points 18 Sep 23
  1. Data extraction from relational databases is important for building data pipelines. It's key to choose the right method based on factors like the type and size of the data.
  2. There are several techniques to detect changes in the data, including using timestamps and database triggers. These help identify which new or changed records need to be extracted (a timestamp-based sketch follows this list).
  3. Different data extraction patterns exist, like periodic full reloads or incremental updates. Each method has its pros and cons, and the choice depends on the specific data needs and conditions.
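A minimal sketch of the timestamp-based incremental pattern mentioned above, assuming the source table has an `updated_at` column; the table name, columns, and the use of sqlite3 as a stand-in source database are illustrative assumptions.

```python
import sqlite3

def extract_incremental(conn, last_watermark):
    """Fetch only rows modified since the previous run's high-water mark."""
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # The new watermark is the largest timestamp seen; persist it for the next run.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, payload TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01T00:00:00"), (2, "b", "2024-01-02T00:00:00")],
)

rows, watermark = extract_incremental(conn, "2024-01-01T00:00:00")
print(rows)       # only the row updated after the previous watermark
print(watermark)  # "2024-01-02T00:00:00"
```

The same pattern works against any source table that carries a reliable last-modified column; triggers or log-based change data capture are needed when it does not, for example to capture deletes.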
59 implied HN points 01 Oct 23
  1. You can improve data accuracy by using two pipelines: one for getting recent updates quickly and another for regularly loading the entire dataset. This helps in keeping the data reliable over time.
  2. It's essential to manage pipeline scheduling based on your business's needs, like how often you need updates. You can choose faster updates or less frequent full reloads depending on how critical the data is.
  3. Using tools like Apache Airflow can help organize these pipelines efficiently. You can keep the pipeline code simple by generating tasks dynamically from a list of tables, making it easier to handle many of them (a sketch follows below).
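A minimal sketch of that dynamic task generation, assuming a recent Apache Airflow 2.x; the DAG id, table list, and `load_table` function are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

TABLES = ["orders", "customers", "payments"]  # hypothetical list of source tables

def load_table(table_name: str) -> None:
    # Placeholder for the real extract-and-load logic for one table.
    print(f"loading {table_name}")

with DAG(
    dag_id="incremental_loads",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # frequent incremental runs (Airflow 2.4+; older versions use schedule_interval)
    catchup=False,
) as dag:
    # One load task per table, generated from the list above.
    for table in TABLES:
        PythonOperator(
            task_id=f"load_{table}",
            python_callable=load_table,
            op_kwargs={"table_name": table},
        )
```

Adding or removing a table then only means editing the list, not restructuring the DAG.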
2 HN points 15 Aug 24
  1. Open Table Formats have changed how we store and manage data, making it easier to work with different systems and tools without being locked into one software.
  2. The transition from traditional databases to open table formats has increased flexibility and allowed for better collaboration across various platforms, especially in data lakes.
  3. Despite their advantages, old formats like Hive still face issues like slow performance and over-partitioning, which can make data management challenging as companies grow.
0 implied HN points 26 Aug 23
  1. Managing dependencies between data pipelines is crucial for ensuring that upstream tasks are completed before downstream tasks start. This avoids issues with incomplete or faulty data.
  2. There are different techniques to manage these dependencies, ranging from simple time-based scheduling to orchestration that only starts downstream work once the upstream tasks have completed successfully (see the sketch after this list).
  3. Choosing the right method for managing pipeline dependencies depends on the complexity of the data workflows and the need for independence between different teams and tasks.
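One way to express that completion-based dependency is Airflow's ExternalTaskSensor, shown in the minimal sketch below; the DAG and task ids are hypothetical, and it assumes both pipelines run on the same schedule so their runs line up.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="downstream_reporting",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Wait until the producing pipeline has finished its load for the same logical date
    # (by default the sensor matches the upstream run with the identical schedule time).
    wait_for_load = ExternalTaskSensor(
        task_id="wait_for_orders_load",
        external_dag_id="upstream_ingestion",  # hypothetical upstream DAG
        external_task_id="load_orders",        # hypothetical upstream task
        poke_interval=300,                     # re-check every 5 minutes
    )

    build_report = EmptyOperator(task_id="build_report")  # placeholder downstream work

    wait_for_load >> build_report
```

Simple setups can get by with offset time-based schedules; sensor-style dependencies earn their keep once upstream run times become unpredictable.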
0 implied HN points 25 Aug 24
  1. Data engineering is evolving rapidly, and staying updated on new tools and technologies is important for success in the field.
  2. Mastering the fundamentals, like SQL and Python, is crucial as they form the foundation for using advanced tools effectively.
  3. Open source solutions, like Apache Hudi and XTable, are gaining popularity and can provide great benefits for managing data efficiently.
0 implied HN points 13 Aug 23
  1. Compaction is an important process in key-value databases that helps combine and clean up data files. It removes old or unnecessary data and merges smaller files to make storage more efficient.
  2. Different compaction strategies exist, like Leveled and Size-Tiered Compaction, each with its own benefits and challenges. The choice of strategy depends on the database's read and write patterns (a simplified merge sketch follows this list).
  3. The RUM Conjecture explains the trade-offs in database optimization, balancing read, write, and space efficiency. Improving one aspect can worsen another, so it's key to find the right balance for specific needs.
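A toy sketch of the merge step at the heart of either strategy: newer segments win over older ones, and deleted keys (tombstones) are dropped. Real engines merge sorted files on disk and pick which files to merge based on size tiers or levels; the in-memory dicts here are a simplification.

```python
TOMBSTONE = None  # marker written when a key is deleted

def compact(segments):
    """Merge segments (listed newest first) into one sorted segment."""
    merged = {}
    for segment in segments:
        for key, value in segment.items():
            # The first (newest) value seen for a key wins; older values are superseded.
            merged.setdefault(key, value)
    # Drop tombstones and return the surviving records in key order.
    return {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}

newest = {"a": 3, "c": TOMBSTONE}  # "c" was deleted after the older segment was written
older = {"a": 1, "b": 2, "c": 9}
print(compact([newest, older]))    # {'a': 3, 'b': 2} -- stale and deleted data reclaimed
```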
0 implied HN points 09 Aug 23
  1. Sorted segment files, known as SSTables (Sorted String Tables), help databases manage data more efficiently by keeping key-value records in sorted order. This sorting makes searching and accessing data faster.
  2. In-memory storage, called a Memtable, acts like a buffer that collects new writes before they are saved to disk. This keeps data organized and speeds up how quickly new information can be written (see the sketch after this list).
  3. Using a structure called the LSM-Tree helps optimize how databases write and read data. It focuses on reducing the time and effort it takes to handle a lot of updates and inserts, which is common in many apps.
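A toy sketch of that write path: records accumulate in an in-memory memtable and are flushed to an immutable sorted segment once a threshold is reached. The JSON file format, paths, and threshold are simplifications for illustration; real SSTables are binary, indexed files.

```python
import json

class MemtableStore:
    """Toy LSM-style write path: buffer writes in memory, flush sorted segments to disk."""

    def __init__(self, flush_threshold=3, path_prefix="/tmp/sstable"):
        self.memtable = {}  # in-memory buffer of recent writes
        self.flush_threshold = flush_threshold
        self.path_prefix = path_prefix
        self.segment_count = 0

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Write keys in sorted order so segments can be scanned and merged cheaply.
        path = f"{self.path_prefix}_{self.segment_count}.json"
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.segment_count += 1
        self.memtable = {}  # start a fresh memtable

store = MemtableStore()
for i in range(4):
    store.put(f"user:{i}", {"visits": i})
# After the third write the memtable was flushed to /tmp/sstable_0.json;
# the fourth write waits in the new memtable until the next flush.
```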
0 implied HN points 05 Aug 23
  1. Key-value stores use a simple model where each piece of data has a unique key and its associated value. This makes them great for fast lookups, especially when you only need to search by key.
  2. The log-structured design improves write speed by appending new records to the end of a log instead of updating data in place, with cleanup deferred and batched later. This means the system can handle many writes quickly (see the sketch after this list).
  3. Many modern key-value stores are inspired by early successes like Amazon's Dynamo and Google's Bigtable. These systems have shaped how newer ones are built to be efficient and scalable.
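A toy sketch of that append-only design, in the spirit of Bitcask-style stores: every write is appended to a log file, and an in-memory index maps each key to the byte offset of its latest record, so a lookup by key needs a single seek. The file path and record format are illustrative.

```python
import os

class AppendOnlyKV:
    """Toy log-structured store: append every write, index the latest offset per key."""

    def __init__(self, path="/tmp/kv.log"):
        self.path = path
        self.index = {}  # key -> (byte offset, record length) of the latest record
        self.offset = os.path.getsize(path) if os.path.exists(path) else 0

    def put(self, key, value):
        record = f"{key}\t{value}\n".encode("utf-8")
        with open(self.path, "ab") as log:  # append only; never overwrite in place
            log.write(record)
        self.index[key] = (self.offset, len(record))
        self.offset += len(record)

    def get(self, key):
        if key not in self.index:
            return None
        offset, length = self.index[key]
        with open(self.path, "rb") as f:  # one seek + one read per lookup
            f.seek(offset)
            _, value = f.read(length).decode("utf-8").rstrip("\n").split("\t", 1)
        return value

kv = AppendOnlyKV()
kv.put("user:1", "alice")
kv.put("user:1", "alice_v2")  # the old record stays in the log; only the index moves
print(kv.get("user:1"))       # alice_v2
```

Old versions accumulate in the log until compaction reclaims the space.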
0 implied HN points 19 Aug 23
  1. LSM-Trees are designed to improve the performance of key-value databases, especially for write operations, but they can struggle with reading data quickly.
  2. Innovations such as separating keys from values in the storage model, as in WiscKey, help reduce I/O overhead and improve speed, particularly on SSDs (a simplified sketch follows this list).
  3. Using multi-channel SSDs can further boost performance for LSM-Trees, allowing for faster data processing and better overall efficiency.
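A heavily simplified sketch of that key-value separation: values are appended to a separate value log while the key index holds only small pointers, so sorting and compacting the index never rewrites the large values. In the real design the key index is itself an LSM-tree; a plain dict stands in for it here, and the path is illustrative.

```python
import os

class SeparatedStore:
    """Toy WiscKey-style separation: values in a value log, only pointers in the key index."""

    def __init__(self, vlog_path="/tmp/value.log"):
        self.vlog_path = vlog_path
        self.key_index = {}  # key -> (offset, length) in the value log; stays small
        self.offset = os.path.getsize(vlog_path) if os.path.exists(vlog_path) else 0

    def put(self, key, value: bytes):
        with open(self.vlog_path, "ab") as vlog:
            vlog.write(value)  # the potentially large value goes to the value log
        self.key_index[key] = (self.offset, len(value))  # only a small pointer is indexed
        self.offset += len(value)

    def get(self, key):
        if key not in self.key_index:
            return None
        offset, length = self.key_index[key]
        with open(self.vlog_path, "rb") as vlog:
            vlog.seek(offset)  # random reads like this are cheap on SSDs
            return vlog.read(length)

store = SeparatedStore()
store.put("img:42", b"x" * 1_000_000)  # a 1 MB value adds only a tiny entry to the key index
print(len(store.get("img:42")))        # 1000000
```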