The hottest Data Engineering Substack posts right now

And their main takeaways

PyIceberg: Current State and Roadmap

Ju Data Engineering Newsletter • 396 implied HN points • 28 Oct 24

Improving the user interface is crucial for more teams to use Iceberg, especially those that use Python for their data work.
PyIceberg, the Python implementation of Iceberg, is evolving quickly and currently supports various catalog and file system types.
PyIceberg makes it easy to read and write data, but it still has limitations compared to using Iceberg with Spark, particularly in handling deletes and managing metadata.

Iceberg + Single Node Engines

Ju Data Engineering Newsletter • 515 implied HN points • 17 Oct 24

🕹 Technology Data Engineering Cloud Computing Big Data Software Development Data Management

The use of Iceberg allows for separate storage and compute, making it easier to connect single-node engines to the data pipeline without needing extra steps.
There are different approaches to integrating single-node engines, including running all processes in one worker or handling each transformation with separate workers.
Partitioning data can improve efficiency by letting smaller chunks be processed independently, which eases memory limits and speeds up data handling.

I spent 8 hours learning Parquet. Here’s what I discovered

VuTrinh. • 1658 implied HN points • 24 Aug 24

🕹 Technology Data Engineering Data Storage Data processing Analytics

Parquet is a special file format that organizes data in columns. This makes it easier and faster to access specific data when you don't need everything at once.
The structure of Parquet involves grouping data into row groups and column chunks. This helps balance the performance of reading and writing data, allowing users to manage large datasets efficiently.
Parquet uses smart techniques like dictionary and run-length encoding to save space. These methods reduce the amount of data stored and speed up the reading process by minimizing the data that needs to be scanned.
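
A minimal sketch of the two encodings named above, in plain Python. It is a simplification for illustration and does not reproduce Parquet's actual on-disk layout; the function names and the sample column are made up here.

```python
from itertools import groupby

def dictionary_encode(values):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

def run_length_encode(indices):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(k, len(list(g))) for k, g in groupby(indices)]

def decode(dictionary, runs):
    """Expand the runs and look each index up in the dictionary."""
    return [dictionary[i] for i, n in runs for _ in range(n)]

column = ["US", "US", "US", "VN", "VN", "US", "FR", "FR", "FR", "FR"]
dictionary, indices = dictionary_encode(column)
runs = run_length_encode(indices)
```

Ten strings shrink to a three-entry dictionary plus four short runs, and `decode(dictionary, runs)` restores the original column.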

I spent 5 hours learning how Google manages terabytes of metadata for BigQuery.

VuTrinh. • 399 implied HN points • 17 Sep 24

🕹 Technology Data Engineering Cloud Computing Database Systems Big Data

Metadata is really important because it helps organize and access data efficiently. It tells systems where files are and which ones can be ignored during processing.
Google's BigQuery uses a unique system to manage metadata that allows for quick access and analysis of huge datasets. Instead of putting metadata with the data, it keeps them separate but organized in a smart way.
The way BigQuery handles metadata improves performance by making sure that only the relevant information is accessed when running queries. This helps save time and resources, especially with very large data sets.
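
The idea of using metadata to decide which files can be ignored can be sketched with per-file min/max statistics. This is a generic illustration, not BigQuery's internal design; `FileStats`, `files_to_scan`, and the file names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    min_value: int  # smallest value of the filtered column in this file
    max_value: int  # largest value of the filtered column in this file

def files_to_scan(catalog, low, high):
    """Keep only files whose [min, max] range overlaps the query range [low, high]."""
    return [f.path for f in catalog if f.max_value >= low and f.min_value <= high]

catalog = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]
selected = files_to_scan(catalog, 150, 220)
```

Here a query for values between 150 and 220 never opens `part-0.parquet`, because its statistics show it cannot contain a match.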

Kubernetes for Data Engineers

VuTrinh. • 859 implied HN points • 03 Sep 24

🕹 Technology Data Engineering Cloud Computing DevOps Software Development Infrastructure

Kubernetes is a powerful tool for managing containers, which are bundles of apps and their dependencies. It helps you run and scale many containers across different servers smoothly.
Understanding how Kubernetes works is key. It compares the actual state of your application with the desired state to make adjustments, ensuring everything runs as expected.
To start with Kubernetes, begin small and simple. Use local tools for practice, and learn step-by-step to avoid feeling overwhelmed by its many components.
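
The desired-state versus actual-state comparison can be sketched as a tiny reconcile step. This is a toy model under simplifying assumptions, not Kubernetes code; `reconcile`, `apply_actions`, and the pod names are hypothetical.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions needed to move the actual state toward the desired state."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        return [("start", f"pod-new-{i}") for i in range(diff)]
    if diff < 0:
        # Too many pods: stop the most recently listed ones.
        return [("stop", name) for name in running_pods[diff:]]
    return []

def apply_actions(actions, running_pods):
    """Simulate carrying out the actions and return the new list of pods."""
    pods = list(running_pods)
    for verb, name in actions:
        if verb == "start":
            pods.append(name)
        else:
            pods.remove(name)
    return pods

pods = ["pod-a"]
actions = reconcile(3, pods)
pods = apply_actions(actions, pods)
```

Running `reconcile(3, pods)` again after the actions are applied returns an empty list, which is the steady state a control loop aims for.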

Apache Iceberg Isn't Coming To Save You

SeattleDataGuy’s Newsletter • 341 implied HN points • 27 May 25

🕹 Technology Data science Data Engineering Software Development Information Systems Cloud Computing

Apache Iceberg might seem appealing, but it won't automatically solve your data problems. It's important to really understand what issues you're trying to address before jumping in.
Switching to new tools like Iceberg won't fix a broken data strategy. The focus should be on delivering real business value, not just adopting the latest technology.
If your data team is already doing well and looking to improve, Iceberg could be useful. But make sure it's the right fit for your specific challenges instead of following trends.

I spent 5 hours learning how Google lets us build a Lakehouse.

VuTrinh. • 139 implied HN points • 24 Sep 24

🕹 Technology Cloud Computing Data Engineering Software Development Information Storage

Google's BigLake allows users to access and manage data across different storage solutions like BigQuery and object storage. This makes it easier to work with big data without needing to move it around.
The Storage API enhances BigQuery by letting external tools like Apache Spark and Trino directly access its stored data, speeding up the data processing and analysis.
BigLake tables offer strong security features and better performance for querying open-source data formats, making it a more robust option for businesses that need efficient data management.

Uber’s Big Data Revolution: From MySQL to Hadoop and Beyond

VuTrinh. • 279 implied HN points • 14 Sep 24

🕹 Technology Data Engineering Big Data Cloud Computing Data Management Data Analytics

Uber evolved from simple data management with MySQL to a more complex system using Hadoop to handle huge amounts of data efficiently.
They faced challenges with data reliability and latency, which slowed down their ability to make quick decisions.
Uber introduced a system called Hudi that allowed for faster updates and better data management, helping them keep their data fresh and accurate.

How do we run Kafka 100% on the object storage?

VuTrinh. • 519 implied HN points • 27 Aug 24

🕹 Technology Software Cloud Computing Data Engineering

AutoMQ enables Kafka to run entirely on object storage, which improves efficiency and scalability. This design removes the need for tightly-coupled compute and storage, allowing more flexible resource management.
AutoMQ uses a unique caching system to handle data, which helps maintain fast performance for both recent and historical data. It has separate caches for immediate and long-term data needs, enhancing read and write speeds.
Reliability in AutoMQ is ensured through a Write Ahead Log system using AWS EBS, which helps recover data after crashes. This setup allows for fast failover and data persistence, so no messages get lost.

I spent 4 hours learning Apache Iceberg. Here's what I found.

VuTrinh. • 799 implied HN points • 10 Aug 24

🕹 Technology Data Engineering Software Development Database Management Big Data Cloud Computing

Apache Iceberg is a table format that helps manage data in a data lake. It makes it easier to organize files and allows users to interact with data without worrying about how it's stored.
Iceberg has a three-layer architecture: data, metadata, and catalog, which work together to track and manage the actual data and its details. This structure allows for efficient querying and data operations.
One cool feature of Iceberg is its ability to time travel, meaning you can access previous versions of your data. This lets you see changes and retrieve earlier data as needed.
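
Time travel can be pictured as a log of snapshots, each listing the data files that made up the table at commit time. The sketch below is an assumption-laden toy, not Iceberg's metadata format; `SnapshotLog` and its methods are hypothetical, and commits are assumed to arrive in time order.

```python
class SnapshotLog:
    def __init__(self):
        self.snapshots = []  # (snapshot_id, timestamp_ms, data_files), oldest first

    def commit(self, timestamp_ms, data_files):
        """Record a new table version without discarding earlier ones."""
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append((snapshot_id, timestamp_ms, tuple(data_files)))
        return snapshot_id

    def files_as_of(self, timestamp_ms):
        """Return the data files of the latest snapshot at or before the given time."""
        visible = [s for s in self.snapshots if s[1] <= timestamp_ms]
        if not visible:
            raise ValueError("no snapshot exists at or before that time")
        return visible[-1][2]

log = SnapshotLog()
log.commit(1000, ["a.parquet"])
log.commit(2000, ["a.parquet", "b.parquet"])
log.commit(3000, ["b.parquet", "c.parquet"])
```

Asking for `log.files_as_of(2500)` returns the second version of the table even though a third one has since been committed.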

I spent 7 hours diving deep into Apache Iceberg

VuTrinh. • 339 implied HN points • 31 Aug 24

🕹 Technology Data Engineering Software Development Cloud Computing Big Data Database Management

Apache Iceberg organizes data into a data layer and a metadata layer, making it easier to manage large datasets. The data layer holds the actual records, while the metadata layer keeps track of those records and their changes.
Iceberg's manifest files help improve read performance by storing statistics for multiple data files in one place. This means the reader can access all needed statistics without opening each individual data file.
Hidden partitioning in Iceberg allows users to filter data without needing extra columns, saving space. It records transformations on columns instead, helping streamline queries and manage data efficiently.
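
Hidden partitioning can be sketched as a transform applied to an existing column, with the partition value derived at write time rather than stored as an extra column. The code is a simplified illustration and not Iceberg's implementation; `day_transform`, `write`, `scan`, and the sample rows are hypothetical.

```python
from datetime import datetime, date

def day_transform(ts: datetime) -> date:
    """The partition value is derived from the timestamp, not stored as a column."""
    return ts.date()

def write(rows):
    """Group rows by the transformed value of event_ts."""
    partitions = {}
    for row in rows:
        partitions.setdefault(day_transform(row["event_ts"]), []).append(row)
    return partitions

def scan(partitions, start, end):
    """Filter on event_ts only; the same transform decides which partitions to read."""
    touched, matched = [], []
    for day, part_rows in partitions.items():
        if start.date() <= day <= end.date():
            touched.append(day)
            matched.extend(r for r in part_rows if start <= r["event_ts"] < end)
    return matched, touched

rows = [
    {"id": 1, "event_ts": datetime(2024, 8, 30, 23, 50)},
    {"id": 2, "event_ts": datetime(2024, 8, 31, 9, 15)},
    {"id": 3, "event_ts": datetime(2024, 9, 1, 0, 5)},
]
partitions = write(rows)
matched, touched = scan(partitions, datetime(2024, 8, 31), datetime(2024, 9, 1))
```

The caller filters only on `event_ts`, yet the 30 August partition is never read.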

How did Discord evolve to handle trillions of data points

VuTrinh. • 399 implied HN points • 20 Aug 24

🕹 Technology Data Engineering Software Tools Infrastructure Open Source Data Analytics

Discord started with its own tool called Derived to manage data, but it found this system limited as it grew. They needed a better way to handle complex data tasks.
They switched to using popular tools like Dagster and dbt. This helped them automate and better manage their data processes.
With the new setup, Discord can now make changes quickly and safely, which improves how they analyze and use their vast amounts of data.

How does Notion handle 200 billion data entities?

VuTrinh. • 519 implied HN points • 06 Aug 24

🕹 Technology Data Engineering Database Management Analytics Machine Learning

Notion uses a flexible block system, letting users customize how they organize their notes and projects. Each block can be changed and moved around, making it easy to create what you need.
To manage the huge amount of data, Notion shifted from a single database to a more complex setup with multiple shards and instances. This change helps them handle growing user demand and analytics needs more efficiently.
By creating an in-house data lake, Notion saved a lot of money and improved data processing speed. This new system allows them to quickly get data from their main database for analytics and support new features like AI.

Data Science Weekly - Issue 563

Data Science Weekly Newsletter • 139 implied HN points • 05 Sep 24

🕹 Technology Data science Artificial Intelligence Machine Learning Data Engineering Data Visualization

AI prompt engineering is becoming more important, and experts share helpful tips on how to improve your skill in this area.
Researchers in AI should focus on making an impact through their work by creating open-source resources and better benchmarks.
Data quality is a common concern in many organizations, yet many leaders struggle to prioritize it properly and invest in solutions.

Data Science Weekly - Issue 562

Data Science Weekly Newsletter • 179 implied HN points • 29 Aug 24

🕹 Technology Data science Artificial Intelligence Machine Learning Data Engineering Statistics

Distributed systems are changing a lot. This affects how we operate and program these systems, making them more secure and easier to manage.
Statistics are really important in everyday life, even if we don't see it. Talks this year aim to inspire students to understand and appreciate statistics better.
Understanding how AI models work internally is a growing field. Many AI systems are complex, and researchers want to learn how they make decisions and produce outputs.

How did Facebook design their Real-Time Processing ecosystem

VuTrinh. • 279 implied HN points • 17 Aug 24

🕹 Technology Data Engineering Real-Time Processing System Design Software Architecture Big Data

Facebook's real-time data processing system needs to handle huge amounts of data quickly, with only a few seconds of wait time. This helps in keeping things running smoothly for users.
Their system uses a message bus called Scribe to connect different parts, making it easier to manage data flow and recover from errors. This setup improves how they deal with issues when they arise.
Different tools like Puma and Stylus allow developers to build applications in different ways, depending on their needs. This means teams can quickly create and improve their applications over time.

How Did LinkedIn Handle 7 Trillion Messages Daily With Apache Kafka?

VuTrinh. • 299 implied HN points • 13 Aug 24

🕹 Technology Data Engineering Infrastructure Software Development Open Source

LinkedIn uses Apache Kafka to manage a massive flow of information, handling around 7 trillion messages every day. They set up a complex system of clusters and brokers to ensure everything runs smoothly.
To keep everything organized, LinkedIn has a tiered system where data is processed locally in each data center, then sent to an aggregate cluster. This helps them avoid issues from moving data across different locations.
LinkedIn has an auditing tool to make sure all messages are tracked and nothing gets lost during transmission. This helps them quickly identify any problems and fix them efficiently.

Netflix Data Engineer Stack

VuTrinh. • 359 implied HN points • 30 Jul 24

🕹 Technology Data Engineering Software Tools Streaming Analytics Infrastructure

Netflix's data engineering stack uses tools like Apache Iceberg and Spark for building batch data pipelines. This helps them transform and manage large amounts of data efficiently.
For real-time data processing, Netflix relies on Apache Flink and a tool called Keystone. This setup makes it easier to handle streaming data and send it where it needs to go.
To ensure data quality and scheduling, Netflix has developed tools like the WAP pattern for auditing data and Maestro for managing workflows. These tools help keep the data process organized and reliable.
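
WAP is commonly expanded as write-audit-publish: new data is written to a staging area, checked, and only then made visible. The sketch below illustrates that flow under simple assumptions; it is not Netflix's tooling, and `write_audit_publish` and the audit names are hypothetical.

```python
def write_audit_publish(new_rows, published, audits):
    """Stage new rows, run every audit, and publish only if all audits pass."""
    staged = list(new_rows)  # write: staged rows are not yet visible to readers
    failed = [name for name, check in audits if not check(staged)]  # audit
    if failed:
        return published, failed  # nothing is published
    return published + staged, []  # publish

audits = [
    ("non_empty", lambda rows: len(rows) > 0),
    ("no_null_ids", lambda rows: all(r.get("id") is not None for r in rows)),
]
published = [{"id": 1}]
published, failed = write_audit_publish([{"id": 2}], published, audits)
published_after_bad, failed_bad = write_audit_publish([{"id": None}], published, audits)
```

The second batch fails the `no_null_ids` audit, so the published data is left exactly as it was.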

Diving Deep into LinkedIn's Data Infrastructure: My 6-Hour Learning & Key Takeaways

VuTrinh. • 299 implied HN points • 03 Aug 24

🕹 Technology Data Engineering Software Architecture Databases Distributed Systems Cloud Computing

LinkedIn's data infrastructure is organized into three main tiers: data, service, and display. This setup helps the system to scale easily without moving data around.
Voldemort is LinkedIn's key-value store that efficiently handles high-traffic queries and allows easy scaling by adding new nodes without downtime.
Databus is a change data capture system that keeps LinkedIn's databases synchronized across applications, allowing for quick updates and consistent data flow.

Apache Kafka - Overview

VuTrinh. • 539 implied HN points • 06 Jul 24

🕹 Technology Data Engineering Software Development Systems Architecture Distributed Systems

Apache Kafka is a system for handling large amounts of data messages, making it easier for companies like LinkedIn to manage and analyze user activity and other important metrics.
In Kafka, messages are organized into topics and divided into partitions, allowing for better performance and scalability. This way, different servers can handle parts of the data at once.
Kafka uses a pull model for consumers, meaning they can request data as they need it. This helps prevent overwhelming the consumers with too much data at once.
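
A toy model of topics, partitions, and pull-based reads, kept in memory. It is a sketch under stated assumptions rather than Kafka's client API: the `Topic` class is hypothetical, and CRC32 stands in for whatever hash a real partitioner uses.

```python
import zlib

class Topic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]  # one append-only log each

    def partition_for(self, key: bytes) -> int:
        """Messages with the same key always land in the same partition."""
        return zlib.crc32(key) % len(self.partitions)

    def produce(self, key: bytes, value):
        p = self.partition_for(key)
        self.partitions[p].append(value)
        return p

    def fetch(self, partition, offset, max_records):
        """Pull model: the consumer chooses where to start and how much to take."""
        return self.partitions[partition][offset:offset + max_records]

topic = Topic(num_partitions=3)
placed = [topic.produce(b"user-42", f"click-{i}") for i in range(5)]
first = topic.fetch(placed[0], offset=0, max_records=2)
rest = topic.fetch(placed[0], offset=2, max_records=10)
```

Because the consumer supplies `offset` and `max_records`, it can read two records now and the remainder later at its own pace.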

Practical Data Engineering using AWS Cloud Technologies

VuTrinh. • 339 implied HN points • 23 Jul 24

🕹 Technology Cloud Computing Data Engineering Software Development Information Systems

AWS offers a variety of tools for data engineering like S3, Lambda, and Step Functions, which can help anyone build scalable projects. These tools are often underused compared to newer options but are still very effective.
Services like SNS and SQS can help manage data flow and processing. SNS allows for publishing messages while SQS aids in handling high event volumes asynchronously.
Using AWS for data engineering is often simpler than switching to modern tools. It's easier to add new AWS services to your existing workflow than to migrate to something completely new.

Data Science Weekly - Issue 561

Data Science Weekly Newsletter • 139 implied HN points • 22 Aug 24

🕹 Technology Data science AI Machine Learning Data Engineering Visualization

When building web applications, using Postgres for data storage is a good default choice. It's reliable and widely used.
A new study shows that agents can learn useful skills without rewards or guidance. They can explore and develop abilities just from observing a goal.
A list of important books and resources in Bayesian statistics is being compiled. It's a way to recognize influential ideas in this field.

Is It Time to Say Goodbye to Data Engineers?

SeattleDataGuy’s Newsletter • 812 implied HN points • 06 Feb 25

🕹 Technology Data Engineering Software Development Data Management Business Intelligence Analytics

Data engineers are often seen as roadblocks, but cutting them out can lead to major problems later on. Without them, the data can become messy and unmanageable.
Initially, removing data engineers may seem like a win because things move quickly. However, this speed can cause chaos as data quality suffers and standards break down.
A solid data strategy needs structure and governance. Rushing without proper planning can lead to a situation where everything collapses under the weight of disorganization.

Apache Kafka - Important Designs

VuTrinh. • 259 implied HN points • 13 Jul 24

🕹 Technology Data Engineering Software Design Systems Architecture Distributed Systems Programming

Kafka uses the operating system's filesystem to store data, which helps it run faster by leveraging the page cache. This avoids the need to keep too much data in memory, making it simpler to manage.
Kafka reads and writes data sequentially, which is more efficient than random access. This design improves performance because sequential access reduces delays.
Kafka groups messages together before sending them, which helps reduce the number of requests made to the system. This batching process improves performance by allowing larger, more efficient data transfers.
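
The effect of batching on request count can be sketched with a small buffer that flushes when it fills up. This is an illustrative model, not Kafka's producer; `BatchingSender` and its `requests` list (standing in for network calls) are hypothetical.

```python
class BatchingSender:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []
        self.requests = []  # each entry represents one simulated network request

    def send(self, message):
        """Buffer the message and flush once the batch is full."""
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as a single request."""
        if self.buffer:
            self.requests.append(list(self.buffer))
            self.buffer.clear()

sender = BatchingSender(batch_size=4)
for i in range(10):
    sender.send(f"msg-{i}")
sender.flush()  # send the final partial batch
```

Ten messages go out in three requests instead of ten.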

Python Data Engineering, July 2024

Monthly Python Data Engineering • 179 implied HN points • 25 Jul 24

🕹 Technology Software Development Data Engineering Open Source Programming Languages Data science

The Python Data Engineering newsletter focuses on key updates and tools for building data engineering projects, rather than just data science.
This month showcased rapid development in projects like Narwhals and Polars, with Narwhals making 26 releases and Polars reaching version 1.0.0.
Several other libraries, such as Great Tables and Dask, also had important updates, making it a busy month for Python data engineering tools.

Apache Kafka - Producer

VuTrinh. • 199 implied HN points • 20 Jul 24

🕹 Technology Data Engineering Software Development Cloud Computing Distributed Systems Real-Time Processing

Kafka producers are responsible for sending messages to servers. They prepare the messages, choose where to send them, and then actually send them to the Kafka brokers.
There are different ways to send messages: fire-and-forget, synchronous, and asynchronous. Each method has its pros and cons, depending on whether you want speed or reliability.
Producers can control message acknowledgment with the 'acks' parameter to determine when a message is considered successfully sent. This parameter affects data safety, with options that range from no acknowledgment to full confirmation from all replicas.
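
The three commonly described `acks` settings can be modelled as a small decision function. It is a conceptual sketch, not client code; `is_acknowledged` and its arguments are hypothetical names for illustration.

```python
def is_acknowledged(acks, leader_wrote, in_sync_replicas_wrote):
    """Decide whether the producer treats a send as successful.

    acks="0":   do not wait for any broker response.
    acks="1":   wait for the partition leader to write the message.
    acks="all": wait for the leader and every in-sync replica.
    """
    if acks == "0":
        return True
    if acks == "1":
        return leader_wrote
    if acks == "all":
        return leader_wrote and all(in_sync_replicas_wrote)
    raise ValueError(f"unknown acks setting: {acks}")

fire_and_forget = is_acknowledged("0", False, [False])
leader_only = is_acknowledged("1", True, [False])
all_but_lagging = is_acknowledged("all", True, [True, False])
all_confirmed = is_acknowledged("all", True, [True, True])
```

With `"0"` the producer reports success even though nothing was written, which is the speed-versus-safety trade-off in its starkest form.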

Beyond Big Tech: The Reality Of Data Engineering Outside Silicon Valley

SeattleDataGuy’s Newsletter • 847 implied HN points • 14 Dec 24

🕹 Technology Data Engineering Big Tech Infrastructure Data Systems Business Processes

Working in big tech offers many advantages like better tools and a strong focus on data. This environment makes it easier to get work done quickly and efficiently.
Many companies outside big tech struggle with data because it's not their main focus. They often use a mix of different tools that don't work well together, leading to confusion.
Without a strong data leader, companies may find it hard to prioritize data spending. If data isn't tied to profits, it's tougher to justify investing time and money into it.

5 Habits Of Highly Effective Data Engineers To Master in 2025

SeattleDataGuy’s Newsletter • 800 implied HN points • 20 Dec 24

🕹 Technology Data Engineering Career growth Technical Skills Project management Professional development

Being proactive means solving problems before they become bigger issues. If you see something that can be improved, go ahead and make that change instead of waiting for someone else to do it.
Make sure your contributions are visible, so people recognize your work. Share your successes and updates with your team and leadership to build a stronger reputation.
Become the go-to person for a specific area in your company. Focus on something valuable that can help others succeed, and make sure to share your knowledge and support with your team.

25 Unpopular and Contradictory Opinions on Data and Engineering

The Data Jargon Newsletter • 138 implied HN points • 23 Aug 24

🕹 Technology Data Engineering Software Engineering Business Intelligence Data Strategy Data products

If your data product isn't making money, it's really just an internal tool. It's important to focus on projects that add real value.
Having a good Business Intelligence team can often bring more benefits than trying to make fancy data products. Simple tools can lead to effective data use.
More data engineers can improve your data platform, but just adding analysts might not directly make your data team better. It's all about how the team fits with the organization.

Why Your Data Infrastructure Migration Project Will Fail (And How to Succeed)

SeattleDataGuy’s Newsletter • 376 implied HN points • 12 Feb 25

🕹 Technology Data Engineering Infrastructure Software Development Project management Change Management

Having a clear plan is crucial for successful data migration projects. You need to know what to move and in what order to avoid chaos.
Ownership of the migration process is important. There should be a clear leader or team responsible to keep everything on track.
Testing data after migration is a must. Just moving the data doesn't guarantee that it works the same way, so check for any discrepancies.

GroupBy #42: Paypal - Scaling Kafka

VuTrinh. • 219 implied HN points • 02 Jul 24

🕹 Technology Data Engineering Software Development Cloud Computing Big Data Infrastructure

PayPal operates a massive Kafka system with over 85 clusters and handles around 1.3 trillion messages daily. They manage data growth by using multiple geographical data centers for efficiency.
To improve user experience and security, PayPal developed tools like the Kafka Config Service for easier broker management and added access control lists to restrict who can connect to their Kafka clusters.
PayPal focuses on automation and monitoring, implementing systems to quickly patch vulnerabilities and manage topics, while also optimizing metrics to quickly identify issues with their Kafka platform.

4 Trillion Events Daily at LinkedIn

VuTrinh. • 319 implied HN points • 08 Jun 24

🕹 Technology Data Engineering Real-Time Processing Machine Learning Software Development Cloud Computing

LinkedIn processes around 4 trillion events every day, using Apache Beam to unify their streaming and batch data processing. This helps them run pipelines more efficiently and save development time.
By switching to Apache Beam, LinkedIn significantly improved their performance metrics. For example, one pipeline's processing time went from over 7 hours to just 25 minutes.
Their anti-abuse systems became much faster with Beam, reducing the time taken to identify abusive actions from a day to just 5 minutes. This increase in efficiency greatly enhances user safety and experience.

Apache Kafka - Consumer

VuTrinh. • 119 implied HN points • 27 Jul 24

🕹 Technology Data Engineering Software Development Information Systems Data processing Cloud Computing

Kafka uses a pull model for consumers, allowing them to control the message retrieval rate. This helps consumers manage workloads without being overwhelmed.
Consumer groups in Kafka let multiple consumers share the load of reading from topics, but each partition is only read by one consumer at a time for efficient processing.
Kafka handles rebalancing when consumers join or leave a group. This can be done eagerly, stopping all consumers, or cooperatively, allowing ongoing consumption from unaffected partitions.
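
A rough sketch of partition assignment within a consumer group and of what changes when a member joins. Round-robin assignment is used here only for simplicity, and real assignors may try harder to keep partitions in place; `assign` and `owners` are hypothetical helpers.

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

def owners(assignment):
    """Invert the assignment into a partition -> consumer map."""
    return {p: c for c, parts in assignment.items() for p in parts}

partitions = range(6)
before = assign(partitions, ["c1", "c2"])
after = assign(partitions, ["c1", "c2", "c3"])

# Partitions whose owner changed when c3 joined the group.
moved = sorted(p for p, c in owners(after).items() if owners(before)[p] != c)
```

In an eager rebalance all six partitions would pause; in a cooperative one only the partitions in `moved` need to change hands, while the rest keep being consumed.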

How Twitter processes 4 billion events in real-time daily

VuTrinh. • 339 implied HN points • 25 May 24

🕹 Technology Data Engineering Real-Time Processing Cloud Computing Data architecture Big Data

Twitter processes an incredible 400 billion events daily, using a mix of technologies for handling large data flows. They built special tools to ensure they can keep up with all this information in real-time.
After facing challenges with their old setup, Twitter switched to a new architecture that simplified operations. This new system allows them to handle data much faster and more efficiently.
With the new system, Twitter achieved lower latency and fewer errors in data processing. This means they can get more accurate results and better manage their resources than before.

Monthly Python Data Engineering, August 2024

Monthly Python Data Engineering • 59 implied HN points • 19 Aug 24

🕹 Technology Software Data Engineering Open Source Programming Development

Datafusion Comet was released, making it easier and faster to use Apache Spark for data processing, which is great for improving performance.
Several major data tools like Datafusion, Arrow, and Dask updated their versions, showing ongoing improvements in speed, efficiency, and new features.
New dashboard solutions like Panel and updates in libraries such as CUDF reflect the growing interest in making data access and visualization easier for users.

How does Uber build real-time infrastructure to handle petabytes of data every day?

VuTrinh. • 659 implied HN points • 23 Mar 24

🕹 Technology Data Engineering Infrastructure Real-Time Processing Open Source Big Data

Uber handles huge amounts of data by processing real-time information from drivers, riders, and restaurants. This helps them make quick decisions, like adjusting prices based on demand.
They use a mix of open-source tools like Apache Kafka for data streaming and Apache Flink for processing, which allow them to scale their operations smoothly as the business grows.
Uber values data consistency, high availability, and quick response times in their infrastructure. This means they need reliable systems that work well even when they're overloaded with data.

7 Lessons I Learned the Hard Way From 9+ Years as a Data Engineer

SeattleDataGuy’s Newsletter • 730 implied HN points • 21 Nov 24

🕹 Technology Data Engineering Software Development Career Advice Project management Skills Training

It's important to avoid building complex systems just for the sake of it. Focus on creating infrastructure that actually helps your team and the business.
If you don’t plan your data model, you’ll end up with a messy one. Always take the time to design it properly to make future work easier.
Good communication is really powerful. Being able to share your ideas clearly can help you get support and make a bigger impact in your projects.

Data Science Weekly - Issue 529

Data Science Weekly Newsletter • 999 implied HN points • 12 Jan 24

🕹 Technology Data science Artificial Intelligence Machine Learning Data Engineering Software Development

Using ChatGPT can help you budget better. It can track and categorize your spending easily.
When coding, it's important to find a balance between moving quickly and keeping your code well-structured. This is a real challenge for many developers.
Language models, like GPT-4, are becoming very advanced, but there are big philosophical questions about what that really means for intelligence and understanding.

GroupBy #43: Uber | Kafka - The Tiered Storage

VuTrinh. • 139 implied HN points • 09 Jul 24

🕹 Technology Data Engineering Software Development Cloud Computing Information Systems Big Data

Uber recently introduced Kafka Tiered Storage, which allows storage and compute resources to work separately. This means you can add storage without needing to upgrade processing power.
The tiered storage system has two parts: local storage for fast access and remote storage for long-term data. This setup helps manage data efficiently and keeps the local storage less cluttered.
Older data can be read directly from remote storage, which keeps local storage free to serve applications that need quick access to recent messages.
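
The two-tier layout can be sketched as a log that keeps only the newest messages locally and moves older ones to a remote store. This is a toy model and not Kafka's tiered storage implementation, which works on log segments rather than single messages; `TieredLog` is a hypothetical name.

```python
class TieredLog:
    def __init__(self, local_capacity):
        self.local_capacity = local_capacity
        self.local = {}   # offset -> message, fast tier for recent data
        self.remote = {}  # offset -> message, cheaper tier for older data
        self.next_offset = 0

    def append(self, message):
        """Write locally, then move the oldest messages out once local is full."""
        offset = self.next_offset
        self.local[offset] = message
        self.next_offset += 1
        while len(self.local) > self.local_capacity:
            oldest = min(self.local)
            self.remote[oldest] = self.local.pop(oldest)
        return offset

    def read(self, offset):
        """Serve recent offsets from local storage and older ones from remote."""
        if offset in self.local:
            return self.local[offset], "local"
        return self.remote[offset], "remote"

log = TieredLog(local_capacity=2)
for i in range(5):
    log.append(f"msg-{i}")
```

After five appends only offsets 3 and 4 remain local, and a read of offset 0 is served from the remote tier.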

GroupBy #44: Meta | The Data Stack

VuTrinh. • 119 implied HN points • 16 Jul 24

🕹 Technology Data Engineering Infrastructure Data Analytics Software Development Real-Time Processing

Meta uses a complex data warehouse to manage millions of tables and keeps data only as long as it's needed. Data is organized into namespaces for efficient querying.
They built tools like iData for data discovery and Scuba for real-time analytics. These tools help engineers find and analyze data quickly.
Data engineers at Meta develop pipelines mainly with SQL and Python, using internal tools for orchestration and monitoring to ensure everything runs smoothly.