VuTrinh.

VuTrinh covers comprehensive analyses and tutorials on various data engineering tools and technologies like Parquet, Apache Spark, Apache Kafka, Kubernetes, and data architectures from big tech firms. The posts highlight concepts, implementations, performance enhancements, and best practices in managing large datasets and real-time data processing.

Data Formats, Data Processing, Container Management, Real-Time Data Processing, Data Infrastructure, Cloud Technologies, Data Storage, Data Management

The hottest Substack posts of VuTrinh.

And their main takeaways
1658 implied HN points 24 Aug 24
  1. Parquet is a columnar file format that organizes data by column rather than by row. This makes it easier and faster to access specific columns when you don't need everything at once.
  2. The structure of Parquet involves grouping data into row groups and column chunks. This helps balance the performance of reading and writing data, allowing users to manage large datasets efficiently.
  3. Parquet uses smart techniques like dictionary and run-length encoding to save space. These methods reduce the amount of data stored and speed up reading by minimizing the data that needs to be scanned (see the pyarrow sketch after this list).
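A minimal sketch of these ideas using pyarrow (the file name, columns, and sizes are illustrative, not from the original post):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# "city" has few distinct values, so dictionary encoding (on by default)
# stores each value once plus compact indexes.
table = pa.table({
    "user_id": list(range(1_000_000)),
    "city": ["Hanoi", "Tokyo", "Berlin", "Lima"] * 250_000,
})

# Write with explicit row groups; each row group holds one column chunk per column.
pq.write_table(table, "users.parquet", row_group_size=100_000, compression="snappy")

# Column pruning: read only the columns you need instead of the whole file.
cities = pq.read_table("users.parquet", columns=["city"])

# Row-group statistics (min/max per column chunk) are what let readers skip data.
meta = pq.ParquetFile("users.parquet").metadata
print(meta.num_row_groups, meta.row_group(0).column(0).statistics)
```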
879 implied HN points 07 Sep 24
  1. Apache Spark is a powerful tool for processing large amounts of data quickly. It does this by using many computers to work on the data at the same time.
  2. A Spark application has different parts, like a driver that directs processing and executors that do the work. This helps organize tasks and manage workloads efficiently.
  3. Spark's core data abstraction is the RDD (Resilient Distributed Dataset). RDDs are split into partitions across executors and can be recomputed from their lineage if something goes wrong (see the PySpark sketch after this list).
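A minimal PySpark sketch of the driver/executor split and an RDD job (the app name and data are illustrative):

```python
from pyspark.sql import SparkSession

# The driver builds the job plan; executors run the tasks in parallel.
spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An RDD is a partitioned, fault-tolerant collection spread across executors.
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# Transformations are lazy; the action (sum) triggers the distributed job.
total = rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0).sum()
print(total)

spark.stop()
```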
659 implied HN points 10 Sep 24
  1. Apache Spark uses a system called Catalyst to plan and optimize how data is processed. This system helps make sure that queries run as efficiently as possible.
  2. In Spark 3, a feature called Adaptive Query Execution (AQE) was added. It lets the engine revise its physical plan while a query is running, based on statistics gathered from the stages that have already finished.
  3. Airbnb uses AQE to improve how they handle large amounts of data. Adjusting the plan dynamically at runtime leads to better performance (a sketch of enabling AQE follows this list).
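A hedged sketch of enabling AQE in Spark 3 (the config keys are standard Spark settings; nothing here reflects Airbnb's internal setup):

```python
from pyspark.sql import SparkSession

# Adaptive Query Execution re-optimizes the physical plan at runtime
# using statistics collected from stages that have already completed.
spark = (
    SparkSession.builder
    .appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")                     # turn on AQE
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions
    .getOrCreate()
)
```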
399 implied HN points 17 Sep 24
  1. Metadata is really important because it helps organize and access data efficiently. It tells systems where files are and which ones can be ignored during processing.
  2. Google's BigQuery uses a unique system to manage metadata that allows for quick access and analysis of huge datasets. Instead of putting metadata with the data, it keeps them separate but organized in a smart way.
  3. The way BigQuery handles metadata improves performance by making sure that only the relevant information is accessed when running queries. This helps save time and resources, especially with very large data sets.
859 implied HN points 03 Sep 24
  1. Kubernetes is a powerful tool for managing containers, which are bundles of apps and their dependencies. It helps you run and scale many containers across different servers smoothly.
  2. Understanding how Kubernetes works is key. It continuously compares the actual state of your application with the desired state you declared and makes adjustments until they match.
  3. To start with Kubernetes, begin small and simple. Use local tools for practice, and learn step by step to avoid feeling overwhelmed by its many components (a toy sketch of the control loop follows this list).
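A toy Python sketch of that desired-versus-actual reconciliation idea (this illustrates the control-loop concept only; it is not the Kubernetes API):

```python
import time

def reconcile(desired_replicas: int, actual_replicas: int) -> int:
    """Return how many replicas to add (positive) or remove (negative)."""
    return desired_replicas - actual_replicas

# Kubernetes controllers run loops like this: observe the actual state,
# compare it with the declared desired state, and act on the difference.
desired, actual = 3, 1
while actual != desired:
    diff = reconcile(desired, actual)
    print(f"scaling by {diff:+d} replica(s)")
    actual += diff  # in a real cluster: create or delete Pods via the API server
    time.sleep(1)
```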
139 implied HN points 24 Sep 24
  1. Google's BigLake allows users to access and manage data across different storage solutions like BigQuery and object storage. This makes it easier to work with big data without needing to move it around.
  2. The Storage API enhances BigQuery by letting external engines like Apache Spark and Trino read its stored data directly, speeding up data processing and analysis (see the connector sketch after this list).
  3. BigLake tables offer strong security features and better performance for querying open-source data formats, making it a more robust option for businesses that need efficient data management.
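As one hedged illustration, engines typically reach BigQuery storage through connectors built on the Storage Read API; this PySpark sketch assumes the open-source spark-bigquery connector is on the classpath, and the table name is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read").getOrCreate()

# The spark-bigquery connector reads BigQuery-managed storage in parallel
# through the Storage Read API instead of exporting the data first.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.events")  # placeholder table
    .load()
)
df.printSchema()
```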
279 implied HN points 14 Sep 24
  1. Uber evolved from simple data management with MySQL to a more complex system using Hadoop to handle huge amounts of data efficiently.
  2. They faced challenges with data reliability and latency, which slowed down their ability to make quick decisions.
  3. Uber introduced Apache Hudi, which allowed for faster updates and better data management, helping them keep their data fresh and accurate.
519 implied HN points 27 Aug 24
  1. AutoMQ enables Kafka to run entirely on object storage, which improves efficiency and scalability. This design removes the need for tightly-coupled compute and storage, allowing more flexible resource management.
  2. AutoMQ uses a unique caching system to handle data, which helps maintain fast performance for both recent and historical data. It has separate caches for immediate and long-term data needs, enhancing read and write speeds.
  3. Reliability in AutoMQ is ensured through a Write Ahead Log system using AWS EBS, which helps recover data after crashes. This setup allows for fast failover and data persistence, so no messages get lost.
799 implied HN points 10 Aug 24
  1. Apache Iceberg is a table format that helps manage data in a data lake. It makes it easier to organize files and allows users to interact with data without worrying about how it's stored.
  2. Iceberg has a three-layer architecture: data, metadata, and catalog, which work together to track and manage the actual data and its details. This structure allows for efficient querying and data operations.
  3. One cool feature of Iceberg is time travel, meaning you can access previous versions of your data. This lets you see changes and retrieve earlier data as needed (see the Spark SQL sketch after this list).
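A hedged sketch of Iceberg time travel (assumes a SparkSession named spark configured with an Iceberg catalog; the catalog, table, snapshot id, and timestamp are placeholders, and the SQL syntax is the Spark 3.3+ form):

```python
# Current state of the table
spark.sql("SELECT count(*) FROM demo.db.events").show()

# Time travel to an earlier snapshot id (placeholder id)
spark.sql("SELECT count(*) FROM demo.db.events VERSION AS OF 123456789").show()

# Or to a point in time
spark.sql("SELECT count(*) FROM demo.db.events TIMESTAMP AS OF '2024-08-01 00:00:00'").show()
```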
339 implied HN points 31 Aug 24
  1. Apache Iceberg organizes data into a data layer and a metadata layer, making it easier to manage large datasets. The data layer holds the actual records, while the metadata layer keeps track of those records and their changes.
  2. Iceberg's manifest files help improve read performance by storing statistics for multiple data files in one place. This means the reader can access all needed statistics without opening each individual data file.
  3. Hidden partitioning in Iceberg lets users filter data without maintaining extra partition columns, saving space. It records transformations on existing columns instead, helping streamline queries and manage data efficiently (a DDL sketch follows this list).
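A hedged DDL sketch of hidden partitioning (assumes a SparkSession named spark with an Iceberg catalog; names are placeholders; the days() transform partitions by the day of the timestamp without adding a separate date column):

```python
spark.sql("""
    CREATE TABLE demo.db.events (
        id      BIGINT,
        ts      TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))  -- transform on ts; no extra partition column to maintain
""")

# Filters on ts are pruned to the matching partitions automatically.
spark.sql("SELECT * FROM demo.db.events WHERE ts >= TIMESTAMP '2024-08-01 00:00:00'").show()
```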
399 implied HN points 20 Aug 24
  1. Discord started with an in-house tool called Derived to manage data, but found the system limiting as they grew. They needed a better way to handle complex data tasks.
  2. They switched to using popular tools like Dagster and dbt. This helped them automate and better manage their data processes.
  3. With the new setup, Discord can now make changes quickly and safely, which improves how they analyze and use their vast amounts of data.
519 implied HN points 06 Aug 24
  1. Notion uses a flexible block system, letting users customize how they organize their notes and projects. Each block can be changed and moved around, making it easy to create what you need.
  2. To manage the huge amount of data, Notion shifted from a single database to a more complex setup with multiple shards and instances. This change helps them keep up with growing user demand and analytics needs more efficiently.
  3. By creating an in-house data lake, Notion saved a lot of money and improved data processing speed. This new system allows them to quickly get data from their main database for analytics and support new features like AI.
279 implied HN points 17 Aug 24
  1. Facebook's real-time data processing system needs to handle huge amounts of data with only a few seconds of latency. This helps keep things running smoothly for users.
  2. Their system uses a message bus called Scribe to connect different parts, making it easier to manage data flow and recover from errors. This setup improves how they deal with issues when they arise.
  3. Different tools like Puma and Stylus allow developers to build applications in different ways, depending on their needs. This means teams can quickly create and improve their applications over time.
299 implied HN points 13 Aug 24
  1. LinkedIn uses Apache Kafka to manage a massive flow of information, handling around 7 trillion messages every day. They set up a complex system of clusters and brokers to ensure everything runs smoothly.
  2. To keep everything organized, LinkedIn has a tiered system where data is processed locally in each data center, then sent to an aggregate cluster. This helps them avoid issues from moving data across different locations.
  3. LinkedIn has an auditing tool to make sure all messages are tracked and nothing gets lost during transmission. This helps them quickly identify any problems and fix them efficiently.
359 implied HN points 30 Jul 24
  1. Netflix's data engineering stack uses tools like Apache Iceberg and Spark for building batch data pipelines. This helps them transform and manage large amounts of data efficiently.
  2. For real-time data processing, Netflix relies on Apache Flink and a tool called Keystone. This setup makes it easier to handle streaming data and send it where it needs to go.
  3. To ensure data quality and scheduling, Netflix has developed tools like the WAP (Write-Audit-Publish) pattern for auditing data and Maestro for managing workflows. These tools help keep the data process organized and reliable.
299 implied HN points 03 Aug 24
  1. LinkedIn's data infrastructure is organized into three main tiers: data, service, and display. This setup helps the system to scale easily without moving data around.
  2. Voldemort is LinkedIn's key-value store that efficiently handles high-traffic queries and allows easy scaling by adding new nodes without downtime.
  3. Databus is a change data capture system that keeps LinkedIn's databases synchronized across applications, allowing for quick updates and consistent data flow.
539 implied HN points 06 Jul 24
  1. Apache Kafka is a system for handling large amounts of data messages, making it easier for companies like LinkedIn to manage and analyze user activity and other important metrics.
  2. In Kafka, messages are organized into topics and divided into partitions, allowing for better performance and scalability. This way, different servers can handle parts of the data at once (see the topic-creation sketch after this list).
  3. Kafka uses a pull model for consumers, meaning they can request data as they need it. This helps prevent overwhelming the consumers with too much data at once.
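A hedged sketch of creating a partitioned, replicated topic with the kafka-python admin client (the broker address, topic name, and counts are illustrative):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Six partitions let up to six consumers in a group share the work;
# three replicas keep the data available if a broker fails.
admin.create_topics([
    NewTopic(name="user-activity", num_partitions=6, replication_factor=3)
])
admin.close()
```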
339 implied HN points 23 Jul 24
  1. AWS offers a variety of tools for data engineering like S3, Lambda, and Step Functions, which can help anyone build scalable projects. These tools are often underused compared to newer options but are still very effective.
  2. Services like SNS and SQS can help manage data flow and processing. SNS handles publishing messages, while SQS helps process high event volumes asynchronously (a boto3 sketch follows this list).
  3. Using AWS for data engineering is often simpler than switching to modern tools. It's easier to add new AWS services to your existing workflow than to migrate to something completely new.
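A hedged boto3 sketch of the SNS-publish / SQS-consume pattern (the ARN and URL are placeholders, and the queue is assumed to be subscribed to the topic):

```python
import json
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:events"                      # placeholder
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events-queue"  # placeholder

# Publish an event to the topic; every subscribed queue receives a copy.
sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps({"order_id": 42, "status": "created"}))

# Consume asynchronously from the queue with long polling.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```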
259 implied HN points 13 Jul 24
  1. Kafka uses the operating system's filesystem to store data, which helps it run faster by leveraging the page cache. This avoids the need to keep too much data in memory, making it simpler to manage.
  2. The way Kafka reads and writes data is done in a sequential order, which is more efficient than random access. This design improves performance, as accessing data in a sequence reduces delays.
  3. Kafka groups messages into batches before sending them, which reduces the number of requests made to the brokers. This batching improves performance by allowing larger, more efficient data transfers (see the producer-config sketch after this list).
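A hedged kafka-python sketch of producer-side batching (the values are illustrative; bigger batches mean fewer, larger requests to the broker):

```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    batch_size=64 * 1024,     # accumulate up to 64 KB per partition before sending
    linger_ms=20,             # wait up to 20 ms to fill a batch
    compression_type="gzip",  # compress whole batches rather than single messages
)

for i in range(10_000):
    producer.send("events", f"message-{i}".encode())

producer.flush()
producer.close()
```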
199 implied HN points 20 Jul 24
  1. Kafka producers are responsible for sending messages to servers. They prepare the messages, choose where to send them, and then actually send them to the Kafka brokers.
  2. There are different ways to send messages: fire-and-forget, synchronous, and asynchronous. Each method has its pros and cons, depending on whether you want speed or reliability.
  3. Producers control message acknowledgment with the 'acks' parameter, which determines when a message counts as successfully sent. This setting affects data safety, with options ranging from no acknowledgment at all to confirmation from every in-sync replica (both ideas are sketched after this list).
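A hedged kafka-python sketch of the three send styles and the acks setting (the broker address and topic are placeholders):

```python
from kafka import KafkaProducer
from kafka.errors import KafkaError

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",  # 0 = don't wait, 1 = leader only, "all" = every in-sync replica
)

# 1. Fire-and-forget: send and move on; failures may go unnoticed.
producer.send("events", b"fire-and-forget")

# 2. Synchronous: block until the broker acknowledges (or an error is raised).
try:
    metadata = producer.send("events", b"synchronous").get(timeout=10)
    print(metadata.topic, metadata.partition, metadata.offset)
except KafkaError as exc:
    print("send failed:", exc)

# 3. Asynchronous: register callbacks and keep producing.
future = producer.send("events", b"asynchronous")
future.add_callback(lambda md: print("delivered to partition", md.partition))
future.add_errback(lambda exc: print("failed:", exc))

producer.flush()
```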
219 implied HN points 02 Jul 24
  1. PayPal operates a massive Kafka system with over 85 clusters and handles around 1.3 trillion messages daily. They manage data growth by using multiple geographical data centers for efficiency.
  2. To improve user experience and security, PayPal developed tools like the Kafka Config Service for easier broker management and added access control lists to restrict who can connect to their Kafka clusters.
  3. PayPal focuses on automation and monitoring, implementing systems to quickly patch vulnerabilities and manage topics, while also optimizing metrics to quickly identify issues with their Kafka platform.
319 implied HN points 08 Jun 24
  1. LinkedIn processes around 4 trillion events every day, using Apache Beam to unify their streaming and batch data processing. This helps them run pipelines more efficiently and save development time (a minimal Beam pipeline is sketched after this list).
  2. By switching to Apache Beam, LinkedIn significantly improved their performance metrics. For example, one pipeline's processing time went from over 7 hours to just 25 minutes.
  3. Their anti-abuse systems became much faster with Beam, reducing the time taken to identify abusive actions from a day to just 5 minutes. This increase in efficiency greatly enhances user safety and experience.
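A minimal Apache Beam sketch of the unified model (the data and transforms are illustrative; the same pipeline code can run on batch or streaming runners):

```python
import apache_beam as beam

# The same transforms run on batch or streaming runners;
# only the input source and runner configuration change.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create events" >> beam.Create(["view", "click", "view", "view"])
        | "Pair with one" >> beam.Map(lambda event: (event, 1))
        | "Count per key" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```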
119 implied HN points 27 Jul 24
  1. Kafka uses a pull model for consumers, allowing them to control the message retrieval rate. This helps consumers manage workloads without being overwhelmed.
  2. Consumer groups in Kafka let multiple consumers share the load of reading from topics, but each partition is read by only one consumer in the group at a time, so work is processed efficiently without duplication (see the sketch after this list).
  3. Kafka handles rebalancing when consumers join or leave a group. This can be done eagerly, stopping all consumers, or cooperatively, allowing ongoing consumption from unaffected partitions.
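A hedged kafka-python sketch of a consumer in a group (the broker, topic, and group id are placeholders):

```python
from kafka import KafkaConsumer

# Every consumer started with the same group_id shares the topic's partitions;
# each partition is assigned to exactly one consumer in the group.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
    enable_auto_commit=False,
)

for message in consumer:               # the consumer pulls at its own pace
    print(message.partition, message.offset, message.value)
    consumer.commit()                  # commit the offset only after processing
```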
339 implied HN points 25 May 24
  1. Twitter processes an incredible 400 billion events daily, using a mix of technologies for handling large data flows. They built special tools to ensure they can keep up with all this information in real-time.
  2. After facing challenges with their old setup, Twitter switched to a new architecture that simplified operations. This new system allows them to handle data much faster and more efficiently.
  3. With the new system, Twitter achieved lower latency and fewer errors in data processing. This means they can get more accurate results and better manage their resources than before.
659 implied HN points 23 Mar 24
  1. Uber handles huge amounts of data by processing real-time information from drivers, riders, and restaurants. This helps them make quick decisions, like adjusting prices based on demand.
  2. They use a mix of open-source tools like Apache Kafka for data streaming and Apache Flink for processing, which allow them to scale their operations smoothly as the business grows.
  3. Uber values data consistency, high availability, and quick response times in their infrastructure. This means they need reliable systems that work well even when they're overloaded with data.
139 implied HN points 09 Jul 24
  1. Uber recently introduced Kafka Tiered Storage, which allows storage and compute resources to work separately. This means you can add storage without needing to upgrade processing power.
  2. The tiered storage system has two parts: local storage for fast access and remote storage for long-term data. This setup helps manage data efficiently and keeps the local storage less cluttered.
  3. Older data is served directly from remote storage when needed, which keeps local storage and the page cache free for applications that need quick access to recent messages.
119 implied HN points 16 Jul 24
  1. Meta uses a complex data warehouse to manage millions of tables and keeps data only as long as it's needed. Data is organized into namespaces for efficient querying.
  2. They built tools like iData for data discovery and Scuba for real-time analytics. These tools help engineers find and analyze data quickly.
  3. Data engineers at Meta develop pipelines mainly with SQL and Python, using internal tools for orchestration and monitoring to ensure everything runs smoothly.
399 implied HN points 20 Apr 24
  1. Lakehouse architecture combines the strengths of data lakes and data warehouses. It aims to solve the problems that arise from keeping these two systems separate.
  2. This new approach allows for better data management, including features like ACID transactions and efficient querying of big datasets. It enables real-time analytics on raw data without needing complex data movements.
  3. With the help of technologies like Delta Lake and similar systems, the Lakehouse can handle both structured and unstructured data efficiently, making it a promising solution for modern data needs.
179 implied HN points 18 Jun 24
  1. Airbnb focuses on using open-source tools and contributing back to the community. This helps them build a strong and collaborative data infrastructure.
  2. Their data infrastructure prioritizes scalability and uses specific clusters for different types of jobs. This approach ensures that critical tasks run efficiently without overwhelming the system.
  3. Airbnb has improved their data processing performance significantly, reducing costs while increasing speed. This was achieved through careful planning and migration of their Hadoop clusters.
159 implied HN points 22 Jun 24
  1. Uber uses a Remote Shuffle Service (RSS) to handle large amounts of Spark shuffle data more efficiently. This means data is sent to a remote server instead of being saved on local disks during processing.
  2. By changing how data is transferred, the new system helps reduce failures and improve the lifespan of hardware. Now, servers can handle more jobs without crashing and SSDs last longer.
  3. RSS also streamlines the process for the reduce tasks, as they now only need to pull data from one server instead of multiple ones. This saves time and resources, making everything run smoother.
259 implied HN points 18 May 24
  1. Hadoop Distributed File System (HDFS) is great for managing large amounts of data across many servers. It ensures data is stored reliably and can be accessed quickly.
  2. HDFS uses a NameNode that keeps track of where data is stored and multiple DataNodes that hold actual data copies. This design helps with data management and availability.
  3. Replication is key in HDFS, as it keeps multiple copies of data across different nodes to prevent loss. This makes HDFS robust even if some servers fail.
139 implied HN points 15 Jun 24
  1. Apache Druid is built to handle real-time analytics on large datasets, making it faster and more efficient than Hadoop for certain tasks.
  2. Druid uses a variety of node types—like real-time, historical, broker, and coordinator nodes—to manage data, process queries, and ensure everything runs smoothly.
  3. The architecture allows for quick data retrieval while maintaining high availability and performance, making it a strong choice for applications that need fast, interactive data exploration.
99 implied HN points 25 Jun 24
  1. Uber is moving its huge amount of data to Google Cloud to keep up with its growth. They want a smooth transition that won't disrupt current users.
  2. They are using existing technologies to make sure the change is easy. This includes tools that will help keep data safe and accessible during the move.
  3. Managing costs is a big concern for Uber. They plan to track and control spending carefully as they switch to cloud services.
179 implied HN points 04 May 24
  1. Delta Lake is designed to solve problems with traditional cloud object storage. It provides ACID transactions, making data operations like updates and deletions safe and reliable.
  2. Using Delta Lake, data is stored in Apache Parquet format, allowing for efficient reading and writing. The system tracks changes through a transaction log, which keeps everything organized and easy to manage.
  3. Delta Lake supports advanced features like time travel, allowing users to see and revert to past versions of data. This makes it easier to recover from mistakes and manage data over time (see the PySpark sketch after this list).
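A hedged PySpark sketch of the transaction log and time travel (assumes a SparkSession spark with the Delta Lake package configured and an existing DataFrame df; the path and version are placeholders):

```python
# Each write appends a commit to the table's _delta_log directory,
# which is what makes ACID operations and time travel possible.
df.write.format("delta").mode("overwrite").save("/data/events")

# Read the table as of an earlier version or timestamp.
v1 = spark.read.format("delta").option("versionAsOf", 1).load("/data/events")
old = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-08-01 00:00:00")
    .load("/data/events")
)
```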
119 implied HN points 04 Jun 24
  1. Uber is upgrading its data system by moving from its huge Hadoop setup to Google Cloud Platform for better efficiency and performance.
  2. Apache Iceberg is an important tool for managing data efficiently, and it can help create a more organized data environment.
  3. Building data products requires a strong foundation in data engineering, which includes understanding the tools and processes involved.
299 implied HN points 09 Mar 24
  1. Docker helps you package your applications and everything they need into containers. This makes it easier to deploy and run your apps anywhere.
  2. Containers are lighter than virtual machines because they share the host's operating system, saving resources and simplifying management.
  3. To get started with Docker, install it, then run a simple command to create your first container, like 'docker run hello-world' - it’s that straightforward!
79 implied HN points 29 Jun 24
  1. YouTube built Procella to combine different data processing needs into one powerful SQL query engine. This means they can handle many tasks, like analytics and reporting, without needing separate systems for each task.
  2. Procella is designed for high performance and scalability by keeping computing and storage separate. This makes it faster and more efficient, allowing for quick data access and analysis.
  3. The engine uses clever techniques to reduce delays and improve response times, even when many users are querying at once. It constantly optimizes and adapts, making sure users get their data as quickly as possible.
139 implied HN points 21 May 24
  1. Working on pet projects is fun, but it's important to have clear learning goals to actually gain knowledge from them.
  2. When using tools like Spark or Airflow, always ask what problem they solve to understand their value better.
  3. To make your projects more effective, think like a user and check if they get what they need from your data systems.
119 implied HN points 11 May 24
  1. Google File System (GFS) is designed to handle huge files and many users at once. Instead of overwriting data, it mainly focuses on adding new information to files.
  2. The system uses a single master server to manage file information, making it easier to keep track of where everything is stored. Clients communicate directly with chunk servers for faster data access.
  3. GFS prioritizes reliability by storing multiple copies of data on different chunk servers. It constantly checks for errors and can quickly restore lost or corrupted data from healthy replicas.
59 implied HN points 11 Jun 24
  1. Meta has developed a serverless Jupyter Notebook platform that runs directly in web browsers, making data analysis more accessible.
  2. Airflow is being used to manage over 2,000 dbt models, which helps teams create and maintain their own data models effectively.
  3. Building a data platform from scratch can be a valuable learning experience, revealing important lessons about data structure and management.