VuTrinh.

VuTrinh covers comprehensive analyses and tutorials on various data engineering tools and technologies like Parquet, Apache Spark, Apache Kafka, Kubernetes, and data architectures from big tech firms. The posts highlight concepts, implementations, performance enhancements, and best practices in managing large datasets and real-time data processing.

Data Formats, Data Processing, Container Management, Real-Time Data Processing, Data Infrastructure, Cloud Technologies, Data Storage, Data Management

The hottest Substack posts of VuTrinh, and their main takeaways.
59 implied HN points 28 May 24
  1. When learning something new, it's good to start by asking yourself why you want to learn it. This helps set clear goals and expectations.
  2. Focusing on one topic at a time can make learning easier. Instead of spreading your time thin, dive deep into one subject.
  3. It's okay to feel stuck sometimes while learning. Just keep pushing through, relax, and remember that learning is a journey that takes time.
99 implied HN points 06 Apr 24
  1. Databricks built the Photon engine for its Lakehouse, a single system that combines the benefits of data lakes and data warehouses. This makes it easier and cheaper for companies to manage all their data in one place.
  2. Photon is designed to handle various types of raw data and is built as a native, vectorized engine rather than relying on traditional JVM-based execution. This means it can work faster with different kinds of data without getting bogged down.
  3. To ensure that existing customers using Apache Spark can easily switch to Photon, the new engine is integrated with Spark’s system. This allows users to continue running their current processes while benefiting from the speed of Photon.
99 implied HN points 30 Mar 24
  1. Apache Pinot is a real-time OLAP system developed by LinkedIn that allows for fast analytics on large sets of data. It can handle tens of thousands of analytical queries per second while providing near-instant results.
  2. The architecture is divided into key components like controllers, brokers, and servers which work together to process queries and manage data efficiently. Pinot is designed to quickly ingest and query fresh data from various sources, ensuring low latency.
  3. Pinot supports various indexing strategies, like star-tree indexes, to optimize complex queries. This enables faster query responses by pre-aggregating data, making it easier to analyze large volumes of information.
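A minimal sketch of the pre-aggregation idea behind that last point (this is not Pinot's star-tree implementation; the data, dimensions, and metric are invented for illustration): aggregates are materialized at ingest time, so a group-by query becomes a lookup instead of a scan.

```python
from collections import defaultdict

# Raw events: (country, channel, clicks). Purely illustrative data.
rows = [("US", "web", 3), ("US", "app", 5), ("EU", "web", 2)]

# Build one pre-aggregated roll-up at ingest time (a star-tree keeps many such
# roll-ups across dimension combinations; this sketch keeps only one).
clicks_by_country = defaultdict(int)
for country, _channel, clicks in rows:
    clicks_by_country[country] += clicks

# "SELECT SUM(clicks) ... GROUP BY country" is now answered from the roll-up.
print(clicks_by_country["US"])  # -> 8
```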
139 implied HN points 17 Feb 24
  1. BigQuery manages data using immutable files, meaning once data is written, it cannot be changed. This helps in processing data efficiently and maintains data consistency.
  2. When you perform actions like insert, delete, or update, BigQuery creates new files instead of changing existing ones. This approach enables features like time travel, which lets you view past states of your data (see the example after this list).
  3. BigQuery uses a system called storage sets to handle operations. These sets help ensure processes are performed atomically and consistently, maintaining data integrity during changes.
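As a concrete example of the time-travel feature mentioned above, BigQuery exposes past snapshots through `FOR SYSTEM_TIME AS OF`; a rough sketch with the Python client (project, dataset, and table names are hypothetical, and default credentials are assumed):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

# Read the table as it looked one hour ago. This works because committed data
# lives in immutable files, so older file sets can still be resolved.
sql = """
SELECT *
FROM `my_project.my_dataset.orders`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""
for row in client.query(sql).result():
    print(row)
```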
59 implied HN points 14 May 24
  1. Netflix has a strong data engineering stack that supports both batch and streaming data pipelines. It focuses on building flexible and efficient data architectures.
  2. Atlassian has revamped its data platform to include a new deployment capability inspired by technologies like Kubernetes. This helps streamline their data management processes.
  3. Migrating from dbt Cloud can teach valuable lessons about data development. Companies should explore different options and learn from their migration journeys.
159 implied HN points 20 Jan 24
  1. BigQuery brought SQL back after Google had largely moved away from it, making data analysis fast and easy. Users can now analyze huge datasets quickly without complex coding.
  2. It separates storage and compute resources, allowing for better performance and flexibility. This means you can scale them independently, which is very efficient.
  3. Dremel's serverless architecture means you don’t need to manage servers. You just use SQL, and everything else is automatically handled for you.
79 implied HN points 13 Apr 24
  1. Photon engine uses columnar data layout to manage memory efficiently, allowing it to process data in batches. This helps in speeding up data operations.
  2. It supports adaptive execution, meaning the engine can change how it processes data based on the input. This can significantly improve performance, especially when data has many NULLs or inactive rows (a minimal sketch follows this list).
  3. Photon integrates with Databricks runtime and Spark SQL, allowing it to enhance existing workloads without completely replacing the old system, making transitions smoother.
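To make the adaptive, batch-at-a-time idea more concrete, here is a minimal sketch (this is not Photon's code, which is a native C++ engine; NumPy stands in for a columnar batch):

```python
import numpy as np

# One column batch plus its null mask (stand-ins for a real columnar format).
values = np.array([1.0, 2.0, 3.0, 4.0])
nulls = np.array([False, False, True, False])

def batch_sum(values, nulls):
    # The adaptive check happens once per batch, not once per row.
    if not nulls.any():
        return values.sum()          # vectorized fast path, no null handling
    return values[~nulls].sum()      # null-aware path only when needed

print(batch_sum(values, nulls))      # -> 7.0
```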
59 implied HN points 07 May 24
  1. Hybrid transactional/analytical storage combines different types of data processing. This helps companies like Uber manage their data more efficiently.
  2. The shift from predictive to generative AI is changing how companies use machine learning. Uber's Michelangelo platform shows how this new approach can improve AI applications.
  3. Data reliability and observability are important for businesses as their data grows. Companies need tools to quickly find and fix data issues to keep their operations running smoothly.
119 implied HN points 27 Jan 24
  1. Rust uses ownership to manage memory, meaning each value has a single owner. When that owner goes out of scope, the memory gets freed automatically.
  2. Python (CPython) manages memory with reference counting, tracking how many references point to an object; once no references remain, the memory is reclaimed, and a cycle collector cleans up reference cycles (see the snippet after this list).
  3. Rust's approach gives developers more control but requires them to understand ownership rules, while Python's method is easier for beginners but can slow down performance.
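The Python half of that comparison is easy to observe directly with `sys.getrefcount` (the Rust half is enforced at compile time, so it has no runtime counterpart to print):

```python
import sys

data = [1, 2, 3]
# getrefcount reports one extra reference for the temporary argument binding.
print(sys.getrefcount(data))   # 2

alias = data                   # a second name now refers to the same list
print(sys.getrefcount(data))   # 3

del alias                      # the count drops; at zero, CPython frees the object
print(sys.getrefcount(data))   # back to 2
```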
79 implied HN points 16 Mar 24
  1. Amazon Redshift is designed as a massively parallel processing data warehouse in the cloud, making it effective for handling large data sets efficiently. It changes how data is stored and queried compared to traditional systems.
  2. The system uses a unique compilation service that generates specific code for queries, which helps speed up processing by caching compiled code. This means Redshift can reuse code for similar queries, reducing wait times.
  3. Redshift also uses machine learning techniques to optimize operations, such as predicting resource needs and automatically adjusting performance settings. This allows it to scale effectively and maintain high performance during heavy workloads.
59 implied HN points 16 Apr 24
  1. Uber successfully migrated over a trillion entries of its ledger data to a new database called LedgerStore without causing disruptions. This shows how careful planning can make big data moves smooth.
  2. Airbnb has open-sourced a machine learning feature platform called Chronon, which helps manage data and makes it easier for engineers to work with different data sources. This promotes collaboration and innovation in the tech community.
  3. The GrabX Decision Engine boosts experimentation on online platforms by providing tools for better planning and analyzing experiments. This can lead to more informed decisions and improved outcomes in projects.
1 HN point 21 Sep 24
  1. ClickHouse built its internal data warehouse to better understand customer usage and improve its services. They collected data from multiple sources to gain valuable insights.
  2. They use tools like Airflow for scheduling and Superset for data visualization, making their data processing efficient. This setup allows them to handle large volumes of data daily.
  3. Over time, ClickHouse evolved its system by adding dbt for data transformation and improving user experiences with better SQL query tools. They also incorporated real-time data to enhance their reporting.
119 implied HN points 06 Jan 24
  1. BigQuery uses a processing engine called Dremel, which takes inspiration from how MapReduce handles data. It improves how data is shuffled between workers for faster processing.
  2. Traditional approaches have issues like resource fragmentation and unpredictable scaling when dealing with huge data. Dremel solves this by managing shuffle storage separately from the worker, which helps in scaling and resource management.
  3. By separating the shuffle layer, Dremel reduces latency, improves fault tolerance, and allows for more flexible worker allocation during execution. This makes it easier to handle larger data sets efficiently.
79 implied HN points 02 Mar 24
  1. Snowflake has a unique design with three main layers: storage, virtual warehouse, and cloud service. This structure helps manage data efficiently and ensures high availability.
  2. The system uses a special ephemeral storage for temporary data during queries, which allows for quick access and less strain on the overall system. This helps with performance and reduces network load.
  3. Snowflake is designed for flexibility, allowing it to adapt resources based on customer needs and workloads. This elasticity helps provide better performance and efficiency.
59 implied HN points 02 Apr 24
  1. Uber is focusing on building strong AI and machine learning infrastructure to keep up with the growing complexity of their models. This involves using both CPUs and GPUs for better efficiency.
  2. Data management is becoming crucial for companies like Netflix as they deal with massive amounts of production data. They are developing tools to effectively manage and optimize this data.
  3. The data streaming landscape is evolving, with new technologies emerging that make handling data easier and more efficient. This is changing how companies approach data infrastructure.
79 implied HN points 24 Feb 24
  1. BigQuery processes SQL queries by planning, optimizing, and executing them. It starts by validating the query and creating an efficient execution plan.
  2. The query execution uses a dynamic tree structure that adjusts based on data characteristics. This helps to manage different types of queries more effectively.
  3. Key components of BigQuery include the Query Master for planning, the Scheduler for assigning resources, and Worker Shards that carry out the actual computations.
59 implied HN points 26 Mar 24
  1. Tableflow allows you to easily turn Apache Kafka topics into Iceberg tables, which could change how streaming data is managed.
  2. Kafka's new tiered storage feature helps separate compute and storage, making it easier to manage resources and keep systems running smoothly.
  3. Data governance is important but can be lackluster if it doesn't show clear business benefits, making us rethink its role in today's data landscape.
79 implied HN points 10 Feb 24
  1. Snowflake separates storage and compute, allowing for flexible scaling and improved performance. This means that data storage can grow separately from computing power, making it easier to manage resources.
  2. Data can be stored in a cloud-based format that supports both structured and semi-structured data. This flexibility allows users to easily handle various data types without needing to define a strict schema.
  3. Snowflake implements unique optimization techniques, like data skipping and a push-based query execution model, which enhance performance and efficiency when processing large amounts of data.
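A minimal sketch of the data-skipping idea from that last point (this is not Snowflake's implementation; the partition layout and metadata are invented): each chunk of data keeps per-column min/max metadata, and a filter can rule out whole chunks without reading their rows.

```python
# Each "micro-partition" carries min/max metadata for a column.
partitions = [
    {"min": 0, "max": 99, "rows": list(range(0, 100))},
    {"min": 100, "max": 199, "rows": list(range(100, 200))},
]

def scan(partitions, lo, hi):
    for part in partitions:
        if part["max"] < lo or part["min"] > hi:
            continue  # skipped: metadata proves there can be no match
        yield from (r for r in part["rows"] if lo <= r <= hi)

print(list(scan(partitions, 150, 152)))  # only the second partition is read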
39 implied HN points 27 Apr 24
  1. Google Cloud Dataflow is a service that helps process both streaming and batch data. It aims to ensure correct results quickly and cost-effectively, useful for businesses needing real-time insights.
  2. The Dataflow model separates the logical data processing from the engine that runs it. This allows users to choose how they want to process their data while still using the same fundamental tools.
  3. Windowing and triggers are important features in Dataflow. They help organize and manage how data is processed over time, allowing for better handling of events that come in at different times.
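The Dataflow model is what became Apache Beam, so its windowing can be sketched with the Beam Python SDK (a minimal example on the local runner; the events and 60-second window size are made up):

```python
import apache_beam as beam
from apache_beam import window

events = [("user1", 1, 5.0), ("user1", 1, 70.0), ("user2", 1, 10.0)]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(events)
        # Attach event-time timestamps so windowing has something to group on.
        | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
        # Group elements into fixed 60-second event-time windows, keyed by user.
        | beam.WindowInto(window.FixedWindows(60))
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )
```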
39 implied HN points 09 Apr 24
  1. LedgerStore at Uber can handle trillions of indexes, making it a powerful tool for managing large-scale data efficiently.
  2. Apache Calcite helps build flexible data systems with strong query optimization features, which are vital for many data applications.
  3. Spotify's data platform plays a critical role in their operations, guiding how to build effective data systems in organizations.
39 implied HN points 12 Mar 24
  1. GitHub uses a merge queue system that helps them quickly ship many code changes each day. This makes their deployment process faster and more efficient.
  2. Data governance is becoming really important, especially with the rise of generative AI. Companies need to ensure the data used by these systems is accurate and secure.
  3. The idea of 'Good Enough' data models suggests that it's okay to have models that meet basic needs instead of striving for perfection. This approach can save time and resources.
59 implied HN points 13 Jan 24
  1. BigQuery uses definition and repetition levels to store nested and repeated data efficiently. This allows reading specific parts of the data without needing to access other related data (see the example after this list).
  2. In columnar storage, data is organized by columns which can improve performance, especially for analytical queries, because only the needed columns are loaded.
  3. Using this method might increase file sizes due to redundancy, but it helps reduce the input/output operations needed when accessing nested fields.
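The same definition/repetition-level encoding from the Dremel paper is what Parquet uses for nested data, so the column-pruning benefit is easy to see with pyarrow (a small sketch; the file name and schema are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A table with a nested (repeated) column: each row has a list of tags.
table = pa.table({
    "id": [1, 2],
    "tags": [["a", "b"], []],
})
pq.write_table(table, "events.parquet")

# Read back only the nested column; the "id" column is never decoded.
tags_only = pq.read_table("events.parquet", columns=["tags"])
print(tags_only)
```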
19 implied HN points 30 Apr 24
  1. Netflix has created a platform called Data Gateway that helps their developers manage data more easily. It simplifies complex database processes so that app developers can focus on coding.
  2. The cloud storage triad talks about balancing latency, cost, and durability when storing data. Choosing the right storage solution can save money while ensuring data is always available.
  3. Managing data ingestion effectively is crucial for companies like RevenueCat. They faced challenges moving their data and found ways to optimize the process for better performance.
19 implied HN points 23 Apr 24
  1. Canva's usage of creator content has skyrocketed, roughly doubling every 18 months, and managing the architecture that tracks this data is a significant challenge.
  2. Uber has developed strong testing and monitoring processes for its financial accounting data. This ensures accuracy and presents reliable external financial reports.
  3. With the rise of data lakehouses, utilizing tools like Apache Hudi and Paimon can enhance data storage and performance. These tools help build efficient and scalable data solutions.
19 implied HN points 19 Mar 24
  1. Balancing your data infrastructure is key for efficiency and reliability. Companies like Uber face challenges in maintaining this balance as they scale up their data needs.
  2. Figma's database team has successfully handled a massive growth in data since 2020, showing that scaling can lead to new technical challenges but also growth opportunities.
  3. Optimizing data pipelines can save significant costs. Techniques to reduce data shuffling in processes like Apache Spark can help make data handling more efficient.
39 implied HN points 05 Dec 23
  1. AWS re:Invent 2023 announced new features focused on improving data storage and processing. This includes faster storage options and AI capabilities for better data insights.
  2. Lyft switched from using Druid to ClickHouse for their analytics needs. This change was driven by a need for faster data query responses.
  3. Apache Hudi was created to help manage data in a more efficient way. It enables incremental data processing, making it easier to work with large amounts of information.
19 implied HN points 05 Mar 24
  1. Stream processing has evolved significantly over the years, with frameworks like Samza and Flink leading the way in handling real-time data streams.
  2. DoorDash developed its own search engine using Apache Lucene, achieving impressive performance improvements, like reduced latency and lower hardware costs.
  3. Understanding metrics trees is essential for businesses as they visually represent how different inputs contribute to outputs, helping in decision-making.
39 implied HN points 31 Oct 23
  1. Data engineers are becoming more important in the tech world as they handle vast amounts of data. Their role is focused on building systems that allow for efficient data handling and analysis.
  2. Levels of abstraction in data engineering can be confusing, leading to challenges in understanding systems. It’s important to find a balance between using abstractions and being able to see the underlying processes.
  3. Good data modeling practices can help organizations make better use of their time-series data. Understanding how to structure data effectively is key to unlocking its value.
19 implied HN points 20 Feb 24
  1. Meta is heavily invested in Python, and they're working on improvements to enhance its performance and usability.
  2. Uber has developed a powerful database called Docstore that can handle over 40 million reads per second, demonstrating their capability in data management.
  3. Data, while useful, doesn't capture the complete reality, and it's important to recognize its limitations in understanding complex scenarios.
19 implied HN points 03 Feb 24
  1. DuckDB is easy to use because it works like SQLite, running directly inside applications without needing a separate server, which makes it simpler to manage (see the example after this list).
  2. It processes data in batches through vectorization, which means it can handle multiple records at once, making operations faster than traditional row-by-row processing.
  3. DuckDB supports ACID transactions, ensuring that data remains safe and reliable, which is important in data analytics and shared environments.
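A quick illustration of the embedded, in-process model from the first point, using the DuckDB Python package (the database file name and table are made up):

```python
import duckdb

# No server process: the whole database lives in one local file (or in memory).
con = duckdb.connect("analytics.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS events (user_id INTEGER, amount DOUBLE)")
con.execute("INSERT INTO events VALUES (1, 9.5), (1, 3.0), (2, 7.25)")

# The vectorized engine processes whole column batches per operator call.
print(con.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall())
```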
19 implied HN points 16 Jan 24
  1. Uber improved its Presto reliability by tuning garbage collection. This helps the system run better and more dependably.
  2. Meta is making strides in generative AI, focusing on how it can bring new advancements. The future looks promising for AI technologies.
  3. Python 3.13 introduces an experimental Just-In-Time (JIT) compiler, which could speed up Python programs. This is a promising development for Python users.
19 implied HN points 09 Jan 24
  1. Pinterest has developed a new wide column database using RocksDB for better data handling. This helps them manage large amounts of data more efficiently.
  2. Grab improved Kafka's fault tolerance on Kubernetes, ensuring their real-time data streaming service runs smoothly even when problems occur.
  3. The newsletter will evolve, offering more content types like curated resources on data engineering and personal insights every week.
19 implied HN points 04 Jan 24
  1. There's a referral program where you can refer friends to subscribe and earn gifts as rewards.
  2. You can expect two main types of emails: one that curates valuable data engineering resources and another that shares insights I've learned from others.
  3. You have control over how many emails you receive, so you can choose to get only the ones you want.
19 implied HN points 02 Jan 24
  1. Uber has developed an anomaly detection system called uVitals, which helps identify issues before they become major problems. It analyzes data patterns to catch anomalies early.
  2. Data modeling is essential for creating structured databases that allow for better analysis and comparisons. It's important for data projects to have clear designs.
  3. As the field of data engineering evolves, new roadmaps and resources are emerging to guide professionals in developing necessary skills. Staying updated can help engineers advance their careers.
19 implied HN points 19 Dec 23
  1. To be a Senior Individual Contributor at Meta, focus on quickly adding value and aligning with the organization's goals. It's about making an impact and building good relationships within the team.
  2. Data modeling involves creating a shared understanding between business and data teams. It's essential for delivering valuable insights and ensuring everyone is on the same page.
  3. Job hopping in data engineering can be successful with the right approach. Make sure to deliver value early on and always be ready for new opportunities while enjoying your work-life balance.
19 implied HN points 12 Dec 23
  1. Kubernetes can be tricky to explain, but using simple analogies can help anyone understand its purpose. It's like managing many containers, just like an Uber driver manages different passengers.
  2. Data modeling is essential for data engineers to organize and structure data effectively. This helps make data more accessible and useful for analysis.
  3. Learning resources, such as free online courses, are available to help you start or improve your skills in data engineering. They cover various important topics for new and experienced data engineers.
19 implied HN points 24 Oct 23
  1. Meta has introduced developer tools that help manage large-scale projects efficiently. These tools assist engineers in solving problems and improving systems.
  2. Big companies like Discord and Uber are using massive data points to create valuable insights. This helps them to effectively manage their data and understand trends better.
  3. Data engineering continues to evolve, with tools like BigQuery and dbt Mesh enhancing data practices. Staying updated with these tools can improve data analysis and management.
19 implied HN points 17 Oct 23
  1. S3 is a big storage system used for data, and understanding how it's built can help improve data handling. It's cool to know how tech like this works.
  2. Running Kafka at scale is interesting, especially for companies like Pinterest. It shows how important reliable data flow is in tech.
  3. There's a trend of making things simpler and more efficient in engineering. Sometimes, going back to basics can solve complex problems.
19 implied HN points 08 Sep 23
  1. Kappa architecture simplifies data processing by combining batch and stream processing. This makes handling data more efficient compared to the traditional Lambda architecture.
  2. Presto is a powerful tool for querying large datasets, and Meta has valuable insights on using it effectively. Learning from their experience can help other teams improve their data operations.
  3. Data quality is crucial in analytics, and there are specific metrics to help measure it. Keeping track of these can prevent problems that arise from poor data.
0 implied HN points 14 Nov 23
  1. The FDAP stack (Flight, DataFusion, Arrow, and Parquet) is important in building reliable data systems. It helps manage data more efficiently by building on these shared open-source technologies.
  2. Learning about data quality is crucial. It ensures that the information used for decision-making is accurate and trustworthy.
  3. Data-driven management is all about making decisions based on solid data insights. It helps businesses understand what works and what doesn't.