VuTrinh.

VuTrinh offers in-depth analyses and tutorials on data engineering tools and technologies such as Parquet, Apache Spark, Apache Kafka, and Kubernetes, along with data architectures from big tech firms. The posts cover concepts, implementations, performance tuning, and best practices for managing large datasets and real-time data processing.

Data Formats · Data Processing · Container Management · Real-Time Data Processing · Data Infrastructure · Cloud Technologies · Data Storage · Data Management

The hottest Substack posts of VuTrinh, and their main takeaways.
0 implied HN points 06 Feb 24
  1. Designing resilient, scalable data systems means they can absorb growth and recover from failures efficiently.
  2. Data modeling is more than just making diagrams; it's about understanding the entire system and how data flows within it.
  3. Using tools like DuckDB in the browser can open up new possibilities for data processing, making it more accessible and flexible.
0 implied HN points 28 Nov 23
  1. Meta is working on improving how developers use Python, making it smoother with better tools like a new linter.
  2. Netflix has built a system for processing data incrementally using Apache Iceberg, which helps manage and update data efficiently.
  3. There are free courses available from Microsoft and Google Cloud that teach the basics of Generative AI, helping anyone to get started in this exciting field.
0 implied HN points 27 Feb 24
  1. Grab is working on letting users analyze data quickly with their new approach to data lakes. This helps businesses get insights much faster.
  2. Meta is aligning Velox and Apache Arrow to improve data management. This should make it easier to handle and analyze large amounts of data.
  3. PayPal is using Spark 3 and NVIDIA's GPUs to cut their cloud costs by up to 70%. This helps them process a lot of data without spending too much money.
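The summary doesn't include PayPal's actual settings, but enabling the RAPIDS Accelerator on Spark 3 generally comes down to loading a plugin and declaring GPU resources. A rough configuration sketch (the resource amounts are illustrative, not PayPal's values):

```properties
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.enabled=true
spark.executor.resource.gpu.amount=1
spark.task.resource.gpu.amount=0.25
```

With the plugin active, supported SQL and DataFrame operators are transparently executed on the GPU, which is where the cost savings come from.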
0 implied HN points 13 Feb 24
  1. The data engineering field is evolving, and it's important to understand the upcoming trends that will impact how we work with data.
  2. Creating a simple and efficient data model is key for startups, but as they grow, it's crucial to adapt and scale the data model to meet new demands.
  3. Learning SQL remains essential, as it is still a fundamental tool in data manipulation, making it important for anyone in the data field to master.
0 implied HN points 23 Jan 24
  1. Apple uses special databases like Cassandra and FoundationDB to manage iCloud's huge storage system. This helps them keep track of billions of databases effectively.
  2. Uber created a feature store called Palette that helps in managing data for machine learning projects. It collects and organizes useful features for easy access by developers.
  3. Data modeling is a key concept that defines how data is organized and related in a system. Different experts might have varying definitions, showing the complexity of the topic.
0 implied HN points 26 Dec 23
  1. Meta created a strong infrastructure for Threads to handle massive user growth right after its launch. This enabled over 100 million sign-ups in just five days.
  2. Notion's data infrastructure had to evolve to keep up with its rapid growth and new product uses. This involved significant changes to manage their increasing data scale.
  3. The 'Grokking Concurrency' book is a helpful resource for learning about concurrent programming. It makes complex topics easier to understand with clear examples.
0 implied HN points 21 Nov 23
  1. Netflix's Psyberg is an incremental data processing framework that helps the membership team keep data accurate and up to date. It processes only what changed, making pipelines more efficient.
  2. The Parquet format is great for storing data because it organizes information in a smart way. It can improve how quickly and easily data is accessed and processed.
  3. SQL isn't the best tool for doing analytics because it was designed a long time ago. There are newer tools that fit analytics needs much better.
0 implied HN points 15 Sep 23
  1. The Lakehouse concept combines the best features of data lakes and data warehouses. It's a new way to manage and analyze data effectively.
  2. Good data quality is essential for making AI work. If the data is bad, the results will also be poor.
  3. AI tools might help data teams work more efficiently, but they won't reduce the demand for data professionals. In fact, they might increase it.
0 implied HN points 22 Sep 23
  1. Docker commands can be simplified with a cheat sheet, making it easier for developers to use container technologies effectively.
  2. Apache Spark was created at UC Berkeley to improve cluster computing, keeping data in memory so iterative and interactive workloads run much faster than on earlier systems like Hadoop MapReduce.
  3. There are key differences between HDFS and S3, especially in how they handle data, and many people confuse them even though they serve different purposes.
0 implied HN points 10 Oct 23
  1. Polars and Pandas are tools for data processing, but they have different performance levels. Understanding when to use each can help manage large datasets better.
  2. Data quality is crucial for successful data engineering. Companies like Google and Uber have strategies in place to ensure their data is accurate and reliable.
  3. Learning SQL execution order can really help in data tasks. It outlines the steps SQL takes to process a query, which is key for optimizing database interactions.
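The execution order in item 3 is the logical sequence FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT, which differs from the order clauses are written. A self-contained sketch using Python's built-in `sqlite3` (table and data are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("a", 10), ("a", 20), ("b", 5)])

# Logical evaluation: FROM orders -> WHERE filters rows ->
# GROUP BY buckets them -> HAVING filters groups ->
# SELECT projects -> ORDER BY sorts -> LIMIT truncates.
rows = con.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 0
    GROUP BY customer
    HAVING SUM(amount) > 10
    ORDER BY total DESC
    LIMIT 1
""").fetchall()
print(rows)  # → [('a', 30)]
```

Knowing this order explains, for example, why a column alias defined in SELECT can't be referenced in WHERE: WHERE runs first.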
0 implied HN points 06 Nov 23
  1. The Parquet file format is becoming popular for data storage because it is efficient and works well with big data tools. Understanding how to use it can help data engineers be more effective.
  2. Data engineering is evolving, and new trends like data mesh are changing how data platforms are built. Keeping up with these changes is important for anyone in the field.
  3. Starting a small data engineering project can be a great way to learn new skills. Even a quick project can teach you important techniques, like web scraping and using cloud storage.
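A starter project like the one item 3 describes usually has three stages: fetch, extract, load. A stdlib-only sketch of that shape (the URL, tag choice, and field names are placeholders, and real scraping should use a proper HTML parser):

```python
import csv
import io
import urllib.request

def fetch(url: str) -> str:
    """Download raw HTML (swap in `requests` if you prefer)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")

def extract_titles(html: str) -> list:
    """Naive extraction: pull the text between <h2> tags."""
    titles = []
    for chunk in html.split("<h2>")[1:]:
        titles.append(chunk.split("</h2>")[0].strip())
    return titles

def to_csv(titles: list) -> str:
    """Serialize results; in a real project, upload this to S3/GCS."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["title"])
    writer.writerows([t] for t in titles)
    return buf.getvalue()

# Offline demo with canned HTML:
html = "<h2>Post one</h2><p>...</p><h2>Post two</h2>"
print(extract_titles(html))  # → ['Post one', 'Post two']
```

Even at this size, the project exercises the core pipeline pattern: an extraction step you can test offline, separated from the I/O at either end.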