The hottest Data Engineering Substack posts right now

And their main takeaways

Synthetic Data In A Nutshell

Three Data Point Thursday • 39 implied HN points • 11 Jan 24

🕹 Technology Data Engineering

Synthetic data is fake data that is becoming increasingly practical and valuable.
Generative AI and the growing gap between data demand and availability are driving forces for the usefulness of synthetic data.
Synthetic data is beneficial in various fields beyond just machine learning, offering opportunities for innovation and improvement.

How I use Gen AI as a Data Engineer

The Orchestra Data Leadership Newsletter • 19 implied HN points • 05 Apr 24

🕹 Technology Data Engineering

Generative AI can help Data Engineers summarize vast quantities of structured and unstructured data, expanding the breadth and depth of data available.
Feature engineering using Generative AI involves ingesting unstructured data like call notes, making API calls, and transforming the data for analysis in existing pipelines.
Using Generative AI for web scraping can help teams extract information efficiently from the internet, enabling monitoring of new data sources and optimizing business processes.

Common Techniques For Periodically Extracting Data From Relational Databases

Practical Data Engineering Substack • 59 implied HN points • 18 Sep 23

🕹 Technology Data Engineering

Data extraction from relational databases is important for building data pipelines. It's key to choose the right method based on factors like the type and size of the data.
There are several techniques to detect changes in the data, including using timestamps and database triggers. These help identify what new or changed records need to be extracted.
Different data extraction patterns exist, like periodic full reloads or incremental updates. Each method has its pros and cons, and the choice depends on the specific data needs and conditions.
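
The timestamp technique mentioned above can be sketched as follows. This is a minimal illustration, not the post's own code: the `customers` table, its `updated_at` column, and the `extract_changed_rows` helper are all assumptions made for the example, and SQLite stands in for whatever relational database is actually used.

```python
import sqlite3


def extract_changed_rows(conn, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark only when at least one new row was seen.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


# Hypothetical source table; ISO-8601 strings compare in chronological order.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Ada", "2024-01-01T09:00:00"),
        (2, "Grace", "2024-01-02T10:30:00"),
        (3, "Edsger", "2024-01-03T08:15:00"),
    ],
)

# Only rows changed since the previous run are extracted.
changed, watermark = extract_changed_rows(conn, "2024-01-01T12:00:00")
print(changed)
print(watermark)
```

Storing the returned watermark between runs is what makes the extraction incremental. A timestamp column on its own does not reveal hard deletes, which is one reason a periodic full reload can still be needed.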

How to build a dual Incremental + snapshot data ingestion pipeline

Practical Data Engineering Substack • 59 implied HN points • 01 Oct 23

🕹 Technology Data Engineering

You can improve data accuracy by using two pipelines: one for getting recent updates quickly and another for regularly loading the entire dataset. This helps in keeping the data reliable over time.
It's essential to manage pipeline scheduling based on your business's needs, like how often you need updates. You can choose faster updates or less frequent full reloads depending on how critical the data is.
Using tools like Apache Airflow can help organize these pipelines efficiently. You can simplify tasks by dynamically generating them from a list, making it easier to handle many data tables.
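
The idea of generating tasks from a list of tables can be sketched without any orchestrator. The snippet below is plain Python and deliberately does not use the Airflow API; `TaskSpec`, `build_tasks`, the table names, and the cron schedules are all hypothetical. In a real deployment each spec would be turned into a task in whichever scheduler is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    table: str
    mode: str      # "incremental" or "snapshot"
    schedule: str  # cron expression


# Hypothetical list of source tables to ingest.
TABLES = ["orders", "customers", "payments"]


def build_tasks(tables, incremental_cron="*/15 * * * *", snapshot_cron="0 2 * * 0"):
    """Create one incremental and one snapshot task per table."""
    tasks = []
    for table in tables:
        tasks.append(
            TaskSpec(f"incremental_{table}", table, "incremental", incremental_cron)
        )
        tasks.append(
            TaskSpec(f"snapshot_{table}", table, "snapshot", snapshot_cron)
        )
    return tasks


tasks = build_tasks(TABLES)
for task in tasks:
    print(task.task_id, task.schedule)
```

Adding a table to the list is then the only change needed to get both pipelines for it. The two cron defaults (every 15 minutes, and 02:00 on Sundays) are placeholders for whatever freshness the business actually requires.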

Today’s Top 5 HN posts

The History and Evolution of Open Table Formats - Part I

Practical Data Engineering Substack • 2 HN points • 15 Aug 24

🕹 Technology Data Engineering

Open Table Formats have changed how we store and manage data, making it easier to work with different systems and tools without being locked into one software.
The transition from traditional databases to open table formats has increased flexibility and allowed for better collaboration across various platforms, especially in data lakes.
Despite their advantages, old formats like Hive still face issues like slow performance and over-partitioning, which can make data management challenging as companies grow.

GroupBy #27: Balancing HDFS DataNodes in the Uber DataLake, How Figma’s databases team lived to tell the scale

VuTrinh. • 19 implied HN points • 19 Mar 24

🕹 Technology Data Engineering

Balancing your data infrastructure is key for efficiency and reliability. Companies like Uber face challenges in maintaining this balance as they scale up their data needs.
Figma's database team has successfully handled a massive growth in data since 2020, showing that scaling can lead to new technical challenges but also growth opportunities.
Optimizing data pipelines can save significant costs. Techniques to reduce data shuffling in processes like Apache Spark can help make data handling more efficient.

GroupBy #12: AWS re:Invent 2023, Druid and ClickHouse at Lyft, Apache Hudi History

VuTrinh. • 39 implied HN points • 05 Dec 23

🕹 Technology Data Engineering

AWS re:Invent 2023 announced new features focused on improving data storage and processing. This includes faster storage options and AI capabilities for better data insights.
Lyft switched from using Druid to ClickHouse for their analytics needs. This change was driven by a need for faster data query responses.
Apache Hudi was created to help manage data in a more efficient way. It enables incremental data processing, making it easier to work with large amounts of information.

SDF

davidj.substack • 59 implied HN points • 12 Feb 25

🕹 Technology Data Engineering

SDF and SQLMesh are alternatives to dbt for data transformation. They are both built with modern tech and aim to provide better ease of use and performance.
SDF has a built-in local database, allowing developers to test queries without costs from a cloud data warehouse. This can speed up development and reduce costs.
Both tools offer column-level lineage to track changes, but SQLMesh provides a better workflow for managing breaking changes. SQLMesh also has unique features like Virtual Data Environments that enhance developer experience.

AI Observability, Orchestration, Consolidation

Gradient Flow • 179 implied HN points • 26 May 22

🕹 Technology Data Engineering

Companies are likely to use at most two platforms for managing the entire machine learning pipeline: one for exploration and another for deployment and operations.
Prefect 2.0 is a popular framework for data and workflow orchestration, emphasizing 'code as workflows' to address data engineering challenges.
The survey on workflow orchestration tools revealed a growing interest in these systems, with startups raising over $450 million in funding for orchestration solutions.

Using AI to build a robust testing framework

Inside Data by Mikkel Dengsøe • 24 implied HN points • 11 Jul 25

🕹 Technology Data Engineering

It's important to establish a solid testing strategy for data models. Focus on verifying what can be objectively checked, keeping tests clear and manageable.
Testing should prioritize sources and the transformations that impact data the most. Don't repeat tests for unchanged fields; it's better to test only what really matters.
For final metrics, shift the focus from basic checks to business-specific assumptions. Use adaptive monitors for outliers instead of hard-coded limits to ensure flexibility.
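
One way to read "adaptive monitors instead of hard-coded limits" is to derive the bounds from recent history. The sketch below uses the median and the median absolute deviation (MAD), which is a choice made for this example rather than the method described in the post; `adaptive_bounds`, `is_outlier`, and the sample values are hypothetical.

```python
from statistics import median


def adaptive_bounds(history, k=3.0):
    """Derive lower and upper limits from recent values."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    spread = 1.4826 * mad
    return med - k * spread, med + k * spread


def is_outlier(value, history, k=3.0):
    """Flag a value that falls outside the bounds implied by history."""
    low, high = adaptive_bounds(history, k)
    return value < low or value > high


# Hypothetical daily order counts for the past week.
daily_orders = [102, 98, 105, 99, 101, 97, 103]

print(adaptive_bounds(daily_orders))
print(is_outlier(104, daily_orders))
print(is_outlier(150, daily_orders))
```

Because the limits move with the data, the monitor does not need to be re-tuned when volumes grow. If the history is constant the MAD is zero and every different value is flagged, so a minimum spread may be worth adding in practice.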

GroupBy #25: From Samza to Flink: A Decade of Stream Processing, DoorDash’s In-House Search Engine, Meta's DotSlash, Designing Metrics Trees

VuTrinh. • 19 implied HN points • 05 Mar 24

🕹 Technology Data Engineering

Stream processing has evolved significantly over the years, with frameworks like Samza and Flink leading the way in handling real-time data streams.
DoorDash developed its own search engine using Apache Lucene, achieving impressive performance improvements, like reduced latency and lower hardware costs.
Understanding metrics trees is essential for businesses as they visually represent how different inputs contribute to outputs, helping in decision-making.

sqlmesh init -t dlt --dlt-pipeline bluesky duckdb

davidj.substack • 71 implied HN points • 05 Dec 24

🕹 Technology Data Engineering

Using dlt to work with Bluesky API allows for easy data extraction. It saves time by handling metadata and schema changes automatically.
dlt simplifies dealing with nested data by creating separate tables. This makes it easier to manage complex data structures.
sqlmesh can quickly generate SQL models based on dlt pipelines. This feature streamlines the workflow and reduces manual setup time.
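
The idea of turning nested data into separate tables can be illustrated in plain Python. This is a conceptual sketch only and is not dlt's implementation: the `split_nested` helper, the `parent__field` table naming, and the `_parent_id` and `_index` columns are assumptions chosen for the example.

```python
def split_nested(records, parent_table, key="id"):
    """Move list-valued fields into child tables linked by the parent key."""
    tables = {parent_table: []}
    for record in records:
        parent_row = {}
        for field, value in record.items():
            if isinstance(value, list):
                child = tables.setdefault(f"{parent_table}__{field}", [])
                for index, item in enumerate(value):
                    # Scalars get a single "value" column; dicts keep their keys.
                    row = dict(item) if isinstance(item, dict) else {"value": item}
                    row["_parent_id"] = record[key]
                    row["_index"] = index
                    child.append(row)
            else:
                parent_row[field] = value
        tables[parent_table].append(parent_row)
    return tables


# Hypothetical records with a nested list field.
posts = [
    {"id": 1, "text": "hello", "tags": ["data", "python"]},
    {"id": 2, "text": "world", "tags": []},
]

tables = split_nested(posts, "posts")
for name, rows in tables.items():
    print(name, rows)
```

Each child row keeps a reference to its parent, so the original structure can be rebuilt with a join while every table stays flat.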

dlt windsurfing

davidj.substack • 71 implied HN points • 04 Dec 24

🕹 Technology Data Engineering

dlt is a Python tool that helps organize messy data into clear, structured datasets. It's easy to use and can quickly load data from many sources.
Using AI tools like Windsurf can make coding feel more collaborative. They help you find solutions faster and reduce the burden of coding from scratch.
Storing data in formats like parquet can make processing much quicker. Simplifying your data handling can save you a lot of time and resources.

GroupBy #7: The rise of data engineer, levels of abstractions, data modeling

VuTrinh. • 39 implied HN points • 31 Oct 23

🕹 Technology Data Engineering

Data engineers are becoming more important in the tech world as they handle vast amounts of data. Their role is focused on building systems that allow for efficient data handling and analysis.
Levels of abstraction in data engineering can be confusing, leading to challenges in understanding systems. It’s important to find a balance between using abstractions and being able to see the underlying processes.
Good data modeling practices can help organizations make better use of their time-series data. Understanding how to structure data effectively is key to unlocking its value.

GroupBy #23: Meta loves Python, How Uber Serves Over 40 Million Reads Per Second from Online Storage Using an Integrated Cache

VuTrinh. • 19 implied HN points • 20 Feb 24

🕹 Technology Data Engineering

Meta is heavily invested in Python, and they're working on improvements to enhance its performance and usability.
Uber has developed a powerful database called Docstore that can handle over 40 million reads per second, demonstrating their capability in data management.
Data, while useful, doesn't capture the complete reality, and it's important to recognize its limitations in understanding complex scenarios.

Google open-sources Vizier

MLOps Newsletter • 39 implied HN points • 20 Feb 23

🕹 Technology Data Engineering

Google open-sourced their blackbox optimization library named Vizier for reliable tuning and optimization.
Pinterest introduced Lightweight Ranking to recommend Pins with better relevance and build scalable ML models.
Netflix uses ML to predict Out of Memory issues in production, overcoming data engineering challenges like structuring data.

The Joe Reis Show w/ Zach Wilson

Joe Reis • 39 implied HN points • 15 Mar 23

🎙 Podcasts Data Engineering

Zach Wilson shares insights on data engineering and mental health challenges.
The podcast covers Zach Wilson's journey from Airbnb data engineer to entrepreneur.
The conversation also touches on ADHD and friendship between Joe Reis and Zach Wilson.

Series Kickoff — DDIA Book Review x Event Sourcing, Streaming and Modeling

🔮 Crafting Tech Teams • 39 implied HN points • 11 Jul 23

🕹 Technology Data Engineering

The post discusses a series on learnings from the book 'Designing Data-Intensive Applications' by Martin Kleppmann.
Topics covered include event sourcing, data engineering, event modeling, data flow, and complexity.
The author is preparing a new series to share insights and highlights from the mentioned book.

I made 1+1=0 in DuckDB

VuTrinh. • 19 implied HN points • 03 Feb 24

🕹 Technology Data Engineering

DuckDB is easy to use because it works like SQLite, running directly inside applications without needing a separate server. This makes it simpler to manage.
It processes data in batches through vectorization, which means it can handle multiple records at once, making operations faster than traditional row-by-row processing.
DuckDB supports ACID transactions, ensuring that data remains safe and reliable, which is important in data analytics and shared environments.

Catalog of Catalogs

davidj.substack • 59 implied HN points • 14 Nov 24

🕹 Technology Data Engineering

Data tools create metadata, which is important for understanding what's happening in data management. Every tool involved in data processing generates information about itself, making it a catalog.
Not all catalogs are for people. Some are meant for systems to optimize data processing and querying. These system catalogs help improve efficiency behind the scenes.
To make data more accessible, catalogs should be integrated into the tools users already work with. This way, data engineers and analysts can easily find the information they need without getting overwhelmed by unnecessary data.

GroupBy #18: Uber - GC Tuning for Improved Presto Reliability, How Meta is advancing GenAI

VuTrinh. • 19 implied HN points • 16 Jan 24

🕹 Technology Data Engineering

Uber improved its Presto reliability by tuning garbage collection. This helps the system run better and more dependably.
Meta is making strides in generative AI, focusing on how it can bring new advancements. The future looks promising for AI technologies.
Python 3.13 adds an experimental Just-In-Time (JIT) compiler, which could speed up program execution. This is a beneficial development for Python users.

GroupBy #17: Pinterest’s new wide column database using RocksDB, Fault tolerance Kafka on Kubernetes at Grab

VuTrinh. • 19 implied HN points • 09 Jan 24

🕹 Technology Data Engineering

Pinterest has developed a new wide column database using RocksDB for better data handling. This helps them manage large amounts of data more efficiently.
Grab improved Kafka's fault tolerance on Kubernetes, ensuring their real-time data streaming service runs smoothly even when problems occur.
The newsletter will evolve, offering more content types like curated resources on data engineering and personal insights every week.

sqlmesh migrate

davidj.substack • 47 implied HN points • 20 Dec 24

🕹 Technology Data Engineering

If you're using dbt to run analytics, switching to sqlmesh is a good idea. It offers more features and is easy to learn while still being compatible with dbt.
sqlmesh helps manage data environments and is more comprehensive in handling analytics tasks compared to dbt. It's simpler to transition from dbt to sqlmesh than from older methods like stored procedures.
When using sqlmesh, think about where to run it and how to store its state. You have choices like using a different database or a cloud service, which can save you money and hassle.

Referral program and things you can expect from this newsletter

VuTrinh. • 19 implied HN points • 04 Jan 24

🕹 Technology Data Engineering

There's a referral program where you can refer friends to subscribe and earn gifts as rewards.
You can expect two main types of emails: one that curates valuable data engineering resources and another that shares insights I've learned from others.
You have control over how many emails you receive, so you can choose to get only the ones you want.

GroupBy #16: Uber's Anomaly Detection & Alerting System, many layers of data lineage

VuTrinh. • 19 implied HN points • 02 Jan 24

🕹 Technology Data Engineering

Uber has developed an anomaly detection system called uVitals, which helps identify issues before they become major problems. It analyzes data patterns to catch anomalies early.
Data modeling is essential for creating structured databases that allow for better analysis and comparisons. It's important for data projects to have clear designs.
As the field of data engineering evolves, new roadmaps and resources are emerging to guide professionals in developing necessary skills. Staying updated can help engineers advance their careers.

Good data engineers are lazy

Data People Etc. • 213 implied HN points • 30 Mar 23

🕹 Technology Data Engineering

Good data engineers strive for automation to be as lazy as possible.
Reevaluate the necessity of tools in the data stack and aim for streamlined, efficient systems.
A declarative paradigm and sound architecture design are crucial for data engineers to optimize efficiency and consistency.

The Modern Data Stack is Dead… Long Live the Modern Data Stack - Part 1

davidj.substack • 167 implied HN points • 19 Jul 23

🕹 Technology Data Engineering

The Modern Data Stack (MDS) community has grown significantly over the years with various meetups and events.
Using tools like Snowflake, dbt, and Looker in the Modern Data Stack improves data capabilities and productivity.
Although some criticize the Modern Data Stack and its imperfections, it has greatly enhanced data handling and analytics for many organizations.

GroupBy #14: What it takes to be a Senior IC at Meta, Netflix Data Engineering Summit

VuTrinh. • 19 implied HN points • 19 Dec 23

🕹 Technology Data Engineering

To be a Senior Individual Contributor at Meta, focus on quickly adding value and aligning with the organization's goals. It's about making an impact and building good relationships within the team.
Data modeling involves creating a shared understanding between business and data teams. It's essential for delivering valuable insights and ensuring everyone is on the same page.
Job hopping in data engineering can be successful with the right approach. Make sure to deliver value early on and always be ready for new opportunities while enjoying your work-life balance.

The Data Hierarchy of Needs

The Orchestra Data Leadership Newsletter • 19 implied HN points • 26 Nov 23

🕹 Technology Data Engineering

Data can be structured in a hierarchy similar to Maslow's Hierarchy of Needs, where each level is necessary for the enjoyment of the level above it. This concept applies to data engineering pipelines.
Data pipelines are crucial for deriving business value, even if they are complex and not directly visible. Architectural considerations and infrastructure choices play a significant role in making data a priority in a business.
When considering data infrastructure, such as data ingestion tools, cloud warehouses, BI tools, and others, it's important to plan the entire stack and not just jump to specific infrastructure. Consider aspects like version control, security, integration, and orchestration.

The hottest SQL tools you have no use for

The Orchestra Data Leadership Newsletter • 19 implied HN points • 16 Nov 23

🕹 Technology Data Engineering

SQL is a powerful data manipulation tool that has different dialects and evolved over time to fit various database software needs.
New SQL tools like dbt, SQLMesh, and Semantic Data Fabric aim to improve data testing, quality, and governance in data engineering processes.
The value in data engineering lies more in processes, culture, and diligence, rather than solely relying on fancy tools to prevent mistakes.

Should Data Teams care about Data Contracts?

The Orchestra Data Leadership Newsletter • 19 implied HN points • 05 Nov 23

🕹 Technology Data Engineering

Consider data contracts if your internal data changes often to ensure collaboration between software engineering and data engineering teams.
If you have important metrics that depend on software engineering actions, like defining 'Active Users,' data contracts can help maintain data quality.
In cases where software engineering and data engineering roles overlap, implementing data contracts can streamline data ingestion processes and improve data quality.
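
At its simplest, a data contract is an agreed schema that incoming records are checked against. The sketch below assumes a contract expressed as a mapping of field names to Python types; the `CONTRACT` fields and the `contract_violations` helper are hypothetical and do not correspond to any specific contract tool.

```python
# Hypothetical contract agreed between the producing and consuming teams.
CONTRACT = {"user_id": int, "event": str, "occurred_at": str}


def contract_violations(record, contract):
    """List every way a record departs from the contract."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


good = {"user_id": 42, "event": "login", "occurred_at": "2023-11-05T10:00:00"}
bad = {"user_id": "42", "event": "login"}

print(contract_violations(good, CONTRACT))
print(contract_violations(bad, CONTRACT))
```

Running a check like this at ingestion means a breaking upstream change surfaces as an explicit violation rather than as a silently wrong 'Active Users' figure.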

Why Snowflake’s Clone command changes the game for CI/CD in Data

The Orchestra Data Leadership Newsletter • 19 implied HN points • 26 Oct 23

🕹 Technology Data Engineering

The Snowflake Clone command allows for cheap and quick testing of data during Continuous Integration flows, showing significant cost and time improvement compared to traditional create table commands.
Continuous Deployment can be facilitated through Snowflake clones, as they are relatively inexpensive, removing cost barriers for Data Teams in implementing effective CD processes.
The Clone command, available since at least 2016, is invaluable for Data Teams as it enables CI/CD pipelines, essential for deploying data into production reliably and efficiently via a data release pipeline.

GroupBy #6: Meta developer tool, trillions of data points at Discord and Uber's data cycle management

VuTrinh. • 19 implied HN points • 24 Oct 23

🕹 Technology Data Engineering

Meta has introduced developer tools that help manage large-scale projects efficiently. These tools assist engineers in solving problems and improving systems.
Big companies like Discord and Uber are using massive data points to create valuable insights. This helps them to effectively manage their data and understand trends better.
Data engineering continues to evolve, with tools like BigQuery and dbt Mesh enhancing data practices. Staying updated with these tools can improve data analysis and management.

CI / CD for Data Engineering (Python)

The Orchestra Data Leadership Newsletter • 19 implied HN points • 22 Oct 23

🕹 Technology Data Engineering

Understanding basic CI / CD for Python code in a Data Engineering context is crucial for Data Engineering Leaders.
For unit tests, use pytest to ensure functions work correctly, and for integration tests, test connections to third-party APIs.
Implementing CI / CD involves writing code, testing and linting locally, and then deploying to a merge environment to ensure code compatibility.
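
A unit test in the pytest style is a function whose name starts with `test_` and which uses plain `assert` statements. The example below is a minimal sketch: `normalise_email` is a hypothetical transformation function invented for illustration, not code from the post.

```python
def normalise_email(raw):
    """Trim surrounding whitespace and lower-case an email address."""
    return raw.strip().lower()


def test_normalise_email_strips_and_lowercases():
    assert normalise_email("  Ada@Example.COM ") == "ada@example.com"


def test_normalise_email_is_idempotent():
    # Applying the function twice should give the same result as once.
    once = normalise_email("Bob@Example.com")
    assert normalise_email(once) == once
```

Saved in a file such as `test_emails.py`, these functions are collected and run by the `pytest` command, which makes them easy to execute locally and again in the CI pipeline.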

GroupBy #5: The story of S3, Kafka at scale and the boring is back

VuTrinh. • 19 implied HN points • 17 Oct 23

🕹 Technology Data Engineering

S3 is a big storage system used for data, and understanding how it's built can help improve data handling. It's cool to know how tech like this works.
Running Kafka at scale is interesting, especially for companies like Pinterest. It shows how important reliable data flow is in tech.
There's a trend of making things simpler and more efficient in engineering. Sometimes, going back to basics can solve complex problems.

💡On-Demand Webinar: Designing & Scaling FanDuel's Machine Learning Platform

TheSequence • 77 implied HN points • 26 Jan 24

🕹 Technology Data Engineering

FanDuel designed a powerful ML platform to deliver personalized experiences to users.
Technology choices and frameworks are crucial in building an effective ML platform.
Managing data backfills and orchestrating the process is important when features change.

Why dbt Labs acquired Transform

Three Data Point Thursday • 19 implied HN points • 20 Apr 23

🕹 Technology Data Engineering

dbt Labs acquired Transform to target a new market segment beyond analytics engineers.
Tech companies typically expand by starting small, then broadening their market focus and adding features.
Data is not the same as analytics; a top-down approach to making data vital in a company is crucial.

Gradient Flow #46: Smarter Language Models; Data Engineering Trends

Gradient Flow • 99 implied HN points • 04 Nov 21

🕹 Technology Data Engineering

Data scientists should become social scientists as well as computer scientists.
The report presents insights from a global online survey of 372 respondents on data engineering trends and challenges.
Information on improvements in large language models, modernizing data integration, and the importance of data quality is shared in the podcast.

GroupBy #1

VuTrinh. • 19 implied HN points • 08 Sep 23

🕹 Technology Data Engineering

Kappa architecture simplifies data processing by combining batch and stream processing. This makes handling data more efficient compared to the traditional Lambda architecture.
Presto is a powerful tool for querying large datasets, and Meta has valuable insights on using it effectively. Learning from their experience can help other teams improve their data operations.
Data quality is crucial in analytics, and there are specific metrics to help measure it. Keeping track of these can prevent problems that arise from poor data.
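
The summary does not say which data quality metrics the newsletter covers, so the sketch below picks two common ones, completeness and uniqueness, as assumptions. The helper functions and sample rows are hypothetical.

```python
def completeness(rows, field):
    """Share of rows in which the field is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) is not None)
    return filled / len(rows)


def uniqueness(rows, field):
    """Share of non-null values in the field that are distinct."""
    values = [row[field] for row in rows if row.get(field) is not None]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical records with one missing and one duplicated email.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "a@example.com"},
    {"id": 4, "email": "b@example.com"},
]

print(completeness(rows, "email"))
print(uniqueness(rows, "email"))
print(uniqueness(rows, "id"))
```

Tracking ratios like these over time turns "poor data" into a number that can be alerted on before it reaches a dashboard.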
