The hottest Data Engineering Substack posts right now

And their main takeaways

Data Science Weekly - Issue 506

Data Science Weekly Newsletter • 399 implied HN points • 04 Aug 23

🕹 Technology Data Engineering

Integrating large language models into systems can be done using seven key patterns that balance performance and cost.
Ethics in AI isn't just about explainability and fairness; we need a deeper understanding to prevent overall harm from AI systems.
New approaches in robotics focus on current challenges and opportunities while advancing understanding of AI's role in planning tasks.

A glimpse of Apache Pinot, the real-time OLAP system from LinkedIn

VuTrinh. • 99 implied HN points • 30 Mar 24

🕹 Technology Data Engineering

Apache Pinot is a real-time OLAP system developed by LinkedIn that allows for fast analytics on large sets of data. It can handle tens of thousands of analytical queries per second while providing near-instant results.
The architecture is divided into key components like controllers, brokers, and servers which work together to process queries and manage data efficiently. Pinot is designed to quickly ingest and query fresh data from various sources, ensuring low latency.
Pinot supports various indexing strategies, like star-tree indexes, to optimize complex queries. This enables faster query responses by pre-aggregating data, making it easier to analyze large volumes of information.
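
The pre-aggregation idea behind that last point can be sketched in a few lines of Python. This is only an illustration of answering filtered sums from pre-computed totals, not Pinot's actual star-tree implementation; the sample rows, the `build_preaggregates` helper, and the `query_sum` function are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical raw rows: (country, browser, impressions).
ROWS = [
    ("US", "chrome", 10),
    ("US", "firefox", 5),
    ("DE", "chrome", 7),
    ("DE", "chrome", 3),
]


def build_preaggregates(rows):
    """Pre-compute SUM(impressions) for every subset of the two dimensions.

    A "*" stands for "any value", loosely mirroring the star node idea.
    """
    totals = defaultdict(int)
    for country, browser, impressions in rows:
        values = (country, browser)
        # For each subset of dimensions we keep, star out the others.
        for keep in range(len(values) + 1):
            for kept_idx in combinations(range(len(values)), keep):
                key = tuple(
                    v if i in kept_idx else "*" for i, v in enumerate(values)
                )
                totals[key] += impressions
    return dict(totals)


PREAGG = build_preaggregates(ROWS)


def query_sum(country="*", browser="*"):
    """Answer a filtered SUM with one dictionary lookup instead of a scan."""
    return PREAGG.get((country, browser), 0)
```

Here `query_sum(country="DE")` reads a stored total rather than scanning the rows. The trade-off is extra storage and ingestion work in exchange for cheaper queries.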

I spent 3 hours figuring out how BigQuery inserts, deletes and updates data internally. Here's what I found.

VuTrinh. • 139 implied HN points • 17 Feb 24

🕹 Technology Data Engineering

BigQuery manages data using immutable files, meaning once data is written, it cannot be changed. This helps in processing data efficiently and maintains data consistency.
When you perform actions like insert, delete, or update, BigQuery creates new files instead of changing existing ones. This approach helps in features like time travel, which lets you view past states of data.
BigQuery uses a system called storage sets to handle operations. These sets help ensure processes are performed atomically and consistently, maintaining data integrity during changes.
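
A toy model may help make the immutable-file idea concrete. The sketch below is an assumption-laden illustration, not BigQuery's storage-set implementation: `DataFile`, `Table`, and the snapshot list are hypothetical names, and each change simply publishes a new list of files.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    """An immutable file: its rows never change after creation."""
    rows: tuple


@dataclass
class Table:
    # Each snapshot is the tuple of files visible after one committed change.
    snapshots: list = field(default_factory=lambda: [()])

    def _commit(self, files):
        # Publishing a whole new file list in one step keeps the change atomic.
        self.snapshots.append(tuple(files))

    def insert(self, rows):
        # New rows go into a new file; existing files are left untouched.
        self._commit(self.snapshots[-1] + (DataFile(tuple(rows)),))

    def delete(self, predicate):
        new_files = []
        for f in self.snapshots[-1]:
            kept = tuple(r for r in f.rows if not predicate(r))
            if kept == f.rows:
                new_files.append(f)               # unaffected file is reused
            elif kept:
                new_files.append(DataFile(kept))  # rewritten as a new file
        self._commit(new_files)

    def read(self, version=-1):
        """Read the latest state, or an older snapshot ("time travel")."""
        return [r for f in self.snapshots[version] for r in f.rows]
```

After `t = Table()`, `t.insert([1, 2, 3])`, `t.insert([4, 5])`, and `t.delete(lambda r: r % 2 == 0)`, calling `t.read()` returns the current rows while `t.read(version=2)` still returns the rows as they were before the delete, because the old files were never modified.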

Data Science Weekly - Issue 516

Data Science Weekly Newsletter • 299 implied HN points • 13 Oct 23

🕹 Technology Data Engineering

The newsletter is considering publishing twice a week, but will stick to one issue a week for now while it reviews reader feedback.
There's a focus on providing useful resources for data science, including articles and job opportunities in the field.
New tools and methods in AI and data engineering are highlighted, addressing challenges like data integration and AI model training.

GroupBy #35: The Netflix Data Engineering Stack, Atlassian - Evolve the data platform with a Deployment Capability

VuTrinh. • 59 implied HN points • 14 May 24

🕹 Technology Data Engineering

Netflix has a strong data engineering stack that supports both batch and streaming data pipelines. It focuses on building flexible and efficient data architectures.
Atlassian has revamped its data platform to include a new deployment capability inspired by technologies like Kubernetes. This helps streamline their data management processes.
Migrating from dbt Cloud can teach valuable lessons about data development. Companies should explore different options and learn from their migration journeys.

SAI #22: Decomposing the Data System.

SwirlAI Newsletter • 294 implied HN points • 18 Mar 23

🕹 Technology Data Engineering

Learning to decompose a data system is crucial for reasoning about and understanding large infrastructure.
Decomposing a data system helps with scalability, identifying bottlenecks, and optimizing total event-processing latency.
The layers in a data system include data ingestion, transformation, and serving, each with specific functions and technologies.

The Truth about Prefect, Mage, and Airflow.

Data Engineering Central • 294 implied HN points • 10 Apr 23

🕹 Technology Data Engineering

Airflow has been a dominant tool for data orchestration, but new tools like Prefect and Mage are challenging its reign.
Prefect focuses on using Python for defining tasks and workflows, but may not offer enough differentiation from Airflow.
Mage stands out for its focus on engineering best practices and providing a smoother developer experience, making it a compelling choice over Airflow for scaling up data pipelines.

Data Science Weekly - Issue 512

Data Science Weekly Newsletter • 299 implied HN points • 14 Sep 23

🕹 Technology Data Engineering

Nvidia has been a leader in AI technology, but its dominance might not last. Changes in the market and technology could shift the competitive landscape soon.
For those who know R and want to learn Python, there are resources available to help make the transition easier. These resources provide advice and tips catered to R users.
Reinforcement Learning from Human Feedback (RLHF) is an important part of training large language models. It's essential for improving how these models understand and respond to human preferences.

I spent 6 hours understanding the design principles of BigQuery. Here's what I found

VuTrinh. • 159 implied HN points • 20 Jan 24

🕹 Technology Data Engineering

BigQuery brought SQL back to large-scale analysis at Google after the company had moved away from it, making data analysis fast and easy. Users can now analyze huge datasets quickly without complex coding.
It separates storage and compute resources, allowing for better performance and flexibility. This means you can scale them independently, which is very efficient.
Dremel's serverless architecture means you don’t need to manage servers. You just use SQL, and everything else is automatically handled for you.

Growing From Analyst To Data Engineer In 100 Days

SeattleDataGuy’s Newsletter • 1165 implied HN points • 02 Jan 24

🕹 Technology Data Engineering

Breaking into data engineering may be easier through lateral moves, like from data analyst to data engineer.
The 100-day plan discussed is not meant to master data engineering but to help commit to learning and identify areas for improvement.
The plan includes reviewing basics, diving deeper, building a mini project, surveying tools, best practices, and committing to a final project.

A Closer Look Into Databricks's Photon Engine

VuTrinh. • 79 implied HN points • 13 Apr 24

🕹 Technology Data Engineering

The Photon engine uses a columnar data layout to manage memory efficiently, allowing it to process data in batches. This helps speed up data operations.
It supports adaptive execution, which means the engine can change how it processes data based on the input. This can significantly improve performance, especially when data has many NULLs or inactive rows.
Photon integrates with Databricks runtime and Spark SQL, allowing it to enhance existing workloads without completely replacing the old system, making transitions smoother.

GroupBy #34: Hybrid Transactional/Analytical Storage, From Predictive to Generative – How Michelangelo Accelerates Uber’s AI Journey

VuTrinh. • 59 implied HN points • 07 May 24

🕹 Technology Data Engineering

Hybrid transactional/analytical storage combines different types of data processing. This helps companies like Uber manage their data more efficiently.
The shift from predictive to generative AI is changing how companies use machine learning. Uber's Michelangelo platform shows how this new approach can improve AI applications.
Data reliability and observability are important for businesses as their data grows. Companies need tools to quickly find and fix data issues to keep their operations running smoothly.

SAI #19: The Data Value Chain.

SwirlAI Newsletter • 255 implied HN points • 25 Feb 23

🕹 Technology Data Engineering

Understanding the Data Value Chain is essential for building successful Data Products.
Implementing Data Contracts in the Data Pipeline ensures data quality and prevents unexpected outages.
Knowing the 4 types of ML Model Deployment helps in deploying machine learning models effectively.
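
As a rough illustration of the data contract point, a pipeline step can check each record against an agreed schema and quarantine violations instead of letting them break downstream jobs. The contract format, field names, and helper functions below are hypothetical and are not taken from the newsletter.

```python
# Hypothetical contract: field name -> (expected type, required?).
ORDER_CONTRACT = {
    "order_id": (int, True),
    "amount": (float, True),
    "coupon": (str, False),
}


def validate(record, contract):
    """Return a list of contract violations for one record (empty if valid)."""
    errors = []
    for name, (expected_type, required) in contract.items():
        if name not in record or record[name] is None:
            if required:
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(record[name], expected_type):
            errors.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(record[name]).__name__}"
            )
    return errors


def split_batch(records, contract):
    """Route valid records downstream and quarantine the rest."""
    valid, quarantined = [], []
    for record in records:
        (quarantined if validate(record, contract) else valid).append(record)
    return valid, quarantined
```

Real contract tooling usually adds versioning and ownership metadata; this sketch only covers the validation step.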

SAI Notes #01: Watermarks in Stream Processing, SQL Query order of Execution.

SwirlAI Newsletter • 255 implied HN points • 07 May 23

🕹 Technology Data Engineering

Watermarks in Stream Processing help handle event lateness and decide when to treat data as 'late data'.
In SQL Query execution, the order is FROM and JOIN, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT.
To optimize SQL Queries, reduce dataset sizes for joins and use subqueries for pre-filtering.
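
The execution order listed above can be observed with a small experiment. The sketch below uses Python's built-in `sqlite3` module with a made-up `orders` table; the numbered comments follow the logical order from the takeaway, and an actual engine may physically reorder work as long as the result is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 5), ("ann", 60), ("bob", 40), ("bob", 45), ("cat", 100)],
)

QUERY = """
SELECT customer, SUM(amount) AS total   -- 5. SELECT computes output columns
FROM orders                             -- 1. FROM picks the source rows
WHERE amount >= 10                      -- 2. WHERE filters rows before grouping
GROUP BY customer                       -- 3. GROUP BY forms the groups
HAVING SUM(amount) > 60                 -- 4. HAVING filters whole groups
ORDER BY total DESC                     -- 6. ORDER BY sorts the result
LIMIT 2                                 -- 7. LIMIT trims it
"""

result = conn.execute(QUERY).fetchall()
```

Because WHERE runs before GROUP BY, ann's 5 is dropped before her total is computed, so her group sums to 60 and fails the HAVING test. Had the filter been applied after aggregation, her total of 65 would have passed.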

Data Science Weekly - Issue 510

Data Science Weekly Newsletter • 279 implied HN points • 31 Aug 23

🕹 Technology Data Engineering

Autonomous drones can now race at human champion levels using deep reinforcement learning. This shows how advanced technology can mimic skilled human behavior in competitive sports.
Google is rapidly developing its AI capabilities and plans to surpass GPT-4 by a significant margin soon. This could lead to more powerful AI tools for various applications.
Reinforced Self-Training (ReST) is a new method for improving language models by aligning their outputs with human preferences. It offers better translation quality and can be done efficiently with less data.

A detailed guide to running dbt Core in Production in AWS on ECS

The Orchestra Data Leadership Newsletter • 79 implied HN points • 28 Mar 24

🕹 Technology Data Engineering

A detailed guide to running dbt Core in production in AWS on ECS is outlined, focusing on achieving cost-effective and reliable execution.
Running dbt in production is not compute-intensive: dbt mainly orchestrates SQL that executes in the data warehouse, so hosting it costs less than running Python code that does the computation itself.
By setting up dbt Core on ECS in AWS and using Orchestra, you can achieve a scalable, cost-effective solution for self-hosting dbt Core with full visibility and control.

Going Faster is the Greatest UX

Sung’s Substack • 79 implied HN points • 26 Mar 24

🕹 Technology Data Engineering

Civilization advances by extending the number of important operations which we can perform without thinking about them.
In data engineering, the focus on speed is increasing, with the need for tools to actually make users go faster, not just show possibilities.
To improve workflow efficiency, demand that every element be faster, without compromises.

Data Science Weekly - Issue 507

Data Science Weekly Newsletter • 279 implied HN points • 11 Aug 23

🕹 Technology Data Engineering

Large Language Models (LLMs) can take over some data tasks, but they won't replace all data jobs. Many tasks still need human insight and specialized skills.
Understanding machine learning theory takes a long time, but in the industry, practical implementation is often more important. It's crucial to balance theory and hands-on skills.
The new field of mechanistic interpretability is growing. Researchers are looking at how models learn and generalize, aiming to make sense of how AI works.

Data Sharing in the Real World: Why SFTP Remains Essential for Companies

SeattleDataGuy’s Newsletter • 400 implied HN points • 31 Oct 24

🕹 Technology Data Engineering

SFTP, the SSH File Transfer Protocol (often expanded as Secure File Transfer Protocol), is a popular way for companies to send and receive data securely, much like shipping packages in the digital world. Many businesses, even big tech ones, still rely on SFTP instead of newer methods.
Setting up SFTP jobs requires careful planning, especially for user authentication and file encryption. Using SSH keys and methods like PGP encryption helps ensure the data remains safe during transfers.
Although there are more advanced data-sharing technologies emerging, SFTP isn't going away anytime soon. Many companies still rely on SFTP for their data needs, showing its continued importance in the industry.

Organizations are on the verge of losing control of their data forever

The Orchestra Data Leadership Newsletter • 79 implied HN points • 21 Mar 24

🕹 Technology Data Engineering

Organizations risk losing control of their data because they neglect data quality and fail to treat data as a value driver.
Large Language Models (LLMs) can improve data quality control and help in automating tasks effectively with context.
Before implementing LLMs, organizations should prioritize data cleaning, auditing, and defining valuable datasets.

Data Science Weekly - Issue 535

Data Science Weekly Newsletter • 99 implied HN points • 23 Feb 24

🕹 Technology Data Engineering

Scaling AI tools like ChatGPT involves overcoming many engineering challenges to handle large user demands. It's important to manage growth effectively to keep users satisfied.
There's a lot of information out there about generative AI, making it hard to keep up. A guidebook can help condense this information and provide practical insights.
Linear regression is still a valuable tool in data science. Sometimes going back to basics can yield better results than relying on complex models.

How Rust and Python manage memory

VuTrinh. • 119 implied HN points • 27 Jan 24

🕹 Technology Data Engineering

Rust uses ownership to manage memory, meaning each value has a single owner. When that owner goes out of scope, the memory gets freed automatically.
Python (specifically CPython) manages memory mainly through reference counting: it tracks how many references point to each object and frees the object once none remain. A separate garbage collector cleans up reference cycles that counting alone cannot free.
Rust's approach gives developers more control but requires them to understand ownership rules, while Python's method is easier for beginners but can slow down performance.
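
The Python side of this comparison can be observed directly. The sketch below assumes CPython, where reference counting frees objects immediately; other Python implementations may behave differently. The `Node` class is a hypothetical stand-in for any object.

```python
import gc
import weakref


class Node:
    """A small object we can watch being freed."""

    def __init__(self):
        self.other = None


# Reference counting: the object is freed as soon as its last reference goes.
a = Node()
a_alive = weakref.ref(a)   # a weak reference does not keep the object alive
del a
freed_by_refcount = a_alive() is None

# A reference cycle keeps both counts above zero, so counting alone cannot
# free it; CPython's cyclic garbage collector handles this case.
gc.disable()               # stop the collector from running on its own
x, y = Node(), Node()
x.other, y.other = y, x
x_alive = weakref.ref(x)
del x, y
alive_before_gc = x_alive() is not None
gc.collect()
freed_by_gc = x_alive() is None
gc.enable()
```

In Rust the equivalent release happens when the owner goes out of scope and is decided at compile time, which is why no collector is needed at runtime.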

Data Science Weekly - Issue 491

Data Science Weekly Newsletter • 419 implied HN points • 21 Apr 23

🕹 Technology Data Engineering

AI academics are facing challenges keeping up with private sector investments. It's important for them to find survival strategies to remain competitive.
There are ongoing discussions about the rapid progress in machine learning and how it can be overwhelming for developers. Many are sharing thoughts on adapting to this fast-paced change.
Visualizing neural networks properly can help clarify concepts. There is a push for better diagrams to avoid confusion in understanding how these networks function.

I spent another 8 hours understanding the design of Amazon Redshift. Here's what I found.

VuTrinh. • 79 implied HN points • 16 Mar 24

🕹 Technology Data Engineering

Amazon Redshift is designed as a massively parallel processing data warehouse in the cloud, making it effective for handling large data sets efficiently. It changes how data is stored and queried compared to traditional systems.
The system uses a unique compilation service that generates specific code for queries, which helps speed up processing by caching compiled code. This means Redshift can reuse code for similar queries, reducing wait times.
Redshift also uses machine learning techniques to optimize operations, such as predicting resource needs and automatically adjusting performance settings. This allows it to scale effectively and maintain high performance during heavy workloads.
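
The compiled-code caching described above follows a general pattern: key the cache on the shape of the query instead of its literal values. The sketch below is a simplified illustration under that assumption, not Redshift's compilation service; `query_shape`, `get_compiled`, and the stand-in "compiled" function are hypothetical.

```python
import re

_compiled_cache = {}        # query shape -> compiled callable
stats = {"compiles": 0}     # how many times the compilation cost was paid


def query_shape(sql):
    """Replace numeric literals with '?' so similar queries share a shape."""
    return re.sub(r"\b\d+\b", "?", sql.strip().lower())


def get_compiled(sql):
    """Return a compiled plan, compiling only on a cache miss."""
    shape = query_shape(sql)
    if shape not in _compiled_cache:
        stats["compiles"] += 1
        # Stand-in for real code generation: a filter for "value > n".
        _compiled_cache[shape] = lambda rows, n: [r for r in rows if r > n]
    return _compiled_cache[shape]
```

Two queries that differ only in a constant, such as `amount > 10` and `amount > 99`, map to the same shape, so the second one skips compilation and reuses the cached code.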

GroupBy #31: Migrating a Trillion Entries of Uber’s Ledger Data from DynamoDB to LedgerStore, Grab Experiment Decision Engine

VuTrinh. • 59 implied HN points • 16 Apr 24

🕹 Technology Data Engineering

Uber successfully migrated over a trillion entries of its ledger data to a new database called LedgerStore without causing disruptions. This shows how careful planning can make big data moves smooth.
Airbnb has open-sourced a machine learning feature platform called Chronon, which helps manage data and makes it easier for engineers to work with different data sources. This promotes collaboration and innovation in the tech community.
The GrabX Decision Engine boosts experimentation on online platforms by providing tools for better planning and analyzing experiments. This can lead to more informed decisions and improved outcomes in projects.

Unit Testing for Data Engineers.

Data Engineering Central • 216 implied HN points • 13 Feb 23

🕹 Technology Data Engineering

Data engineers often struggle to implement unit tests because of pressure to move fast and a historical lack of emphasis on testing.
Unit testable code in data engineering involves keeping functions small, minimizing side effects, and ensuring reusability.
Implementing unit tests can elevate a data team's performance and lead to better software quality and bug control.
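
A short example shows what small, side-effect-free functions look like in practice. The functions and test data below are hypothetical and use only Python's standard `unittest` module.

```python
import unittest


def normalize_email(raw):
    """Pure function: no I/O, no globals, same output for the same input."""
    return raw.strip().lower()


def dedupe_by_email(records):
    """Keep the first record seen for each normalized email."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_email(record["email"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


class DedupeTests(unittest.TestCase):
    def test_normalize_email_strips_and_lowercases(self):
        self.assertEqual(
            normalize_email("  Ann@Example.COM "), "ann@example.com"
        )

    def test_dedupe_keeps_first_occurrence(self):
        records = [
            {"id": 1, "email": "ann@example.com"},
            {"id": 2, "email": "ANN@example.com "},
            {"id": 3, "email": "bob@example.com"},
        ]
        self.assertEqual([r["id"] for r in dedupe_by_email(records)], [1, 3])
```

Saved to a file, the tests run with `python -m unittest`. Because neither function reads from a database or a file, the tests need no fixtures or mocks.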

AI Web scraping use-cases for Data Teams: Intelligence gathering

The Orchestra Data Leadership Newsletter • 39 implied HN points • 21 May 24

🕹 Technology Data Engineering

Web scraping with AI can enhance intelligence gathering by efficiently collecting and processing data from various public sources on the internet.
Leveraging Large Language Models (LLMs) can improve the accuracy and robustness of web scraping systems when dealing with changes in HTML code structure.
Using tools like Nimble for web scraping allows for more efficient and accurate data collection by training models on different types of websites for specific use cases.

Effective Data Governance Can Only Exist Within Data Orchestration

The Orchestra Data Leadership Newsletter • 99 implied HN points • 07 Feb 24

🕹 Technology Data Engineering

Effective data governance requires incorporating preventive measures within data orchestration layers.
Current data governance tools predominantly offer post-action analytics rather than proactive preventive measures.
By integrating role-based access control and monitoring in the orchestration layer, organizations can shift to a more proactive data governance approach.

Quant Letter: January 2026, Week-3

The Parlour • 8 implied HN points • 16 Jan 26

💰 Finance Data Engineering

Fine-tuning LLaMA-3-8B with instruction tuning and LoRA noticeably improves financial named-entity recognition, helping convert messy reports into structured data.
New work on adaptive dataflow for financial time-series points to better ways to process streaming market data and boost model efficiency or accuracy.
This newsletter curates recent finance ML papers and is available by subscription, with some free previews for readers who want quick research updates.

I spent 5 hours learning how ClickHouse built their internal data warehouse.

VuTrinh. • 1 HN point • 21 Sep 24

🕹 Technology Data Engineering

ClickHouse built its internal data warehouse to better understand customer usage and improve its services. They collected data from multiple sources to gain valuable insights.
They use tools like Airflow for scheduling and Superset for data visualization, making their data processing efficient. This setup allows them to handle large volumes of data daily.
Over time, ClickHouse evolved its system by adding dbt for data transformation and improving user experiences with better SQL query tools. They also incorporated real-time data to enhance their reporting.

Data Science Weekly - Issue 490

Data Science Weekly Newsletter • 379 implied HN points • 13 Apr 23

🕹 Technology Data Engineering

Data science is evolving quickly, and many new tools and techniques are being developed. This opens up exciting job opportunities in various fields like AI and machine learning.
Using programming languages like R and SQL can extend beyond traditional data analysis. They can be powerful tools for creative applications in data science.
Learning and implementing good practices in software development, such as automating tests and improving code efficiency, can save time and resources in data science projects.

BigQuery processing engine: Shuffle

VuTrinh. • 119 implied HN points • 06 Jan 24

🕹 Technology Data Engineering

BigQuery uses a processing engine called Dremel, which takes inspiration from how MapReduce handles data. It improves how data is shuffled between workers for faster processing.
Traditional approaches have issues like resource fragmentation and unpredictable scaling when dealing with huge data. Dremel solves this by managing shuffle storage separately from the worker, which helps in scaling and resource management.
By separating the shuffle layer, Dremel reduces latency, improves fault tolerance, and allows for more flexible worker allocation during execution. This makes it easier to handle larger data sets efficiently.

I spent 7 hours reading another paper to understand more about Snowflake's internal. Here's what I found.

VuTrinh. • 79 implied HN points • 02 Mar 24

🕹 Technology Data Engineering

Snowflake has a unique design with three main layers: storage, virtual warehouses, and cloud services. This structure helps manage data efficiently and ensures high availability.
The system uses a special ephemeral storage for temporary data during queries, which allows for quick access and less strain on the overall system. This helps with performance and reduces network load.
Snowflake is designed for flexibility, allowing it to adapt resources based on customer needs and workloads. This elasticity helps provide better performance and efficiency.

Worldbuilding with data

Data People Etc. • 231 implied HN points • 11 Feb 25

🕹 Technology Data Engineering

Data is more powerful when it has a purpose. It should tell a clear story, otherwise it's just clutter.
Building a strong data system is like creating a world. A good structure connects different pieces and helps everyone understand the bigger picture.
Data engineering is important because it helps manage and present large amounts of information, making sure everything works smoothly and accurately.

GroupBy #29: Scaling AI/ML Infrastructure at Uber, The Sisyphean struggle and the new era of data infrastructure

VuTrinh. • 59 implied HN points • 02 Apr 24

🕹 Technology Data Engineering

Uber is focusing on building strong AI and machine learning infrastructure to keep up with the growing complexity of their models. This involves using both CPUs and GPUs for better efficiency.
Data management is becoming crucial for companies like Netflix as they deal with massive amounts of production data. They are developing tools to effectively manage and optimize this data.
The data streaming landscape is evolving, with new technologies emerging that make handling data easier and more efficient. This is changing how companies approach data infrastructure.

How to grow from a mid-level to senior Data Engineer

SeattleDataGuy’s Newsletter • 694 implied HN points • 14 Feb 24

🕹 Technology Data Engineering

To grow from mid to senior level, it's important to continuously learn and improve, share new knowledge, work on code improvements, and become an expert in a certain domain.
Making the team better is crucial - focus on mentoring, sharing knowledge, and creating a positive team environment. Think beyond individual tasks to impact the overall team outcomes.
Seniority includes building not just technical solutions, but solutions that customers love. Challenge requirements, understand the business and product, and take initiative in problem-solving.

This well-known data company could be reversing the ETL to ELT shift

The Orchestra Data Leadership Newsletter • 79 implied HN points • 25 Feb 24

🕹 Technology Data Engineering

ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) have been key data engineering paradigms, but with the rise of the cloud, the need for in-transit data transformation has decreased.
Fivetran, a widely known data company, is potentially shifting back to ETL methods by offering pre-built transformation features, effectively simplifying the data modeling process for users.
There seems to be a trend towards a possible resurgence of ETL practices in the data industry, with companies like Fivetran potentially leading the way in providing ETL-like services within their platforms.

I spent 4 hours figuring out how BigQuery executes the SQL query internally. Here's what I found.

VuTrinh. • 79 implied HN points • 24 Feb 24

🕹 Technology Data Engineering

BigQuery processes SQL queries by planning, optimizing, and executing them. It starts by validating the query and creating an efficient execution plan.
The query execution uses a dynamic tree structure that adjusts based on data characteristics. This helps to manage different types of queries more effectively.
Key components of BigQuery include the Query Master for planning, the Scheduler for assigning resources, and Worker Shards that carry out the actual computations.

Data Science Weekly - Issue 504

Data Science Weekly Newsletter • 239 implied HN points • 21 Jul 23

🕹 Technology Data Engineering

AI companies are complicated and must consider many factors like research, funding, and competition. Understanding these can help predict how they might evolve in the future.
Debriefs, or team discussions after projects, can greatly boost team performance. They help everyone learn from experiences and improve future collaboration.
New research shows that specific ingredient pairings in food can be explained by flavor networks. This indicates there are universal patterns in how different foods complement each other.

GroupBy #28: Tableflow - The Stream/Table, Kafka/Iceberg Duality, Kafka tiered storage deep dive

VuTrinh. • 59 implied HN points • 26 Mar 24

🕹 Technology Data Engineering

Tableflow allows you to easily turn Apache Kafka topics into Iceberg tables, which could change how streaming data is managed.
Kafka's new tiered storage feature helps separate compute and storage, making it easier to manage resources and keep systems running smoothly.
Data governance is important but can be lackluster if it doesn't show clear business benefits, making us rethink its role in today's data landscape.