The hottest Data Engineering Substack posts right now

And their main takeaways

How does Uber build real-time infrastructure to handle petabytes of data every day?

VuTrinh. • 659 implied HN points • 23 Mar 24

Uber handles huge amounts of data by processing real-time information from drivers, riders, and restaurants. This helps them make quick decisions, like adjusting prices based on demand.
They use a mix of open-source tools like Apache Kafka for data streaming and Apache Flink for processing, which allow them to scale their operations smoothly as the business grows.
Uber values data consistency, high availability, and quick response times in their infrastructure. This means they need reliable systems that work well even when they're overloaded with data.

Data Science Weekly - Issue 529

Data Science Weekly Newsletter • 999 implied HN points • 12 Jan 24

🕹 Technology Data science Artificial Intelligence Machine Learning Data Engineering Software Development

Using ChatGPT can help you budget better. It can track and categorize your spending easily.
When coding, it's important to find a balance between moving quickly and keeping your code well-structured. This is a real challenge for many developers.
Language models, like GPT-4, are becoming very advanced, but there are big philosophical questions about what that really means for intelligence and understanding.

GroupBy #43: Uber | Kafka - The Tiered Storage

VuTrinh. • 139 implied HN points • 09 Jul 24

🕹 Technology Data Engineering Software Development Cloud Computing Information Systems Big Data

Uber recently introduced Kafka Tiered Storage, which allows storage and compute resources to work separately. This means you can add storage without needing to upgrade processing power.
The tiered storage system has two parts: local storage for fast access and remote storage for long-term data. This setup helps manage data efficiently and keeps the local storage less cluttered.
Older data is read directly from remote storage when it is needed, which leaves local storage free to serve applications that need quick access to recent messages.
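
As a rough illustration of the local/remote split described above, here is a minimal sketch in plain Python. It is a toy model, not Kafka's or Uber's implementation: the `TieredLog` class, its `local_capacity` parameter, and the per-message eviction rule are all assumptions made for the example (real tiered storage works on log segments and retention policies).

```python
from collections import OrderedDict


class TieredLog:
    """Toy append-only log with a small local tier and an unbounded remote tier."""

    def __init__(self, local_capacity):
        self.local_capacity = local_capacity  # hypothetical knob: max messages kept locally
        self.local = OrderedDict()            # offset -> message (recent data)
        self.remote = {}                      # offset -> message (older data)
        self.next_offset = 0

    def append(self, message):
        offset = self.next_offset
        self.local[offset] = message
        self.next_offset += 1
        # Once the local tier is over capacity, move its oldest entries to the remote tier.
        while len(self.local) > self.local_capacity:
            old_offset, old_message = self.local.popitem(last=False)
            self.remote[old_offset] = old_message
        return offset

    def read(self, offset):
        """Return (tier, message); recent offsets are served from the local tier."""
        if offset in self.local:
            return "local", self.local[offset]
        return "remote", self.remote[offset]


log = TieredLog(local_capacity=2)
for i in range(5):
    log.append(f"msg-{i}")

print(log.read(4))  # a recent offset
print(log.read(0))  # an old offset
```

The point of the sketch is that growing `remote` never requires touching `local_capacity`, which mirrors the idea of scaling storage independently of the broker's own resources.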

GroupBy #44: Meta | The Data Stack

VuTrinh. • 119 implied HN points • 16 Jul 24

🕹 Technology Data Engineering Infrastructure Data Analytics Software Development Real-Time Processing

Meta uses a complex data warehouse to manage millions of tables and keeps data only as long as it's needed. Data is organized into namespaces for efficient querying.
They built tools like iData for data discovery and Scuba for real-time analytics. These tools help engineers find and analyze data quickly.
Data engineers at Meta develop pipelines mainly with SQL and Python, using internal tools for orchestration and monitoring to ensure everything runs smoothly.

Data Science Weekly - Issue 527

Data Science Weekly Newsletter • 959 implied HN points • 29 Dec 23

🕹 Technology Data science Machine Learning Artificial Intelligence Data Engineering Analytics

This week, there's a focus on using data science techniques for practical decision-making, highlighted by an interview with Steven Levitt, who discusses making tough choices using data.
There's a roundup of AI developments from 2023, showing how the field has evolved over the past year, which can help professionals stay updated.
Understanding data quality is essential, as it directly impacts how useful data is for decision-making and analysis in any organization.

GroupBy #40: Data Infrastructure at Airbnb

VuTrinh. • 179 implied HN points • 18 Jun 24

🕹 Technology Data Engineering Software Development Infrastructure Open Source Scalability

Airbnb focuses on using open-source tools and contributing back to the community. This helps them build a strong and collaborative data infrastructure.
Their data infrastructure prioritizes scalability and uses specific clusters for different types of jobs. This approach ensures that critical tasks run efficiently without overwhelming the system.
Airbnb has improved their data processing performance significantly, reducing costs while increasing speed. This was achieved through careful planning and migration of their Hadoop clusters.

Data Engineering in 2024. What I'm Seeing.

Joe Reis • 805 implied HN points • 13 Jan 24

🕹 Technology Data Engineering AI Integration Skills Development

Data engineering is evolving with more tooling abstraction to simplify development.
Focus on financial operations and adding business value is becoming crucial for data teams.
Traditional data practices like data modeling and governance are regaining importance.

How does Uber handle petabytes of Spark shuffle data every day?

VuTrinh. • 159 implied HN points • 22 Jun 24

🕹 Technology Data Engineering Big Data Cloud Computing Software Development Distributed Systems

Uber uses a Remote Shuffle Service (RSS) to handle large amounts of Spark shuffle data more efficiently. This means data is sent to a remote server instead of being saved on local disks during processing.
By changing how data is transferred, the new system reduces failures and extends hardware lifespan: servers can handle more jobs without crashing, and SSDs last longer.
RSS also streamlines the process for the reduce tasks, as they now only need to pull data from one server instead of multiple ones. This saves time and resources, making everything run smoother.
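
The push-based flow described above can be sketched with a small word-count example. This is a simplified illustration under stated assumptions, not Uber's RSS code: `ShuffleServer`, `partition_for`, and the one-server-per-partition layout are hypothetical names and choices made for the sketch.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 2  # assumption: one shuffle server per reduce partition


class ShuffleServer:
    """Toy remote shuffle server that stores pushed records in memory."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def push(self, partition_id, records):
        self.partitions[partition_id].extend(records)

    def pull(self, partition_id):
        return list(self.partitions[partition_id])


def partition_for(key):
    # Stable hash so every map task routes the same key to the same partition.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def run_map_task(words, servers):
    # Map side: push each record to the server that owns its partition
    # instead of writing shuffle files to local disk.
    for word in words:
        pid = partition_for(word)
        servers[pid].push(pid, [(word, 1)])


def run_reduce_task(pid, servers):
    # Reduce side: pull the whole partition from a single server.
    counts = defaultdict(int)
    for word, n in servers[pid].pull(pid):
        counts[word] += n
    return dict(counts)


servers = [ShuffleServer() for _ in range(NUM_PARTITIONS)]
run_map_task(["uber", "spark", "uber"], servers)
run_map_task(["spark", "shuffle", "uber"], servers)

totals = {}
for pid in range(NUM_PARTITIONS):
    totals.update(run_reduce_task(pid, servers))
print(totals)
```

Each reduce task contacts exactly one server, whereas a local-disk shuffle would have it fetch a slice from every machine that ran a map task.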

Data Science Weekly - Issue 554

Data Science Weekly Newsletter • 119 implied HN points • 04 Jul 24

🕹 Technology Data science AI Machine Learning Data Engineering Visualization

Staying updated in data science, AI, and machine learning is essential for improving skills and knowledge. Weekly newsletters provide curated articles and resources that help you keep up with the latest trends.
Effective structuring of data science teams can greatly enhance productivity. Learning from past experiences on team reorganizations can help in clarifying roles and increasing effectiveness.
Building interactive dashboards in Python can make data more accessible. Using tools like PostgreSQL and specific libraries can simplify the process and enhance data visualization.

Issue #10 - The Data Lifecycle

The Data Ecosystem • 159 implied HN points • 16 Jun 24

🕹 Technology Data science Data Management Data security Data Analysis Data Engineering Data Visualization

The data lifecycle includes all the steps from when data is created until it is no longer needed. This helps organizations understand how to manage and use their data effectively.
Different people and companies might describe the data lifecycle in slightly different ways, which can be confusing. It's important to have a clear understanding of what each term means in context.
Properly managing data involves stages like storage, analysis, and even disposal or archiving. This ensures data remains useful and complies with regulations.

Data Science Weekly - Issue 550

Data Science Weekly Newsletter • 179 implied HN points • 07 Jun 24

🕹 Technology Data science AI Machine Learning Computing Data Engineering

Curiosity in data science is important. It's essential to critically assess the quality and reliability of the data and models we use, especially when making claims about complex issues like COVID-19.
New fields, like neural systems understanding, are blending different disciplines to explore complex questions. This approach can help unravel how understanding works in both humans and machines.
Understanding AI advancements requires keeping track of evolving resources. It’s helpful to have a well-organized guide to the latest in AI learning resources as the field grows rapidly.

Data Science Weekly - Issue 555

Data Science Weekly Newsletter • 99 implied HN points • 11 Jul 24

🕹 Technology Data science AI Machine Learning Data Engineering Data Visualization

Large language models can sometimes create false or confusing information, a problem known as hallucination. Understanding the cause of these mistakes can help improve their accuracy.
Good data visualizations are important to effectively communicate patterns and insights. Poorly designed visuals can lead to misunderstandings, especially among those not familiar with graphics.
There's an ongoing debate about copyright in the context of generative AI. Many believe it would be better to focus on finding compromises rather than pursuing strict legal battles.

Data Science Weekly - Issue 552

Data Science Weekly Newsletter • 139 implied HN points • 20 Jun 24

🕹 Technology Data science Artificial Intelligence Machine Learning Data Engineering Data Visualization

Notebooks can be easy to use, but they might make you lazy in coding. It's important to follow good practices even when using them.
When handling large datasets, it's crucial to learn how to scale effectively. Knowing how to use resources wisely can help you reach your goals faster.
Retrieval Augmented Generation (RAG) can improve how models generate information. It's complex, but understanding it can boost the performance of your projects.

Data Science Weekly - Issue 556

Data Science Weekly Newsletter • 79 implied HN points • 18 Jul 24

🕹 Technology Data science Artificial Intelligence Machine Learning Programming Data Engineering

AI research in China is progressing rapidly, but it hasn't received much attention compared to developments in the US. There are many complexities in understanding the implications of this advancement.
There are new methods to improve large language models (LLMs) using production data, which can enhance their performance over time. A structured approach to analyzing data quality can lead to better outcomes.
Evaluating modern machine learning models can be challenging, leading to some questionable research practices. It's important to understand these issues to ensure more accurate and reproducible results.

The Architecture of Apache Druid

VuTrinh. • 139 implied HN points • 15 Jun 24

🕹 Technology Data Engineering Data architecture Big Data

Apache Druid is built to handle real-time analytics on large datasets, making it faster and more efficient than Hadoop for certain tasks.
Druid uses a variety of node types (real-time, historical, broker, and coordinator nodes) to manage data, process queries, and ensure everything runs smoothly.
The architecture allows for quick data retrieval while maintaining high availability and performance, making it a strong choice for applications that need fast, interactive data exploration.

sqlmesh init -t dlt --dlt-pipeline bluesky duckdb

davidj.substack • 71 implied HN points • 05 Dec 24

🕹 Technology Software Data Engineering APIs Databases Automation

Using dlt to work with Bluesky API allows for easy data extraction. It saves time by handling metadata and schema changes automatically.
dlt simplifies dealing with nested data by creating separate tables. This makes it easier to manage complex data structures.
sqlmesh can quickly generate SQL models based on dlt pipelines. This feature streamlines the workflow and reduces manual setup time.
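
To make the idea of splitting nested data into separate tables concrete, here is a minimal sketch in plain Python. It does not use dlt and is not how dlt is implemented; the `split_nested` helper, the `posts__tags` table name, and the `_row_id`, `_parent_id`, and `_position` columns are illustrative assumptions.

```python
def split_nested(table_name, rows):
    """Split rows with list-valued fields into a parent table and child tables."""
    tables = {table_name: []}
    for row_id, row in enumerate(rows):
        parent = {"_row_id": row_id}
        for field, value in row.items():
            if isinstance(value, list):
                # Each list field becomes its own child table, linked back to the parent row.
                child_rows = tables.setdefault(f"{table_name}__{field}", [])
                for position, item in enumerate(value):
                    child_rows.append(
                        {"_parent_id": row_id, "_position": position, "value": item}
                    )
            else:
                parent[field] = value
        tables[table_name].append(parent)
    return tables


posts = [
    {"author": "alice", "text": "hello", "tags": ["intro", "data"]},
    {"author": "bob", "text": "hi", "tags": []},
]
tables = split_nested("posts", posts)
for name, table_rows in tables.items():
    print(name, table_rows)
```

The parent table stays flat and easy to query, and the child table can be joined back through `_parent_id` when the nested values are needed.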

dlt windsurfing

davidj.substack • 71 implied HN points • 04 Dec 24

🕹 Technology Software Data Engineering AI Programming APIs

dlt is a Python tool that helps organize messy data into clear, structured datasets. It's easy to use and can quickly load data from many sources.
Using AI tools like Windsurf can make coding feel more collaborative. They help you find solutions faster and reduce the burden of coding from scratch.
Storing data in formats like parquet can make processing much quicker. Simplifying your data handling can save you a lot of time and resources.

sqlmesh migrate

davidj.substack • 47 implied HN points • 20 Dec 24

🕹 Technology Software Data Engineering Programming Cloud Computing Analytics

If you're using dbt to run analytics, switching to sqlmesh is a good idea. It offers more features and is easy to learn while still being compatible with dbt.
sqlmesh helps manage data environments and handles analytics tasks more comprehensively than dbt. It's simpler to transition from dbt to sqlmesh than from older methods like stored procedures.
When using sqlmesh, think about where to run it and how to store its state. You have choices like using a different database or a cloud service, which can save you money and hassle.

Data Science Weekly - Issue 549

Data Science Weekly Newsletter • 159 implied HN points • 31 May 24

🕹 Technology Data science Artificial Intelligence Machine Learning Data Engineering Cloud Computing Software Development

Mediocre machine learning can be very risky for businesses, as it may lead to significant financial losses. Companies need to ensure their ML products are reliable and efficient.
Understanding logistic regression can be made easier by using predicted probabilities. This approach helps in clearly presenting data analysis results, especially to those who may not be familiar with technical terms.
Data quality management is becoming essential in today's data-driven world. It's important to keep track of how data is tested and monitored to maintain trust and accuracy in business decisions.
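
The logistic regression takeaway above is about reporting predicted probabilities instead of raw coefficients. A minimal sketch of that conversion follows; the intercept, the coefficient, and the "months inactive" churn framing are made-up values for illustration, not results from the newsletter.

```python
import math


def predicted_probability(intercept, coefficients, features):
    """Convert a logistic regression's log-odds into a probability."""
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical fitted model: log-odds of churn = -2.0 + 0.5 * months_inactive
intercept, coefficients = -2.0, [0.5]
for months in (0, 4, 8):
    p = predicted_probability(intercept, coefficients, [months])
    print(f"{months} months inactive -> {p:.0%} predicted churn")
```

A statement like "customers inactive for 8 months have a much higher predicted churn than new customers" is usually easier for a non-technical audience than "the coefficient is 0.5 on the log-odds scale".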

Data Science Weekly - Issue 553

Data Science Weekly Newsletter • 99 implied HN points • 27 Jun 24

🕹 Technology Data science Artificial Intelligence Machine Learning Data Engineering Data Visualization

Data visualization can show important patterns, like changes in night and daylight globally. Understanding these trends helps us appreciate our environment better.
In AI engineering, simplifying data preparation is crucial. Many new AI applications can be built without structured data, which might lead to rushed expectations about their effectiveness.
Aquaculture technology is evolving with better methods to track and analyze fish behavior. New approaches like deep learning are making monitoring more accurate and efficient.

Growing From Analyst To Data Engineer In 100 Days

SeattleDataGuy’s Newsletter • 1165 implied HN points • 02 Jan 24

🕹 Technology Data Engineering Skills Development Community Building

Breaking into data engineering may be easier through lateral moves, like from data analyst to data engineer.
The 100-day plan discussed is not meant to master data engineering but to help commit to learning and identify areas for improvement.
The plan includes reviewing basics, diving deeper, building a mini project, surveying tools, best practices, and committing to a final project.

Data Science Weekly - Issue 541

Data Science Weekly Newsletter • 279 implied HN points • 05 Apr 24

🕹 Technology Data science AI Machine Learning Software Development Data Engineering

AI agents have unique challenges that traditional laws may not effectively solve. New rules and systems are needed to ensure they are managed properly.
JS-Torch is a new JavaScript library that makes deep learning easier for developers familiar with PyTorch. It allows building and training neural networks directly in the browser.
Data acquisition is crucial for AI start-ups to succeed. There are strategies outlined to help these businesses gather the right data efficiently.

GroupBy #41: Uber’s Batch Data Infrastructure with Google Cloud Platform

VuTrinh. • 99 implied HN points • 25 Jun 24

🕹 Technology Data Engineering Cloud Computing Machine Learning Infrastructure Analytics

Uber is moving its huge amount of data to Google Cloud to keep up with its growth. They want a smooth transition that won't disrupt current users.
They are using existing technologies to make sure the change is easy. This includes tools that will help keep data safe and accessible during the move.
Managing costs is a big concern for Uber. They plan to track and control spending carefully as they switch to cloud services.

Data Science Weekly - Issue 543

Data Science Weekly Newsletter • 219 implied HN points • 19 Apr 24

🕹 Technology Data science Machine Learning AI Analytics Data Engineering

Statistical ideas have a big impact on the world. Learning about important papers can help us understand how statistics shape modern research and decision-making.
Machine Learning teams have different roles that face unique challenges. Understanding these personas can help leaders support their teams better.
Using vector embeddings can greatly improve search experiences in apps. They simplify tasks, such as matching a query to related content, that previously seemed too complex.
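
A minimal sketch of embedding-based search, assuming cosine similarity as the ranking measure. The three-dimensional vectors below are hand-written stand-ins for real model embeddings, and the `documents` dictionary and `search` helper are hypothetical names for the example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Assumption: in a real app these vectors would come from an embedding model.
documents = {
    "reset your password": [0.9, 0.1, 0.0],
    "update billing details": [0.1, 0.9, 0.1],
    "export data to CSV": [0.0, 0.2, 0.9],
}


def search(query_vector, top_k=1):
    """Rank documents by similarity to the query vector and return the best matches."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, documents[doc]),
        reverse=True,
    )
    return ranked[:top_k]


print(search([0.8, 0.2, 0.1]))
```

Because matching happens in vector space, a query does not need to share any exact keywords with the document it retrieves.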

I spent 5 hours understanding more about the Delta Lake table format

VuTrinh. • 179 implied HN points • 04 May 24

🕹 Technology Data Engineering Database Performance optimization Software Development

Delta Lake is designed to solve problems with traditional cloud object storage. It provides ACID transactions, making data operations like updates and deletions safe and reliable.
Using Delta Lake, data is stored in Apache Parquet format, allowing for efficient reading and writing. The system tracks changes through a transaction log, which keeps everything organized and easy to manage.
Delta Lake supports advanced features like time travel, allowing users to see and revert to past versions of data. This makes it easier to recover from mistakes and manage data over time.
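
The transaction log and time travel ideas above can be illustrated with a toy log that records which data files each commit adds and removes. This is a sketch of the concept only, not Delta Lake's actual log format or API; `ToyTableLog`, `commit`, and `files_at` are hypothetical names.

```python
class ToyTableLog:
    """Toy transaction log: each commit lists data files added and removed."""

    def __init__(self):
        self.commits = []  # the commit at index i produces table version i

    def commit(self, add=(), remove=()):
        self.commits.append({"add": list(add), "remove": list(remove)})
        return len(self.commits) - 1

    def files_at(self, version=None):
        """Replay the log up to `version` to find the files that make up the table."""
        if version is None:
            version = len(self.commits) - 1
        live = set()
        for entry in self.commits[: version + 1]:
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return sorted(live)


log = ToyTableLog()
v0 = log.commit(add=["part-0.parquet", "part-1.parquet"])
# An update rewrites the affected file: the old one is removed, a new one is added.
v1 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])

print(log.files_at())    # current version of the table
print(log.files_at(v0))  # "time travel" back to the first version
```

Old files are only marked as removed in the log rather than overwritten, which is what makes reading an earlier version possible.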

Data Science Weekly - Issue 548

Data Science Weekly Newsletter • 139 implied HN points • 24 May 24

🕹 Technology Data science Machine Learning Artificial Intelligence Data Visualization Data Engineering

Good communication is key for statisticians to explain their complex work to non-experts. Finding ways to relate data to everyday situations can make it easier for others to understand.
Using histograms can speed up the training process for gradient boosted machines in data science. This simple technique can improve efficiency significantly.
There are efforts to use machine learning algorithms to detect type 1 diabetes in children earlier. This can help avoid serious health issues by improving recognition of symptoms.
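
For the histogram technique mentioned above for gradient boosted machines, the following sketch shows the core idea: bucket a feature into a few bins, accumulate gradient statistics per bin, and evaluate split candidates only at bin boundaries. It assumes squared-error loss with unit hessians and no regularization, and `best_split_histogram` is a hypothetical helper, not any library's API.

```python
def best_split_histogram(values, gradients, num_bins=4):
    """Find a split threshold by scanning bin boundaries instead of every value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        width = 1.0  # all values equal; avoid dividing by zero below

    # Build the histogram: gradient sum and row count per bin.
    grad_sum = [0.0] * num_bins
    count = [0] * num_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), num_bins - 1)
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), sum(count)
    best_gain, best_threshold = 0.0, None
    left_g, left_n = 0.0, 0
    # Only num_bins - 1 candidate splits are evaluated, regardless of row count.
    for b in range(num_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g**2 / left_n + right_g**2 / right_n - total_g**2 / total_n
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_threshold, best_gain


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
gradients = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
threshold, gain = best_split_histogram(values, gradients)
print(threshold, gain)
```

The speed-up comes from the scan: after one pass to fill the bins, the cost of finding a split depends on the number of bins rather than the number of rows.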

GroupBy #38: Modernizing Uber’s Batch Data Infrastructure with Google Cloud Platform, Apache Iceberg - What Is It

VuTrinh. • 119 implied HN points • 04 Jun 24

🕹 Technology Data Engineering Cloud Computing Data Infrastructure Machine Learning Open Source

Uber is upgrading its data system by moving from its huge Hadoop setup to Google Cloud Platform for better efficiency and performance.
Apache Iceberg is an important tool for managing data efficiently, and it can help create a more organized data environment.
Building data products requires a strong foundation in data engineering, which includes understanding the tools and processes involved.

Procella - The query engine at YouTube

VuTrinh. • 79 implied HN points • 29 Jun 24

🕹 Technology Data Engineering Cloud Computing Database Systems Analytics

YouTube built Procella to combine different data processing needs into one powerful SQL query engine. This means they can handle many tasks, like analytics and reporting, without needing separate systems for each task.
Procella is designed for high performance and scalability by keeping computing and storage separate. This makes it faster and more efficient, allowing for quick data access and analysis.
The engine uses clever techniques to reduce delays and improve response times, even when many users are querying at once. It constantly optimizes and adapts, making sure users get their data as quickly as possible.

GroupBy #36: Agoda- How We Solve Load Balancing Challenges in Apache Kafka, How to reduce your Snowflake cost

VuTrinh. • 139 implied HN points • 21 May 24

🕹 Technology Data Engineering Software Development Cloud Computing Infrastructure Cost Optimization

Working on pet projects is fun, but it's important to have clear learning goals to actually gain knowledge from them.
When using tools like Spark or Airflow, always ask what problem they solve to understand their value better.
To make your projects more effective, think like a user and check if they get what they need from your data systems.

Introduction to Write-Audit-Publish Pattern

Data Engineering Central • 432 implied HN points • 15 Jan 24

🕹 Technology Data Engineering

The concept of Write-Audit-Publish (WAP) is being discussed for data pipelines.
The post explores whether the WAP pattern is worth implementing and considers alternative approaches.
Data Engineering Central emphasizes reader support for new posts and offers subscription options.
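
For readers unfamiliar with the Write-Audit-Publish pattern, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` and `orders_staging` tables and the audit rule (no NULL ids, no negative amounts) are assumptions made for the example; the post itself may describe the pattern with different tools.

```python
import sqlite3


def write_audit_publish(conn, rows):
    """Load a batch into staging, validate it, and publish only if it passes."""
    cur = conn.cursor()

    # Write: load the batch into a staging table that consumers never read.
    cur.execute("DROP TABLE IF EXISTS orders_staging")
    cur.execute("CREATE TABLE orders_staging (order_id INTEGER, amount REAL)")
    cur.executemany("INSERT INTO orders_staging VALUES (?, ?)", rows)
    conn.commit()

    # Audit: run data quality checks against the staged batch.
    bad_rows = cur.execute(
        "SELECT COUNT(*) FROM orders_staging WHERE order_id IS NULL OR amount < 0"
    ).fetchone()[0]
    if bad_rows:
        return False  # staging is left in place for inspection; nothing is published

    # Publish: copy the audited batch into the table consumers query.
    cur.execute("INSERT INTO orders SELECT order_id, amount FROM orders_staging")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")

ok_good = write_audit_publish(conn, [(1, 9.5), (2, 20.0)])
ok_bad = write_audit_publish(conn, [(3, -5.0)])  # fails the audit
published = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(ok_good, ok_bad, published)
```

The trade-off the post weighs is visible even here: the pattern adds an extra table and an extra copy step in exchange for keeping bad batches away from consumers.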

Data Science Weekly - Issue 539

Data Science Weekly Newsletter • 259 implied HN points • 22 Mar 24

🕹 Technology Data science AI Machine Learning Data Engineering Data Visualization

Data storytelling is important for sharing insights, and AI can help people create better stories. The research looks at how different tools assist in each storytelling stage.
Switching from R to Python in data science isn't just about learning new syntax; it's a mindset change. New Python tools can help make this transition smoother for users coming from R's tidyverse.
Emerging technologies often face skepticism, as seen throughout history. New inventions have raised concerns about their impact, but they eventually become part of everyday life.

Data Science Weekly - Issue 532

Data Science Weekly Newsletter • 379 implied HN points • 02 Feb 24

🕹 Technology Data science Machine Learning Artificial Intelligence Data Engineering Data Visualization

Forecasting in data science is challenging because time series data can be non-stationary. Using the right evaluation methods can help bridge the gap between traditional and modern forecasting techniques.
It's worth considering how sophisticated your data structures really need to be. Building overly complicated dashboards that ultimately produce simple outputs may not be the best use of time.
There are clear distinctions between well-built data pipelines and amateur setups. Understanding what makes a pipeline production-grade can improve the quality and reliability of data processing.

Why DuckDB is losing to Polars

Data Engineering Central • 373 implied HN points • 29 Jan 24

🕹 Technology Data Engineering Data Tools Community Data Pipelines Technology Adoption

Technology innovations come from solving problems and gaining popularity.
Community engagement and real-world usage are important factors in tool evaluation.
Polars is gaining traction over DuckDB due to its versatility and widespread adoption.

Monthly Python Data Engineering, September 2024

Monthly Python Data Engineering • 2 HN points • 26 Sep 24

🕹 Technology Data Engineering Open Source Python Software Development Libraries

A new free book called 'How Data Platforms Work' is being created for Python developers. It will explain the inner workings of data platforms in simple terms, with one chapter released each month.
The Ibis library has removed the Pandas backend and now uses DuckDB, which is faster and has fewer dependencies. This change is expected to improve performance and usability.
Several popular libraries in Python, such as GreatTables and Shiny, have released updates with new features and improvements, focusing on better usability and integration with modern technologies.

Data Science Weekly - Issue 533

Data Science Weekly Newsletter • 339 implied HN points • 09 Feb 24

🕹 Technology Data science Machine Learning Artificial Intelligence AI Research Data Engineering

Satellite data is important for machine learning and should be treated as a unique area of research. Recognizing this can help improve how we use this data.
Many data science and machine learning projects fail from the start due to common mistakes. Learning from past experiences can help increase the chances of success.
Open source software plays a crucial role in advancing AI technology. It's important to support and protect open source AI from regulations that could harm its progress.

LLMs Part 2 - Fine Tuning OpenLLaMA

Data Engineering Central • 393 implied HN points • 16 Jan 24

🕹 Technology Machine Learning Data Engineering AI GPU Model Training

LLMs require fine-tuning to adapt to specific tasks or styles.
Data Engineers play a vital role in preparing data for LLMs.
Training LLMs involves setting up environments and automating tasks, and it requires a lot of data engineering skill.

How to grow from a mid-level to senior Data Engineer

SeattleDataGuy’s Newsletter • 694 implied HN points • 14 Feb 24

🕹 Technology Data Engineering Software Engineering Learning Teamwork Leadership

To grow from mid to senior level, it's important to continuously learn and improve, share new knowledge, work on code improvements, and become an expert in a certain domain.
Making the team better is crucial - focus on mentoring, sharing knowledge, and creating a positive team environment. Think beyond individual tasks to impact the overall team outcomes.
Seniority includes building not just technical solutions, but solutions that customers love. Challenge requirements, understand the business and product, and take initiative in problem-solving.

Data Science Weekly - Issue 544

Data Science Weekly Newsletter • 159 implied HN points • 26 Apr 24

🕹 Technology Data science Machine Learning Artificial Intelligence Data Engineering Data Visualization

Evaluating AI models can be expensive, but tools like lm-buddy and Prometheus make it possible on cheaper hardware.
Installing and deploying LLaMA 3 is made simple with clear guides that cover everything from setup to scaling effectively.
Understanding best practices in machine learning is essential, and resources like the 'Rules of Machine Learning' provide valuable guidelines for beginners.

Finding Data Bugs in dbt Pull Requests

clkao@substack • 39 implied HN points • 17 Aug 24

🕹 Technology Data Engineering Software Development Open Source Machine Learning Quality Assurance

Data bugs can be costly for companies, with bad data potentially costing up to 25% of their revenue. These issues often arise from problems in data-centric systems like dbt.
Using dbt allows data engineers to implement software practices like version control and testing, helping to ensure the correctness of their data transformations. However, relying solely on post-processing tests has its limits.
Manual spot checks are still crucial in ensuring data accuracy during code reviews. Tools like Recce aim to streamline this process, making it easier for developers to validate and document their changes.

Are Data Contracts For Real?

Data Engineering Central • 294 implied HN points • 05 Feb 24

🕹 Technology Data Engineering Data Contracts APIs Data Quality Data Tools

Data Contracts may not be widely adopted in the data engineering community.
The idea behind Data Contracts is to enforce trustworthiness and consistency in data.
The challenge with Data Contracts seems to be the complexity and adoption of specific technologies.