The hottest Data Engineering Substack posts right now

And their main takeaways

LLMs Part 2 - Fine Tuning OpenLLaMA

Data Engineering Central • 393 implied HN points • 16 Jan 24

🕹 Technology Data Engineering

LLMs require fine-tuning to adapt to specific tasks or styles.
Data Engineers play a vital role in preparing data for LLMs.
Training LLMs involves setting up environments and automating tasks, and it requires a lot of data engineering skills.

Is It Time to Say Goodbye to Data Engineers?

SeattleDataGuy’s Newsletter • 812 implied HN points • 06 Feb 25

🕹 Technology Data Engineering

Data engineers are often seen as roadblocks, but cutting them out can lead to major problems later on. Without them, the data can become messy and unmanageable.
Initially, removing data engineers may seem like a win because things move quickly. However, this speed can cause chaos as data quality suffers and standards break down.
A solid data strategy needs structure and governance. Rushing without proper planning can lead to a situation where everything collapses under the weight of disorganization.

Data Science Weekly - Issue 544

Data Science Weekly Newsletter • 159 implied HN points • 26 Apr 24

🕹 Technology Data Engineering

Evaluating AI models can be expensive, but tools like lm-buddy and Prometheus make it feasible on cheaper hardware.
Installing and deploying LLaMA 3 is made simple with clear guides that cover everything from setup to scaling effectively.
Understanding best practices in machine learning is essential, and resources like the 'Rules of Machine Learning' provide valuable guidelines for beginners.

Finding Data Bugs in dbt Pull Requests

clkao@substack • 39 implied HN points • 17 Aug 24

🕹 Technology Data Engineering

Data bugs can be costly for companies, with bad data potentially costing up to 25% of their revenue. These issues often arise from problems in data-centric systems like dbt.
Using dbt allows data engineers to implement software practices like version control and testing, helping to ensure the correctness of their data transformations. However, relying solely on post-processing tests has its limits.
Manual spot checks are still crucial in ensuring data accuracy during code reviews. Tools like Recce aim to streamline this process, making it easier for developers to validate and document their changes.

Speed Without Understanding - One of the Biggest Risks in Data Engineering

SeattleDataGuy’s Newsletter • 329 implied HN points • 30 Jun 25

🕹 Technology Data Engineering

Speed in data engineering can be risky. Acting fast without fully understanding the consequences can lead to mistakes, like accidentally deleting important data.
Every new tool or change can add complexity. If something breaks, it may cause confusion for others, so it’s important to think carefully about what you build.
Having a mix of experienced and new team members is really helpful. It encourages sharing knowledge and can prevent big errors when someone leaves the team.

Are Data Contracts For Real?

Data Engineering Central • 294 implied HN points • 05 Feb 24

🕹 Technology Data Engineering

Data Contracts may not be widely adopted in the data engineering community.
The idea behind Data Contracts is to enforce trustworthiness and consistency in data.
The challenge with Data Contracts seems to be the complexity and adoption of specific technologies.

Beyond Big Tech: The Reality Of Data Engineering Outside Silicon Valley

SeattleDataGuy’s Newsletter • 847 implied HN points • 14 Dec 24

🕹 Technology Data Engineering

Working in big tech offers many advantages like better tools and a strong focus on data. This environment makes it easier to get work done quickly and efficiently.
Many companies outside big tech struggle with data because it's not their main focus. They often use a mix of different tools that don't work well together, leading to confusion.
Without a strong data leader, companies may find it hard to prioritize data spending. If data isn't tied to profits, it's tougher to justify investing time and money into it.

Data Science Weekly - Issue 546

Data Science Weekly Newsletter • 119 implied HN points • 10 May 24

🕹 Technology Data Engineering

Time-series analysis and Gaussian processes are powerful tools for interpreting data. They allow for flexibility and control in modeling data, making them essential for data practitioners.
Understanding A/B testing is crucial for making informed business decisions. Using a reliable experimentation system can save time and lead to better results.
New advancements in AI and data science are enhancing applications in various fields, like biomedical research and recommendation systems. These innovations help combine human creativity with machine learning capabilities.

Open Source Data Engineering Landscape 2024

Practical Data Engineering Substack • 299 implied HN points • 28 Jan 24

🕹 Technology Data Engineering

The open-source data engineering landscape is growing fast, with many new tools and frameworks emerging. Staying updated on these tools is important for data engineers to pick the best options for their needs.
There are different categories of open-source tools like storage systems, data integration, and workflow management. Each category has established players and new contenders, helping businesses solve specific data challenges.
Emerging trends include decoupling storage and compute resources and the rise of unified data lakehouse layers. These advancements make data storage and processing more efficient and flexible.

Data Science Weekly - Issue 540

Data Science Weekly Newsletter • 179 implied HN points • 29 Mar 24

🕹 Technology Data Engineering

SQL is seen as an easier way to write relational algebra, but it's not ideal for building new query tools. Understanding its limits can help in learning and using SQL better.
Many successful companies have developed their own AI models, showing a trend in the tech industry. Knowing about these companies can give insights into future developments in AI.
Binary vector search methods can save a lot of memory compared to traditional methods. However, it's important to balance memory savings with maintaining accuracy.
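
As a rough illustration of the memory trade-off in the last takeaway, the sketch below assumes sign-based binarization (one bit per dimension) and Hamming distance for comparison. Neither detail is taken from the linked issue, the function names are hypothetical, and only the Python standard library is used.

```python
def binarize(vec):
    """Pack the sign of each component into an int, one bit per dimension."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Count the bit positions in which two packed vectors differ."""
    return bin(a ^ b).count("1")


def float32_bytes(dim):
    """Storage for one float32 vector: 4 bytes per dimension."""
    return dim * 4


def binary_bytes(dim):
    """Storage for one packed binary vector: 1 bit per dimension, rounded up."""
    return (dim + 7) // 8


def nearest(query, docs):
    """Index of the document whose binarized form is closest to the query."""
    q = binarize(query)
    return min(range(len(docs)), key=lambda i: hamming(q, binarize(docs[i])))
```

For a 1024-dimensional embedding this works out to 4096 bytes as float32 versus 128 bytes packed, a 32x reduction. The magnitudes are discarded in the process, which is where the accuracy trade-off comes from.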

5 Habits Of Highly Effective Data Engineers To Master in 2025

SeattleDataGuy’s Newsletter • 800 implied HN points • 20 Dec 24

🕹 Technology Data Engineering

Being proactive means solving problems before they become bigger issues. If you see something that can be improved, go ahead and make that change instead of waiting for someone else to do it.
Make sure your contributions are visible, so people recognize your work. Share your successes and updates with your team and leadership to build a stronger reputation.
Become the go-to person for a specific area in your company. Focus on something valuable that can help others succeed, and make sure to share your knowledge and support with your team.

Data Science Weekly - Issue 538

Data Science Weekly Newsletter • 199 implied HN points • 14 Mar 24

🕹 Technology Data Engineering

Serverless computing can handle big tasks without limits, but it also brings challenges like managing large uploads effectively.
Art careers can be influenced by the reputation of institutions, with established artists facing less access to elite spaces early on compared to newcomers.
Learning about LLM evaluation metrics can help improve understanding and performance when working with large language models.

Data Science Weekly - Issue 525

Data Science Weekly Newsletter • 359 implied HN points • 15 Dec 23

🕹 Technology Data Engineering

Learning about causal models is important in data analysis because it helps explain what caused the data. This understanding can improve how we interpret results using Bayesian methods.
There's growing concern over data privacy in AI-enabled tools like Dropbox. Users are worried their private files could be used for AI training, even though companies deny this.
Netflix recently held a Data Engineering Forum to share best practices. They discussed ways to improve data pipelines and processing, which could benefit many in the data engineering community.

Apache Iceberg Isn't Coming To Save You

SeattleDataGuy’s Newsletter • 341 implied HN points • 27 May 25

🕹 Technology Data Engineering

Apache Iceberg might seem appealing, but it won't automatically solve your data problems. It's important to really understand what issues you're trying to address before jumping in.
Switching to new tools like Iceberg won't fix a broken data strategy. The focus should be on delivering real business value, not just adopting the latest technology.
If your data team is already doing well and looking to improve, Iceberg could be useful. But make sure it's the right fit for your specific challenges instead of following trends.

Issue #2 - The Data Ecosystem: Where do you even start?

The Data Ecosystem • 119 implied HN points • 21 Apr 24

🕹 Technology Data Engineering

Data can be really complicated, and it's easy to miss how everything connects. People often focus on their own area and forget about the bigger picture of the data ecosystem.
Chief Data Officers (CDOs) are important but can only do so much to fix data issues. They deal with many challenges, including limited power, lack of experience, and politics within the organization.
To improve in the data field, we need to recognize the gaps in our knowledge, prioritize what to focus on, and continuously educate ourselves in both our own areas and related data domains.

7 Lessons I Learned the Hard Way From 9+ Years as a Data Engineer

SeattleDataGuy’s Newsletter • 730 implied HN points • 21 Nov 24

🕹 Technology Data Engineering

It's important to avoid building complex systems just for the sake of it. Focus on creating infrastructure that actually helps your team and the business.
If you don’t plan your data model, you’ll end up with a messy one. Always take the time to design it properly to make future work easier.
Good communication is really powerful. Being able to share your ideas clearly can help you get support and make a bigger impact in your projects.

SwirlAI Table of Contents

SwirlAI Newsletter • 432 implied HN points • 28 Jun 23

🕹 Technology Data Engineering

The newsletter provides a Table of Contents with more than 90 topics, making it easier to find the content of interest.
Topics covered include Data Engineering fundamentals, Spark architecture, Kafka use cases, MLOps deployment processes, System Design examples, and more.
If interested, it's recommended to support the author's work by subscribing and sharing the content.

Long-Chain Marketing: How Data Engineering & Data Management Create Value For The Business

High ROI Data Science • 297 implied HN points • 10 Jan 24

💼 Business Data Engineering

Understanding the long-chain in marketing is crucial for connecting business outcomes with data and metrics.
Data engineering and knowledge management are essential for transforming data into valuable assets that can be monetized by the business.
Long-chain marketing involves seeing marketing efforts as part of a longer sequence of actions that lead to business outcomes, rather than standalone events.

GroupBy #39: 2000+ DBT models in airflow; Serverless Jupyter Notebooks at Meta

VuTrinh. • 59 implied HN points • 11 Jun 24

🕹 Technology Data Engineering

Meta has developed a serverless Jupyter Notebook platform that runs directly in web browsers, making data analysis more accessible.
Airflow is being used to manage over 2000 dbt models, which helps teams create and maintain their own data models effectively.
Building a data platform from scratch can be a valuable learning experience, revealing important lessons about data structure and management.

SAI Notes #07: What is a Vector Database?

SwirlAI Newsletter • 412 implied HN points • 18 Jun 23

🕹 Technology Data Engineering

Vector Databases are essential for working with Vector Embeddings in Machine Learning applications.
Partitioning and Bucketing are important concepts in Spark for efficient data storage and processing.
Vector Databases have various real-life applications, from natural language processing to recommendation systems.

Data Science Weekly - Issue 518

Data Science Weekly Newsletter • 379 implied HN points • 27 Oct 23

🕹 Technology Data Engineering

Web development is evolving with the use of local models and technologies for building applications, moving beyond just Python-based machine learning.
It's becoming increasingly important for developers to understand GPUs since they're widely used in deep learning and can greatly enhance performance.
Companies are exploring various use cases for generative AI that provide real value, focusing on practical implementations that drive return on investment.

Data Science Weekly - Issue 531

Data Science Weekly Newsletter • 219 implied HN points • 26 Jan 24

🕹 Technology Data Engineering

AI often gets criticized for the quality of its output, but that might not be the real issue people have with it. If quality is fixed, the conversation about AI could change significantly.
Common sense is tricky to define and measure, but researchers are developing ways to quantify it both individually and collectively. This could help clarify how we understand common sense in different contexts.
Large language models (LLMs) can transform education by encouraging hands-on learning. They offer opportunities for more interactive and engaging learning experiences.

Data Science Weekly - Issue 524

Data Science Weekly Newsletter • 299 implied HN points • 08 Dec 23

🕹 Technology Data Engineering

Data engineering is evolving with new design patterns that help improve efficiency in handling data. A new book dives into these patterns and their importance.
Machine learning is being used to understand and control the movement of silicon atoms in materials, which could lead to advancements in technology like better electronics.
A new model called PoseGPT can estimate 3D human poses from images and text, linking physical movements to broader concepts about humans, showing the capabilities of large language models.

Running dbt Core on EC2

The Orchestra Data Leadership Newsletter • 79 implied HN points • 16 May 24

🕹 Technology Data Engineering

A guide to running dbt Core on AWS EC2 using Orchestra, with setup and monitoring steps.
Key infrastructure requirements for hosting dbt Core on EC2 with Orchestra.
IAM permissions needed for setting up Orchestra and the EC2 instance to run dbt Core commands.

Why Today Is The Perfect Time to Learn Data | Seattle Data Guy

Data Analysis Journal • 373 implied HN points • 25 Oct 23

🕹 Technology Data Engineering

Resources for learning data skills are better and more accessible now than in past years.
For transitioning into data engineering, focus on SQL, programming, data warehousing, and data pipelines.
Analysts should focus on understanding the business problem, building maintainable systems, and following a data analytics process.

Artificial Intelligence is ushering in a new era of web scraping possibilities

The Orchestra Data Leadership Newsletter • 79 implied HN points • 14 May 24

🕹 Technology Data Engineering

Artificial Intelligence is revolutionizing web scraping by accelerating development and increasing the adoption of scraping use cases in data work.
The complexity of parsing HTML and the challenges associated with web scraping, such as changing schemas, time durations, and legality, can be mitigated with AI-enabled tools.
AI-enabled web scraping tools like Nimble and Diffbot provide reliable solutions for efficiently extracting data from the internet and handling challenges like managing proxies and optimizing scraping speed.

MLOps Basics - For Data Engineers.

Data Engineering Central • 393 implied HN points • 15 May 23

🕹 Technology Data Engineering

Working on Machine Learning as a Data Engineer is not as hard as it seems; the difficulty is moderate.
Machine Learning work for Data Engineers focuses on MLOps like feature stores, model prediction, automation, and metadata storage.
The key aspects of MLOps include automating tasks, using tools like Apache Airflow, and managing metadata for a stable ML environment.

Levels of Data Freshness in Machine Learning Systems

SwirlAI Newsletter • 373 implied HN points • 09 Jul 23

🕹 Technology Data Engineering

Data freshness is crucial in machine learning systems to provide accurate and valuable insights.
Different levels of feature freshness exist in ML systems, each with its own investments and complexities.
Starting with simpler models and gradually moving to more real-time systems can be more cost-effective and efficient.

SAI #26: Partitioning and Bucketing in Spark (Part 1)

SwirlAI Newsletter • 373 implied HN points • 15 Apr 23

🕹 Technology Data Engineering

Partitioning and bucketing are two key data distribution techniques in Spark.
Partitioning improves performance by letting Spark skip reading the entire dataset when only part of it is needed.
Bucketing is beneficial for collocating data and avoiding shuffling in operations like joins and groupBys.
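
The two techniques above can be sketched in plain Python. This is a conceptual model, not Spark code: the helper names are hypothetical and the CRC32-based bucket assignment is an assumption made for determinism, not Spark's actual hash function. In Spark itself the corresponding writer methods are `partitionBy` and `bucketBy`.

```python
from collections import defaultdict
import zlib


def partition_by(rows, column):
    """Group rows by a column value so a reader can load one group only."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[column]].append(row)
    return parts


def bucket_of(key, num_buckets):
    """Deterministic bucket number for a key (CRC32 is an illustrative choice)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucket_by(rows, column, num_buckets):
    """Distribute rows into a fixed number of buckets by hashing one column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[column], num_buckets)].append(row)
    return buckets


def bucketed_join(left, right, column, num_buckets):
    """Join bucket i of the left side only with bucket i of the right side.

    Equal keys always hash to the same bucket number, so no row has to be
    compared against rows in a different bucket (no shuffle in this model).
    """
    joined = []
    left_buckets = bucket_by(left, column, num_buckets)
    right_buckets = bucket_by(right, column, num_buckets)
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        index = defaultdict(list)
        for r in right_bucket:
            index[r[column]].append(r)
        for l in left_bucket:
            for r in index.get(l[column], []):
                joined.append({**l, **r})
    return joined
```

Reading `partition_by(events, "date")["2024-01-01"]` touches only the rows for that date, which models partition pruning. `bucketed_join` models why pre-bucketed tables with the same bucket count can be joined bucket by bucket.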

Data Science Weekly - Issue 537

Data Science Weekly Newsletter • 139 implied HN points • 07 Mar 24

🕹 Technology Data Engineering

The newsletter shares valuable links about Data Science, AI, and Machine Learning each week. It's a great way to keep updated on the latest in the field.
There are interesting articles highlighting statistical analyses and practical guides, like building GPU clusters at home. These resources help both beginners and experienced practitioners learn more.
The newsletter also encourages people to participate in AI-related events and offers resources for job seekers. This can help you connect with others and grow your career.

GroupBy #37: Composable data management at Meta, How Uber Accomplishes Job Counting At Scale

VuTrinh. • 59 implied HN points • 28 May 24

🕹 Technology Data Engineering

When learning something new, it's good to start by asking yourself why you want to learn it. This helps set clear goals and expectations.
Focusing on one topic at a time can make learning easier. Instead of spreading your time thin, dive deep into one subject.
It's okay to feel stuck sometimes while learning. Just keep pushing through, relax, and remember that learning is a journey that takes time.

Introduction To Analytics Engineering

Data Analysis Journal • 353 implied HN points • 22 Mar 23

🕹 Technology Data Engineering

Analytics engineers bridge the gap between data engineers and data analysts by focusing on producing high-quality data.
Analytics engineers use tools like dbt to streamline data modeling, testing, and documentation.
Data quality is crucial in decision-making, making analytics engineering more important than ever.

Data Science Weekly - Issue 509

Data Science Weekly Newsletter • 399 implied HN points • 25 Aug 23

🕹 Technology Data Engineering

Each week, a newsletter shares important links and articles about data science, machine learning, and AI. It's a good way to keep updated on new happenings in the field.
The newsletter features articles on various topics, including programming, AI forecasting, and data management practices. These articles are meant to help both newcomers and experienced professionals.
Job listings and training resources are also provided, helping readers find opportunities and learn new skills beneficial for their careers in data science.

Data Science Weekly - Issue 514

Data Science Weekly Newsletter • 339 implied HN points • 29 Sep 23

🕹 Technology Data Engineering

Data science involves a mix of techniques for analyzing and visualizing data, which can help in making informed decisions.
Learning about advanced customer segmentation methods can enhance how businesses understand and target their customers.
There are various roles in data-related careers beyond just being a data scientist, so it's good to explore different paths.

Data Science Weekly - Issue 519

Data Science Weekly Newsletter • 299 implied HN points • 03 Nov 23

🕹 Technology Data Engineering

Companies are increasingly sharing their advanced AI models openly, which can help them improve and build better products. This open sharing can lead to a more cooperative tech environment.
Data science job applications are extremely competitive, with many positions receiving thousands of applicants within a day. This shows a high interest and demand in the data science field.
Exploring advanced tools and frameworks in AI can be complex, but understanding how they work can help in building effective applications, especially in question-answering systems.

Why did Databricks build the Photon engine?

VuTrinh. • 99 implied HN points • 06 Apr 24

🕹 Technology Data Engineering

Databricks created the Photon engine to simplify data management by combining the benefits of data lakes and data warehouses into one system called the Lakehouse. This makes it easier and cheaper for companies to manage their data all in one place.
Photon is designed to handle various types of raw data and is built with a vectorized approach instead of the traditional Java methods. This means it can work faster and better with different kinds of data without getting bogged down.
To ensure that existing customers using Apache Spark can easily switch to Photon, the new engine is integrated with Spark’s system. This allows users to continue running their current processes while benefiting from the speed of Photon.

Lean Data Engineering with Dagster and DuckDB

The Data Jargon Newsletter • 158 implied HN points • 05 Mar 24

🕹 Technology Data Engineering

Data lakes can be convenient but often lead to problems when trying to manage the data effectively. Keeping things simple with familiar tools can help make the data more useful.
Using Dagster and DuckDB allows you to process data efficiently without complicated setups. You can do key tasks like aggregation and data cleaning right in your data flow.
It's important to consider memory limits and choose the right file formats, like Parquet, for better processing. This way, you can keep your data pipeline running smoothly and avoid needless costs.

Why Your Data Infrastructure Migration Project Will Fail (And How to Succeed)

SeattleDataGuy’s Newsletter • 376 implied HN points • 12 Feb 25

🕹 Technology Data Engineering

Having a clear plan is crucial for successful data migration projects. You need to know what to move and in what order to avoid chaos.
Ownership of the migration process is important. There should be a clear leader or team responsible to keep everything on track.
Testing data after migration is a must. Just moving the data doesn't guarantee that it works the same way, so check for any discrepancies.
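
One way to check for discrepancies after a migration is to compare row counts and an order-independent checksum of the source and target tables. The sketch below is a minimal version of that idea, assuming both tables fit in memory as lists of dicts; the function names are hypothetical and the approach is not taken from the linked post.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) where the checksum ignores row order."""
    total = 0
    for row in rows:
        # Sort columns by name so column order does not affect the hash.
        canonical = repr(sorted(row.items())).encode("utf-8")
        total += int.from_bytes(hashlib.sha256(canonical).digest()[:8], "big")
    return len(rows), total % (1 << 64)


def find_discrepancies(source_rows, target_rows):
    """List human-readable differences between source and target tables."""
    problems = []
    src_count, src_sum = table_fingerprint(source_rows)
    tgt_count, tgt_sum = table_fingerprint(target_rows)
    if src_count != tgt_count:
        problems.append(f"row count differs: {src_count} vs {tgt_count}")
    if src_sum != tgt_sum:
        problems.append("content checksum differs")
    return problems
```

An empty result means the counts and checksums agree. Because rows are hashed from their `repr`, a value that changes type during migration (for example `10` becoming `10.0`) is reported as a discrepancy, which may or may not be what you want.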

Kickstart Your Data Engineering Career: 9 Resources and Templates You Need

SeattleDataGuy’s Newsletter • 541 implied HN points • 14 Nov 24

🚌 Education Data Engineering

Use the 100-Day Data Engineering Crash Course to start learning the basics of data engineering. It covers important topics like SQL, programming, and Cloud technologies.
Creating your own data projects can help you stand out. The Data Engineering Side Project Idea Template will guide you in planning unique projects that add value.
Prepare well before job interviews with the Data Engineer Interview Study Guide. Always check with the recruiter about what to study so you can be ready.

Data Science Weekly - Issue 508

Data Science Weekly Newsletter • 379 implied HN points • 18 Aug 23

🕹 Technology Data Engineering

Writing clear and effective research papers is essential, and there are tips specifically for NLP papers that can help improve your writing skills.
The job market for data-related roles has changed over the years, and analyzing hiring trends can provide insights into what skills and positions are in demand.
Understanding AI hardware is important because it forms the backbone of many AI models, and knowing how it works can help in making better tech decisions.