The hottest Data Engineering Substack posts right now

And their main takeaways

Data For AI

Gradient Flow • 59 implied HN points • 31 Mar 22

🕹 Technology Data Engineering

Data engineering and data infrastructure are foundational for AI and machine learning success. Businesses need to focus on data integration to scale their use of AI and machine learning.
New tools and frameworks like DoWhy for causal inference and the AI Risk Management Framework from NIST are shaping how we manage AI risks and explore causal learning.
State-of-the-art AI systems require additional training data to achieve top-notch results across various benchmarks. Additional data is crucial for enhancing AI performance.

Open Source has my Whole Heart

Sung’s Substack • 3 HN points • 08 May 24

🕹 Technology Data Engineering

Open source is a beautiful pursuit that allows people to solve problems they love while connecting with others.
Career paths can evolve, leading to new opportunities and self-discovery in pursuing work that aligns with personal values and passions.
Improvements in data tools and workflows, like understanding SQL deeply and prioritizing statefulness, can revolutionize data work and make processes more intuitive and efficient.

Once more unto the breach...

davidj.substack • 71 implied HN points • 01 Mar 23

🕹 Technology Data Engineering

Automating simple data requests can save time and effort in data analysis workflows.
Effective self-serve BI tools require reliable data pipelines, transformation processes, and a semantic layer.
Text-to-SQL tools can improve automation but may require caution due to potential errors and lack of trust in results.

It's still worth running a small server in 2023

Counting Stuff • 54 implied HN points • 11 Jul 23

🕹 Technology Data Engineering

Running a small server builds practical skills and an appreciation for the work of Ops and SRE professionals.
Consider the value of running a small server for hosting personal projects like a homepage or resume.
Exploring web-based RSS apps can help manage information overload and stay updated with blogs and newsletters.

Memphis: Building Stream Processing that Doesn't Suck (aka a Delightful Dev First Experience)...Welcome to the Boldstart Family!

Software Snack Bites • 50 implied HN points • 28 Jun 23

🕹 Technology Data Engineering

Memphis provides a better developer experience for stream processing.
Memphis is designed for quick setup, cost efficiency, and user-friendly monitoring.
Memphis is a platform of choice for companies looking to replace or enhance their streaming platforms.


Data Science Weekly - Issue 472

Data Science Weekly Newsletter • 19 implied HN points • 08 Dec 22

🕹 Technology Data Engineering

Machine learning can unintentionally develop biases from training data, which is important to detect and fix, especially in critical areas like healthcare and self-driving cars.
Google Sheets now offers a way to use machine learning without coding skills, making it accessible for everyone to perform simple data tasks like predicting values and identifying anomalies.
There is a trend in tech companies to make machine learning processes happen in real-time, which can lead to faster and more efficient data insights.

Data Science Weekly - Issue 470

Data Science Weekly Newsletter • 19 implied HN points • 24 Nov 22

🕹 Technology Data Engineering

Using recommender systems can lead to problems like clickbait and addiction if they're only focused on engagement. We need to think differently to create better systems that really serve people's needs.
GitLab has a detailed Data Team Handbook that explains how their data team works, what data is available, and how it helps different departments make decisions. This can guide other teams looking to improve their data processes.
Deep learning techniques are being researched to playtest video games like Candy Crush. This shows how AI can create more human-like testing methods and improve the gaming experience.

Data Science Weekly - Issue 455

Data Science Weekly Newsletter • 19 implied HN points • 11 Aug 22

🕹 Technology Data Engineering

Data professionals spend a lot of time checking data quality, which costs companies a lot of money every year. Poor data quality can affect a company's revenue significantly.
Understanding how AI models behave is important for data scientists. They need to develop good mental models to train and work effectively with these systems.
Vector search is becoming popular in retail for improving various aspects like revenue and customer satisfaction. It helps teams make better use of their data.

Data Science Weekly - Issue 452

Data Science Weekly Newsletter • 19 implied HN points • 21 Jul 22

🕹 Technology Data Engineering

The role of data scientist remains popular and well-paid, with growth expected in the field by 2029.
Large language models (LLMs) are rapidly evolving and are becoming integral to various applications in our daily lives.
Many industries are seeing the rise of domain experts who can now create and work with deep learning models without needing advanced degrees.

Data Science Weekly - Issue 450

Data Science Weekly Newsletter • 19 implied HN points • 07 Jul 22

🕹 Technology Data Engineering

AI forecasting contests help predict future progress and improve forecasting skills. It’s important to evaluate predictions against actual outcomes to see how accurate forecasters are.
Analytics engineering has become a popular job choice, shifting from being less desired to highly sought after. This change reflects the growing need for skilled professionals in data analytics.
High-quality machine translation is now possible for low-resource languages through models like NLLB-200. This will make information more accessible to speakers of these languages worldwide.

Data Science Weekly - Issue 448

Data Science Weekly Newsletter • 19 implied HN points • 23 Jun 22

🕹 Technology Data Engineering

Machine learning can help the IRS process a huge amount of tax data more efficiently, improving enforcement actions on tax compliance.
Denoising Diffusion Probabilistic Models are showing great success in generating images and audio, making them popular in creative AI applications like DALL-E 2.
Training and developing skills in SQL can greatly enhance your data handling abilities, leading to better opportunities in data analysis and engineering.

Prompt-Based Feature Engineering Part 1: Generative AI Generates Data

nick’s datastack • 1 HN point • 24 Apr 24

🕹 Technology Data Engineering

Generative AI can generate data, impacting workflows and pipelines significantly.
Using LLMs for prompt-based feature engineering can save time and effort compared to traditional methods like manual data searching and merging.
While LLMs in data pipelines may feel magical, it's important to be cautious of potential inaccuracies due to the probabilistic nature of AI outputs.

Data Science Weekly - Issue 438

Data Science Weekly Newsletter • 19 implied HN points • 14 Apr 22

🕹 Technology Data Engineering

The Modern Data Stack is becoming crucial for handling data, with many tools available to improve the way businesses work with data. It helps users understand how to start using these tools effectively.
DeepMind's AlphaFold is revolutionizing biology by accurately predicting protein shapes. This technology is changing how researchers approach biological problems.
There are better ways to visualize SQL joins than using Venn diagrams. New methods like the checkered flag diagram can make understanding joins easier and clearer.

Data Science Weekly - Issue 436

Data Science Weekly Newsletter • 19 implied HN points • 31 Mar 22

🕹 Technology Data Engineering

Aggregating data can hide important details and context. It's better to focus on specific aspects of the data to find deeper insights.
Waymo is testing fully autonomous vehicles in San Francisco. This effort aims to integrate self-driving technology into everyday life for its employees.
AI can help improve representation on platforms like Wikipedia. A new approach is being developed to ensure more diverse biographies are created.

Data Science Weekly - Issue 427

Data Science Weekly Newsletter • 19 implied HN points • 27 Jan 22

🕹 Technology Data Engineering

Using offline replay experimentation can help predict results faster, cutting down the time usually needed for online experiments.
Bad data can seriously affect business operations, and understanding how it breaks is crucial for fixing dashboards and reports.
Shapley values can explain machine learning models by distributing how each feature contributes to predictions, making the model's decisions clearer.
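
To make the Shapley takeaway concrete, here is a minimal sketch that computes exact Shapley values for a toy three-feature model by averaging each feature's marginal contribution over every feature ordering. The model `toy_model`, the feature names, and the all-zero baseline are hypothetical and do not come from the newsletter issue; enumerating all orderings is only practical for a handful of features.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values by enumerating all feature orderings."""
    features = list(instance)
    totals = {name: 0.0 for name in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)              # start from the baseline input
        previous = predict(current)
        for name in order:
            current[name] = instance[name]    # switch this feature to its real value
            value = predict(current)
            totals[name] += value - previous  # marginal contribution in this ordering
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

def toy_model(x):
    # Hypothetical model: two linear terms plus one interaction term
    return 2.0 * x["age"] + 1.0 * x["income"] + 0.5 * x["age"] * x["tenure"]

instance = {"age": 3.0, "income": 4.0, "tenure": 2.0}
baseline = {"age": 0.0, "income": 0.0, "tenure": 0.0}
phi = shapley_values(toy_model, instance, baseline)
```

The contributions sum to the difference between the prediction for `instance` and the prediction for `baseline`, and the interaction term is split evenly between `age` and `tenure`.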

Gradient Flow #41: What’s New in Data Engineering; MLOps Anti-Patterns

Gradient Flow • 19 implied HN points • 12 Aug 21

🕹 Technology Data Engineering

The podcast discusses changes in the data science role and tools, along with insights on new data engineering trends.
An overview of new developments in tools and infrastructure, including a chatbot, recommendation system, and MLOps anti-patterns to avoid mistakes.
Recommendations cover topics like the evolution of PyTorch, guidelines for open datasets stewardship, and insights into the analytical application stack.

Most Data Engineers are Mid

Data Engineering Central • 2 HN points • 19 Jul 23

🕹 Technology Data Engineering

Most Data Engineers tend to be at a mid-level in their career.
Key characteristics to avoid in Data Engineering include never acting, poor quality work, and a lack of continuous learning.
Becoming a Lone Wolf in data engineering can hinder team cohesion and overall efficiency.

Data Science Weekly - Issue 362

Data Science Weekly Newsletter • 19 implied HN points • 29 Oct 20

🕹 Technology Data Engineering

Form extraction using AI can help important fields like journalism and medicine by accurately pulling data from documents. This can significantly improve research and decision-making.
Data engineering is crucial and involves gathering, cleaning, and shaping data before it's analyzed. It's just as important as data science, which builds on that data to create insights and models.
Dealing with data imbalance can be tricky, but using semi-supervised and self-supervised learning techniques can improve model performance. These methods help when some categories have much less data than others.

Open Models, Smarter Math, and Negotiation LLMs

ppdispatch • 2 implied HN points • 03 Jan 25

🕹 Technology Data Engineering

Yi is a new set of open foundation models that can handle many tasks involving text and images. They have been carefully designed to improve performance through better training.
Researchers found that some AI models think too much for simple math problems. A new method can help these models solve problems faster and more efficiently.
AgreeMate is a smart AI tool that teaches models how to negotiate prices like humans. It helps them use strategies to get better deals.

Why Data Quality Is More Important Than Ever in an AI-Driven World

Data Products • 5 implied HN points • 08 Jan 24

🕹 Technology Data Engineering

Data quality is crucial for machine learning projects, and poor data quality can harm both society and individuals.
Advances in Generative AI highlight the importance of high-quality data and the potential shortage of such data.
Data quality affects the machine learning product development cycle, including ongoing maintenance costs of ML pipelines.

The Data Quality Resolution Process

Data Products • 3 implied HN points • 11 Dec 23

🕹 Technology Data Engineering

Stakeholders surface data quality issues, and managers must stay responsive without burning out.
Prioritize issues by determining urgency, impact, and potential solutions.
Communicate clearly with technical stakeholders, implement fixes cautiously, and maintain trust through thorough communication.

The Consumer-Defined Data Contract

Data Products • 3 implied HN points • 04 Dec 23

🕹 Technology Data Engineering

Producers need to move towards consumer-defined data contracts to improve data quality and alignment with user needs.
A phased approach of awareness, collaboration, and contract ownership helps in successful data contract adoption.
Starting with consumer-defined contracts drives communication, awareness, and problem visibility, leading to long-term benefits.

Data Contracts Book - Early Release

Data Products • 2 implied HN points • 27 Feb 24

🕹 Technology Data Engineering

Chad Sanderson announced an upcoming book on Data Contracts with O'Reilly, covering topics like what data contracts are, how they work, implementation, examples, and the future implications. The book will delve into Data Quality and Governance.
The first two chapters of the book are available for free on the O'Reilly website. They cover the importance of data contracts and the real goals of data quality initiatives, totaling about 45 pages of content.
Chad Sanderson is currently selecting technical reviewers for the book. Interested individuals can reach out to him to share their thoughts on an advance copy.

How to build your LLM app

ingest this! • 2 HN points • 07 Feb 24

🕹 Technology Data Engineering

Learn about the architecture of LLM applications.
Watch an introduction to Large Language Models by Andrej Karpathy.
Use Label Studio for enhancing data labelling with a simple UI.

Data Science Weekly - Issue 131

Data Science Weekly Newsletter • 19 implied HN points • 26 May 16

🕹 Technology Data Engineering

Artificial neural networks are being trained to reconstruct films by analyzing individual frames, which is a fun way to push the boundaries of AI. It's like teaching computers to understand and recreate stories visually.
Instead of programming computers in the traditional way, future advancements suggest we will train AI more like we train pets, making it more intuitive and interactive. This could change how we interact with technology.
There are tons of resources available for both beginners and experts in data science, from learning Python to understanding deep learning setups, making it easier for anyone to get started. Knowing where to look can help you dive into this field effectively.

Build data apps with markdown and SQL

ingest this! • 1 HN point • 19 Feb 24

🕹 Technology Data Engineering

Build data apps using markdown and SQL with the Evidence framework, which offers a way to create polished data products.
Explore the future synergy of knowledge graphs and large language models (LLMs) for enhanced technologies.
Engage with the latest in data engineering by checking out a full exploration of the open-source data engineering landscape for 2024.

Data is not a Microservice

Data Products • 1 HN point • 07 Jul 23

🕹 Technology Data Engineering

Data requires a source of truth that microservices cannot inherently provide without a shift in software engineering practices.
Not all data is equally valuable, so treating all data as microservices can be costly and restrictive.
The data development lifecycle differs from software development, requiring flexibility, reuse, and tight coupling that conflict with typical microservices architecture.

Writing unit tests for SQL queries

Reflective Software Engineering • 0 implied HN points • 12 Jan 24

🕹 Technology Data Engineering

Having unit tests for SQL queries can help catch bugs introduced during code refactorings or changes.
When writing unit tests for SQL queries, focus on testing the specific parts responsible for building the query rather than the entire method.
Refactoring code for testability can involve moving pure functions outside of the class for easier testing and simplifying methods to focus on specific tasks.
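
As an illustration of the approach in these takeaways, the sketch below keeps query construction in a pure function outside any class and tests only that function, with no database involved. `build_orders_query` and the table and column names are hypothetical and are not code from the post.

```python
def build_orders_query(status=None, limit=None):
    """Pure function: returns SQL text and bound parameters, touches no database."""
    sql = "SELECT id, total FROM orders"
    params = []
    if status is not None:
        sql += " WHERE status = ?"
        params.append(status)
    sql += " ORDER BY id"
    if limit is not None:
        sql += " LIMIT ?"
        params.append(limit)
    return sql, params

def test_build_orders_query_with_filters():
    sql, params = build_orders_query(status="paid", limit=10)
    assert sql == "SELECT id, total FROM orders WHERE status = ? ORDER BY id LIMIT ?"
    assert params == ["paid", 10]

def test_build_orders_query_without_filters():
    sql, params = build_orders_query()
    assert sql == "SELECT id, total FROM orders ORDER BY id"
    assert params == []

# Run the tests directly; a test runner such as pytest would discover them too.
test_build_orders_query_with_filters()
test_build_orders_query_without_filters()
```

Because the function only builds strings and parameter lists, a refactoring that changes clause order or drops a filter fails these tests immediately.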

Test-driven data engineering

Reflective Software Engineering • 0 implied HN points • 30 Dec 23

🕹 Technology Data Engineering

Test-driven development (TDD) is a valuable tool for ensuring software quality and driving great software design.
Testing data integrations and clients, especially in complex data platforms, can be challenging due to less control over underlying databases. Strategies like mocking HTTP interactions can help in testing.
Separating concerns and creating small, testable units of code can enhance confidence in the system, reduce fear of regression, and improve overall software quality.
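
One way to picture the mocking strategy mentioned above is to inject the HTTP call as a dependency and replace it with a mock in the test. This is a sketch under assumptions: `fetch_row_count`, the `/tables/.../stats` path, and the response shape are hypothetical and do not describe any real data platform's API.

```python
import json
from unittest.mock import Mock

def fetch_row_count(http_get, table):
    """Ask a (hypothetical) data platform API for a table's row count."""
    body = http_get(f"/tables/{table}/stats")
    return json.loads(body)["row_count"]

# In the test, the real HTTP client is replaced by a mock that returns canned JSON.
fake_get = Mock(return_value='{"row_count": 42}')
count = fetch_row_count(fake_get, "orders")

# Verify both the parsed result and the request the client would have sent.
assert count == 42
fake_get.assert_called_once_with("/tables/orders/stats")
```

Passing `http_get` in as an argument keeps the parsing logic a small, testable unit that never needs a live service.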

Rust for Data Engineering

ingest this! • 0 implied HN points • 12 Mar 24

🕹 Technology Data Engineering

Rust is reshaping data engineering by offering performance, safety, and concurrency, making it a strong contender alongside languages like Python.
Learning Rust through 'The Rust Programming Language' book provides a solid foundation, with hands-on projects to enhance understanding.
Mathesar is an open-source tool providing a spreadsheet-like interface to PostgreSQL databases, making data collaboration easier and more accessible.

The Significance of In-Broker Data Transformations in Streaming Data

Tributary Data • 0 implied HN points • 28 Aug 23

🕹 Technology Data Engineering

Data scrubbing in streaming data pipelines is essential for cleaning and processing data in real-time to ensure it's ready for consumption.
In-broker data transformations powered by WebAssembly (Wasm) are revolutionizing how data processing tasks are handled in streaming data platforms, reducing dependency on external systems.
WebAssembly (Wasm) provides developers with flexibility, performance, security, and portability benefits for server-side processing in frameworks like Redpanda Data Transforms, streamlining data processing tasks within brokers.

Making Data; Three Data Point Thursday #87

Three Data Point Thursday • 0 implied HN points • 23 Mar 23

🕹 Technology Data Engineering

Remember the importance of 'making data' in addition to making data useful.
Be cautious about blindly adopting BSL licenses and consider the implications.
Consider the efficiency and flexibility of doing computing tasks on top of your data store, rather than within external tools.

Why you need to ditch dbt for SQLMesh today

Three Data Point Thursday • 0 implied HN points • 15 Jun 23

🕹 Technology Data Engineering

Building products with LLMs is challenging and requires addressing multiple issues.
PandasAI offers AI-powered features for data analysis, focusing on integrating LLMs smartly into products.
Consider switching to SQLMesh from dbt, especially if you are a data engineer or data scientist needing a more developer-focused analytics tool.

Simplifying and Streamlining Python Dashboards with ChatGPT

Data at Depth • 0 implied HN points • 26 Apr 23

🕹 Technology Data Engineering

Using ChatGPT-4 with specific instructions can save time on code research, debugging, and commenting.
It enables you to focus more on the larger solution rather than getting bogged down by smaller issues.
The article delves into how prompt engineering can streamline data visualization tasks.

The Unstructured Data Funnel

The Orchestra Data Leadership Newsletter • 0 implied HN points • 15 Dec 23

🕹 Technology Data Engineering

Unstructured data, like text documents and deeply nested JSON, is a crucial component in data processing for large cloud vendors like Snowflake and Databricks. The location where unstructured data is processed within the data pipeline greatly impacts the compute costs and revenue for these companies.
Processing unstructured data involves a series of stages, from data movement to storage in object storage, then to structured data warehouses. Each stage of this 'funnel' affects computational requirements and costs, with the most logical point for processing unstructured data being at the object storage level.
The final step in the data funnel, data activation, involves the least computational demands as it deals with cleaned and aggregated data ready for analytical applications. Thinking strategically about the processing location of unstructured data can help optimize costs and efficiency in data workflows.

Integrating the Airbyte Server with Orchestra

The Orchestra Data Leadership Newsletter • 0 implied HN points • 05 Dec 23

🕹 Technology Data Engineering

The ETLP paradigm integrates Airbyte with dbt and Orchestra for quick end-to-end data pipelines without coding.
Using a fully managed deployment approach with tools like Airbyte, dbt, and Orchestra can save time and effort compared to self-managed solutions.
For a data product with 10GB data, costs for Airbyte, dbt, and Orchestra would be around $2400 monthly, potentially more cost-effective than hosting or developer time.

Understanding your Big Data problem: Transforming Big Data with Incremental Models

The Orchestra Data Leadership Newsletter • 0 implied HN points • 31 Oct 23

🕹 Technology Data Engineering

Understanding the importance of incremental models for managing big data is crucial to efficiently running complex queries and maintaining data quality.
Design patterns in data modeling, such as Star Schema and Data Vault, play a significant role in how dbt models are structured and managed.
Using Jinja templating and implementing continuous data integration processes are key elements in handling big models effectively and ensuring data reliability.
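
The general idea behind an incremental model can be sketched as follows: each run processes only the rows that arrived after the target table's high-water mark instead of rebuilding the whole table. This example uses Python's built-in `sqlite3` rather than dbt, and the table names, the `loaded_at` column, and `run_incremental` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (id INTEGER PRIMARY KEY, loaded_at INTEGER, amount REAL);
    CREATE TABLE events_clean (id INTEGER PRIMARY KEY, loaded_at INTEGER, amount REAL);
""")

def run_incremental(conn):
    """Copy only rows newer than the target's high-water mark; return rows added."""
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(loaded_at), -1) FROM events_clean"
    ).fetchone()
    cursor = conn.execute(
        "INSERT INTO events_clean SELECT id, loaded_at, amount "
        "FROM raw_events WHERE loaded_at > ?",
        (watermark,),
    )
    conn.commit()
    return cursor.rowcount

# First run: the target is empty, so both source rows are loaded.
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)", [(1, 100, 9.5), (2, 101, 3.0)]
)
first_run = run_incremental(conn)

# Second run: only the newly arrived row is processed.
conn.execute("INSERT INTO raw_events VALUES (3, 102, 7.25)")
second_run = run_incremental(conn)
```

A strict `>` comparison on the watermark assumes no late-arriving rows share an already-loaded `loaded_at` value; a real pipeline would need to decide how to handle that case.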

Lessons We Learned While Building A Stateful Kafka Connector

Bytewax • 0 implied HN points • 20 Apr 23

🕹 Technology Data Engineering

Writing a custom input connector for Bytewax involves answering important questions related to partitions, source building, and resuming state.
Utilizing Bytewax's recovery system for failure recovery requires proper snapshotting and an understanding of how to resume reading from a specific spot.
Delivery guarantees in Bytewax are at-least-once by default, and ensuring exactly-once processing may require coordination with the output connector.
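
The interaction between snapshots, resuming, and at-least-once delivery can be illustrated with a plain-Python simulation. It deliberately does not use Bytewax's API: `replay_from`, `IdempotentSink`, and the offsets are hypothetical stand-ins showing why a sink that tolerates duplicates yields effectively exactly-once results.

```python
def replay_from(events, resume_offset):
    """Simulate a source that resumes from the last snapshotted offset."""
    return [(offset, value) for offset, value in events if offset >= resume_offset]

class IdempotentSink:
    """Hypothetical sink that ignores offsets it has already written."""
    def __init__(self):
        self.rows = {}

    def write(self, offset, value):
        self.rows.setdefault(offset, value)  # duplicate deliveries are no-ops

events = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
sink = IdempotentSink()

# First attempt: offsets 0..2 reach the sink, but the process crashes
# before any snapshot later than offset 1 is taken.
for offset, value in events[:3]:
    sink.write(offset, value)
last_snapshot_offset = 1

# After restart the source resumes at the snapshot, so offsets 1 and 2
# are delivered a second time (at-least-once).
redelivered = replay_from(events, last_snapshot_offset)
for offset, value in redelivered:
    sink.write(offset, value)
```

Offsets 1 and 2 are delivered twice, yet the sink ends up with each offset exactly once because writes are keyed by offset.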
