The hottest Data Pipelines Substack posts right now

And their main takeaways
The Orchestra Data Leadership Newsletter • 79 implied HN points • 23 Apr 24
  1. Alerting and governance are crucial to the success of Data and AI initiatives, as underscored by the high failure rate of AI projects and the many Data Science projects that never make it to production.
  2. Building trust between Data Teams and Business Stakeholders is essential, and alerting plays a key role by keeping communication and collaboration effective when data pipelines fail.
  3. Effective alerting systems should be proactive, asset-based, and granular, so issues are detected and communicated quickly and trust in Data and AI products is maintained (a minimal sketch of asset-level alerting follows below).
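A minimal sketch of what granular, asset-level alerting can look like in practice, assuming a Slack incoming-webhook URL and hypothetical asset and run identifiers; the post does not prescribe a specific implementation.

```python
import requests  # any HTTP client works; requests is assumed here


def alert_asset_failure(webhook_url: str, asset: str, run_id: str, error: str) -> None:
    """Post a granular, asset-level alert so stakeholders see exactly
    which data asset failed and in which pipeline run."""
    message = (
        f":rotating_light: Data asset `{asset}` failed in run `{run_id}`.\n"
        f"Error: {error}\n"
        "Downstream dashboards depending on this asset may be stale."
    )
    # Slack incoming webhooks accept a simple JSON payload with a "text" field.
    requests.post(webhook_url, json={"text": message}, timeout=10)


# Hypothetical usage inside a pipeline task's error handler:
# alert_asset_failure(SLACK_WEBHOOK_URL, "orders_daily", "2024-04-23T06:00", str(exc))
```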
Data Engineering Central • 157 implied HN points • 24 Apr 23
  1. Brittle data pipelines lead to problems such as poor data quality, difficult debugging, and slow performance.
  2. To overcome brittle pipelines, address data quality issues through monitoring, sanity checks, and tools like Great Expectations (see the sketch below).
  3. Development issues such as missing tests, poor documentation, and bad code practices also contribute to brittle pipelines; best practices like unit testing and Docker help improve pipeline reliability.
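A hedged sketch of the sanity-check idea using Great Expectations, assuming the legacy pandas API (ge.from_pandas); newer GX releases use a different fluent API, and the column names here are hypothetical.

```python
import pandas as pd
import great_expectations as ge  # legacy (pre-1.0) pandas API assumed


def sanity_check(df: pd.DataFrame) -> bool:
    """Run a few defensive checks before letting data flow downstream."""
    ge_df = ge.from_pandas(df)
    results = [
        ge_df.expect_column_values_to_not_be_null("order_id"),
        ge_df.expect_column_values_to_be_between("amount", min_value=0),
        ge_df.expect_table_row_count_to_be_between(min_value=1),
    ]
    # Only let the pipeline proceed if every expectation passed.
    return all(r.success for r in results)
```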
The Orchestra Data Leadership Newsletter • 39 implied HN points • 09 Jan 24
  1. The article walks through building a data release pipeline to analyze Hubspot data using Coalesce, a no-code ELT tool on Snowflake.
  2. One key issue was Hubspot's data model, which made it hard to consolidate form-fill data and messages into a meaningful view.
  3. Setting up Coalesce involves defining storage mappings, granting access to Coalesce users (an illustrative grant script is sketched below), and handling environments carefully to avoid overwriting data when moving between development and production.
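As one illustration of the access-granting step, here is a hedged sketch that issues Snowflake grants from Python via snowflake-connector-python; the role, warehouse, and database names are hypothetical, and the exact privileges Coalesce needs should be taken from its documentation.

```python
import snowflake.connector  # assumes the snowflake-connector-python package is installed

# Hypothetical object names; substitute your own warehouse, databases, and role.
GRANTS = [
    "GRANT USAGE ON WAREHOUSE TRANSFORM_WH TO ROLE COALESCE_ROLE",
    "GRANT USAGE ON DATABASE ANALYTICS_DEV TO ROLE COALESCE_ROLE",
    "GRANT CREATE SCHEMA ON DATABASE ANALYTICS_DEV TO ROLE COALESCE_ROLE",
]

conn = snowflake.connector.connect(
    account="my_account", user="admin_user", password="***", role="SECURITYADMIN"
)
cur = conn.cursor()
for stmt in GRANTS:
    cur.execute(stmt)  # apply each grant so the Coalesce role can build in the dev database
cur.close()
conn.close()
```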
The Orchestra Data Leadership Newsletter • 19 implied HN points • 05 Apr 24
  1. Generative AI can help Data Engineers summarize vast quantities of structured and unstructured data, expanding the breadth and depth of data available.
  2. Feature engineering with Generative AI involves ingesting unstructured data such as call notes, making API calls, and transforming the results for analysis in existing pipelines (see the sketch below).
  3. Using Generative AI for web scraping helps teams extract information from the internet efficiently, enabling monitoring of new data sources and optimization of business processes.
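A hedged sketch of the "make an API call, then feed the result back into the pipeline" pattern, using the OpenAI Python client; the model name, prompt, and feature schema are illustrative assumptions rather than details from the post.

```python
import json

from openai import OpenAI  # assumes the openai package (v1+ client) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_features(call_notes: str) -> dict:
    """Turn unstructured call notes into structured features for an existing pipeline."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Return JSON with keys: sentiment, churn_risk, topics."},
            {"role": "user", "content": call_notes},
        ],
    )
    # The model's reply is parsed into columns the downstream pipeline already expects.
    return json.loads(response.choices[0].message.content)
```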
Practical Data Engineering Substack • 59 implied HN points • 01 Oct 23
  1. You can improve data accuracy by running two pipelines: one that picks up recent updates quickly and another that regularly reloads the entire dataset. This keeps the data reliable over time.
  2. Pipeline scheduling should be driven by business needs, such as how fresh the data must be; choose faster incremental updates or less frequent full reloads depending on how critical the data is.
  3. Tools like Apache Airflow can organize these pipelines efficiently; generating tasks dynamically from a list makes it easier to handle many data tables (see the sketch below).
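A minimal sketch of the dynamic task generation idea in Apache Airflow; the table list and load function are hypothetical placeholders for the real incremental-load logic.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

TABLES = ["orders", "customers", "invoices"]  # hypothetical list of tables to load


def load_table(table: str) -> None:
    # Placeholder for the real incremental-load logic for one table.
    print(f"loading recent updates for {table}")


with DAG(
    dag_id="incremental_loads",
    start_date=datetime(2023, 10, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # One task per table, generated from the list instead of hand-written.
    for table in TABLES:
        PythonOperator(
            task_id=f"load_{table}",
            python_callable=load_table,
            op_kwargs={"table": table},
        )
```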
Bytewax • 19 implied HN points • 19 Dec 23
  1. One common use case for stream processing is transforming data into a format for different systems or needs.
  2. Bytewax is a Python stream processing framework that allows real-time data processing and customization.
  3. Bytewax enables creating custom connectors for data sources and sinks, making it versatile for various data processing tasks (a minimal dataflow is sketched below).
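A minimal Bytewax sketch of the "transform data into the shape another system expects" use case. It assumes the operator-based API of recent Bytewax releases (roughly 0.18 and later); the input records are made-up, and a real deployment would swap TestingSource and StdOutSink for custom connectors.

```python
import bytewax.operators as op
from bytewax.connectors.stdio import StdOutSink
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource

flow = Dataflow("reshape_events")

# Hypothetical raw events; a custom source would replace TestingSource in production.
events = op.input("events", flow, TestingSource([{"user": "a", "ms": 1200}, {"user": "b", "ms": 450}]))

# Reshape each record into the format the downstream system expects.
reshaped = op.map("to_seconds", events, lambda e: {"user": e["user"], "seconds": e["ms"] / 1000})

op.output("stdout", reshaped, StdOutSink())

# Run with: python -m bytewax.run this_module:flow
```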
Bytes, Data, Action! • 19 implied HN points • 05 Sep 23
  1. Public transit and data pipelines both aim to move things from point A to point B smoothly and quickly.
  2. Issues like delays, lack of visibility, and missed connections can disrupt the experiences of both public transit and data pipelines.
  3. Efficient, transparent, and reliable practices are key to ensuring a smooth journey for both public transit users and data pipelines.
Data People Etc. • 36 HN points • 24 Apr 23
  1. Orchestration is essential and will continue to be important in the future of managing data pipelines.
  2. Orchestration involves coordinating and managing multiple systems and tasks to execute workflows.
  3. Tools like Dagster provide a control plane for managing data assets and metadata, ensuring a structured and cohesive data platform (a minimal asset graph is sketched below).
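A minimal sketch of Dagster's software-defined assets, assuming hypothetical asset names and in-memory data; the point is how declaring dependencies gives the orchestrator a control-plane view of assets and lineage.

```python
from dagster import Definitions, asset


@asset
def raw_orders():
    # Hypothetical extract step; in practice this would pull from a source system.
    return [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": 13.5}]


@asset
def order_totals(raw_orders):
    # Referencing raw_orders by argument name declares the dependency, which is
    # what gives the orchestrator its metadata and lineage view of the platform.
    return sum(row["amount"] for row in raw_orders)


defs = Definitions(assets=[raw_orders, order_totals])
```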
The Orchestra Data Leadership Newsletter • 0 implied HN points • 08 Nov 23
  1. Data pipelines are shifting toward a focus on reliability and efficiency, mirroring established software engineering practices.
  2. Continuous Data Integration and Delivery in data engineering means releasing data into production in response to code changes, ideally through a simple, automated process.
  3. Observability and metadata gathering play a crucial role in ensuring data quality and catching issues before they reach production (a lightweight metadata-capture sketch follows below).
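One lightweight way to start gathering the kind of run metadata the post argues for, sketched here without assuming any particular observability vendor: wrap each pipeline step and record its duration and output row count.

```python
import time
from functools import wraps

run_metadata = []  # in practice this would be shipped to a metadata store or observability tool


def observed(step_name: str):
    """Record duration and output row count for a pipeline step."""

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            started = time.time()
            rows = fn(*args, **kwargs)
            run_metadata.append(
                {
                    "step": step_name,
                    "seconds": round(time.time() - started, 3),
                    "row_count": len(rows),
                }
            )
            return rows

        return wrapper

    return decorator


@observed("load_orders")
def load_orders():
    return [{"order_id": 1}, {"order_id": 2}]  # placeholder extract step
```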
Practical Data Engineering Substack • 0 implied HN points • 26 Aug 23
  1. Managing dependencies between data pipelines is crucial for ensuring that upstream tasks are completed before downstream tasks start. This avoids issues with incomplete or faulty data.
  2. There are different techniques for managing these dependencies, ranging from simple time-based scheduling to orchestration that waits for the successful completion of upstream tasks (one such approach is sketched below).
  3. Choosing the right method for managing pipeline dependencies depends on the complexity of the data workflows and the need for independence between different teams and tasks.
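As an example of the "run only after the upstream pipeline has succeeded" end of that spectrum, here is a hedged Airflow sketch using ExternalTaskSensor; the DAG and task names are hypothetical, and newer Airflow versions also offer dataset-driven scheduling for the same purpose.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor


def build_report():
    print("building report from upstream tables")  # placeholder downstream work


with DAG(
    dag_id="reporting_pipeline",
    start_date=datetime(2023, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Wait until the upstream ingestion DAG's final task has succeeded
    # for the same logical date before starting downstream work.
    wait_for_ingestion = ExternalTaskSensor(
        task_id="wait_for_ingestion",
        external_dag_id="ingestion_pipeline",
        external_task_id="load_complete",
    )

    report = PythonOperator(task_id="build_report", python_callable=build_report)

    wait_for_ingestion >> report
```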