The hottest ETL Substack posts right now

And their main takeaways
Minimal Modeling 304 implied HN points 15 Mar 26
  1. Treat queries as functions and start by defining anchors: maintain a compact one‑column list of unique IDs for each entity and document retention/archive rules so input data quality is clear.
  2. Represent attributes and links as clean two‑column datasets (anchor ID + value or anchor ID + anchor ID), filter out NULLs and sentinel values, canonicalize values, use only atomic types, and ensure uniqueness.
  3. Materialize those compact datasets and keep them updated with a pipeline so your data is correct by construction; from these trusted pieces you can build flat tables while avoiding common issues like duplicates, unclear identity, and messy JSON.
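As a rough illustration of the anchor and attribute datasets described above, here is a minimal Python sketch. The function names, the sentinel list, and the "keep the first value seen" rule for uniqueness are assumptions made for this example, not something the post specifies.

```python
from typing import Iterable, Optional

# Hypothetical sentinel values (compared after lowercasing); real ones
# depend on the source system.
SENTINELS = {"", "n/a", "unknown"}


def build_anchor(raw_ids: Iterable[Optional[int]]) -> list[int]:
    """One-column dataset: sorted, unique, non-NULL entity IDs."""
    return sorted({i for i in raw_ids if i is not None})


def build_attribute(
    rows: Iterable[tuple[Optional[int], Optional[str]]],
    anchor: Iterable[int],
) -> list[tuple[int, str]]:
    """Two-column dataset (anchor ID, value).

    Drops NULLs, sentinel values, and IDs missing from the anchor,
    canonicalizes values (trim + lowercase), and keeps one value per ID.
    """
    known = set(anchor)
    result: dict[int, str] = {}
    for entity_id, value in rows:
        if entity_id not in known or value is None:
            continue
        canonical = value.strip().lower()
        if canonical in SENTINELS:
            continue
        # Assumption: the first value seen wins when an ID repeats.
        result.setdefault(entity_id, canonical)
    return sorted(result.items())


users = build_anchor([3, 1, None, 3, 2])
emails = build_attribute(
    [(1, " A@X.com "), (2, "N/A"), (3, None), (1, "b@x.com"), (9, "z@x.com")],
    users,
)
```

Here `users` is the anchor and `emails` is one attribute dataset; a flat table would then be assembled by joining such attribute datasets back onto the anchor.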
SeattleDataGuy’s Newsletter 906 implied HN points 23 Feb 26
  1. Backfills are an unavoidable part of data work — you need them when source data is corrected, pipelines have bugs, or schemas and logic change.
  2. They’re hated because they can be expensive, slow, and risky at scale, can disrupt downstream users, and erode stakeholder trust when numbers shift unexpectedly.
  3. Design for safe backfills by building parameterized, rerunnable pipelines, adding strong data quality checks, communicating changes clearly, and using table-swaps or other strategies when partitions or immutable storage formats make in-place fixes risky.
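One way to picture a parameterized, rerunnable pipeline is a job that recomputes a single date partition and overwrites it wholesale, so running it twice gives the same result. The sketch below uses in-memory dicts in place of real tables; the names `run_partition`, `backfill`, and the `amount_cents` field are hypothetical and only serve the illustration.

```python
from datetime import date, timedelta


def transform(rows: list[dict]) -> list[dict]:
    """Example business logic: convert cents to dollars."""
    return [
        {"order_id": r["order_id"], "amount_usd": r["amount_cents"] / 100}
        for r in rows
    ]


def run_partition(source: dict, target: dict, day: date) -> None:
    """Recompute one day's partition and replace it in full.

    Overwriting (rather than appending) is what makes reruns safe.
    """
    target[day] = transform(source.get(day, []))


def backfill(source: dict, target: dict, start: date, end: date) -> None:
    """Rerun every daily partition in the inclusive range [start, end]."""
    day = start
    while day <= end:
        run_partition(source, target, day)
        day += timedelta(days=1)


source = {
    date(2026, 1, 1): [{"order_id": 1, "amount_cents": 2000}],
    date(2026, 1, 2): [{"order_id": 2, "amount_cents": 550}],
}
target: dict = {}
backfill(source, target, date(2026, 1, 1), date(2026, 1, 2))
```

Because each run replaces the partition instead of appending to it, a second `backfill` over the same range leaves `target` unchanged.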
The Orchestra Data Leadership Newsletter 79 implied HN points 25 Feb 24
  1. ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) have been key data engineering paradigms, but with the rise of the cloud, the need for in-transit data transformation has decreased.
  2. Fivetran, a widely known data company, is potentially shifting back to ETL methods by offering pre-built transformation features, effectively simplifying the data modeling process for users.
  3. This may signal a resurgence of ETL practices in the data industry, with companies like Fivetran leading the way by providing ETL-like services within their platforms.
The Orchestra Data Leadership Newsletter 39 implied HN points 09 Jan 24
  1. The article discusses building a data release pipeline to analyze HubSpot data using Coalesce, a no-code ELT tool on Snowflake.
  2. One key issue was HubSpot's data model, which made it hard to consolidate form fill data and messages into a meaningful view.
  3. Setting up Coalesce involves defining storage mappings, granting access to Coalesce users, and carefully handling environments to prevent data overwriting when working between development and production.
The Orchestra Data Leadership Newsletter 39 implied HN points 30 Dec 23
  1. Data teams are increasingly turning to low-code solutions to streamline data release pipelines, utilizing tools like Airflow but questioning the need for extensive code writing and infrastructure maintenance.
  2. The complex cloud environment has led to the development of specialized data tools, making the orchestration of data pipelines challenging and highlighting the importance of governance, data quality, and scalability.
  3. No-code solutions like dbt core and Hightouch are already integrated into many data tools, simplifying the orchestration process and indicating that the future of data architecture might involve a combination of workflow orchestrators and efficient data quality checks.
davidj.substack 47 implied HN points 23 Feb 24
  1. Real-time data streaming from databases like MySQL to data warehouses such as Snowflake can significantly reduce analytics latency, making data processing faster and more efficient.
  2. Streamkap offers a cost-effective solution for streaming ETL, promising to be both faster and more affordable than traditional methods like Fivetran, providing a valuable option for data professionals.
  3. Implementing Streamkap in a data architecture can reduce data update lag to under 5 minutes and deliver real-time analytics to customers.
Pedram's Data Based 14 implied HN points 04 Apr 23
  1. The tech industry is experiencing pushback against the complexity of recent years.
  2. The gap between the cost of data solutions and their value is increasing, leading to a re-examination of needs.
  3. Exploring old ways of solving data problems before the abundance of vendors can provide valuable insights for building more robust systems.
realkinetic 0 implied HN points 15 Oct 20
  1. AWS Glue is a managed service for building ETL jobs on AWS, eliminating the need to manage server infrastructure and making it easy to implement analytics pipelines.
  2. Automating the deployment process of Glue jobs with a CI/CD pipeline, using tools like GitHub Actions, can streamline the workflow and ensure continuous deployment of ETL processes.
  3. Using GitHub Actions, you can convert Jupyter notebooks to Python scripts, upload them to S3, update Glue jobs, and configure AWS CLI for deployment, making the process efficient and scalable.
Tributary Data 0 implied HN points 29 Sep 22
  1. Stateful stream processors and streaming databases have different approaches in handling data ingestion and state persistence.
  2. Stream processors require knowing and embedding state manipulation logic in advance, while streaming databases offer ad-hoc manipulation by consumers.
  3. Stream processors are ideal for automated, machine-driven decision-making, while streaming databases cater to human decision-makers needing fast, ad-hoc data access.
Expand Mapping with Mike Morrow 0 implied HN points 27 Feb 26
  1. A warehouse migration is a multi-step project where tasks range from easy to very hard. Some small changes like updating BI connections are quick, but others need significant effort.
  2. Medium-effort work like schema mapping, one-time backfills, and reconfiguring pipelines is necessary and requires careful data validation. These steps are manageable but time-consuming.
  3. The hardest parts are deciding what data to keep, rewriting transformations, running both warehouses in parallel, and recreating access controls. Those areas carry the most risk and will dominate the timeline.
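For the data validation and parallel-run steps mentioned above, a common check is to compare a row count and an order-independent checksum for the same table in both warehouses. The sketch below is only an illustration: `table_fingerprint` is a hypothetical helper, rows are plain Python tuples, and a real migration would usually compute such checks inside each warehouse rather than in client code.

```python
import hashlib
from typing import Iterable


def table_fingerprint(rows: Iterable[tuple]) -> tuple[int, int]:
    """Return (row count, order-independent checksum) for a set of rows.

    Each row is hashed on its own and the hashes are summed modulo 2**64,
    so the result does not depend on the order rows are read in.
    """
    count = 0
    checksum = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % 2**64
        count += 1
    return count, checksum


old_rows = [(1, "a"), (2, "b")]
new_rows = [(2, "b"), (1, "a")]      # same data, different order
drifted_rows = [(1, "a"), (2, "B")]  # one value changed

same = table_fingerprint(old_rows) == table_fingerprint(new_rows)
differs = table_fingerprint(old_rows) != table_fingerprint(drifted_rows)
```

This assumes both warehouses return values with identical types and formatting; differences such as timestamp precision or decimal scale would need to be normalized before hashing.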