The hottest Data Engineering Substack posts right now

And their main takeaways
Ju Data Engineering Newsletter • 396 implied HN points • 28 Oct 24
  1. Improving the user interface is crucial for more teams to use Iceberg, especially those that use Python for their data work.
  2. PyIceberg, the Python implementation of Iceberg, is evolving quickly and currently supports various catalog and file system types.
  3. While PyIceberg makes it easy to read and write data, it has some limitations, especially compared to using Iceberg with Spark, like handling deletes and managing metadata.
SeattleDataGuy’s Newsletter • 706 implied HN points • 02 Mar 26
  1. Layering tools and roles keeps adding complexity until systems become fractal sprawl that’s costly and hard to maintain.
  2. Buying managed platforms can replace people and speed delivery short-term, but it often buries business logic and makes it harder to connect technical work to business outcomes, so teams tend to add even more layers.
  3. Before adding any new layer, ask what problem it solves, what happens if you don’t add it, and who will own it in six months—if you can’t answer, you’re creating liability instead of leverage.
Minimal Modeling • 304 implied HN points • 15 Mar 26
  1. Treat queries as functions and start by defining anchors: maintain a compact one‑column list of unique IDs for each entity and document retention/archive rules so input data quality is clear.
  2. Represent attributes and links as clean two‑column datasets (anchor ID + value or anchor ID + anchor ID), filter out NULLs and sentinel values, canonicalize values, use only atomic types, and ensure uniqueness.
  3. Materialize those compact datasets and keep them updated with a pipeline so your data is correct by construction; from these trusted pieces you can build flat tables while avoiding common issues like duplicates, unclear identity, and messy JSON.
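The anchor and attribute datasets described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the `raw_users` rows, the `SENTINELS` set, and the helper names `build_anchor` and `build_attribute` are all hypothetical, and canonicalization is assumed to mean trimming and lower-casing.

```python
from typing import Iterable, Optional

# Hypothetical sentinel values that should be treated as "no data".
SENTINELS = {"", "n/a", "unknown", "-"}


def build_anchor(rows: Iterable[dict], id_column: str) -> list[int]:
    """Return a sorted one-column list of unique, non-NULL IDs."""
    return sorted({row[id_column] for row in rows if row.get(id_column) is not None})


def build_attribute(rows: Iterable[dict], id_column: str, value_column: str) -> list[tuple[int, str]]:
    """Return unique (anchor ID, value) pairs with NULLs and sentinels removed."""
    pairs = set()
    for row in rows:
        anchor_id: Optional[int] = row.get(id_column)
        raw = row.get(value_column)
        if anchor_id is None or raw is None:
            continue  # filter out NULLs
        value = raw.strip().lower()  # canonicalize: trim whitespace, lower-case
        if value in SENTINELS:
            continue  # filter out sentinel values
        pairs.add((anchor_id, value))  # the set enforces uniqueness
    return sorted(pairs)


# Hypothetical messy input rows.
raw_users = [
    {"user_id": 1, "email": " Alice@Example.com "},
    {"user_id": 2, "email": "N/A"},
    {"user_id": 2, "email": None},
    {"user_id": 3, "email": "bob@example.com"},
    {"user_id": 3, "email": "bob@example.com"},
    {"user_id": None, "email": "ghost@example.com"},
]

user_anchor = build_anchor(raw_users, "user_id")
user_email = build_attribute(raw_users, "user_id", "email")
```

Here `user_anchor` holds the three known IDs, while `user_email` keeps only the two users with a usable email; duplicates, NULLs, and the `N/A` sentinel never reach the materialized dataset.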
SeattleDataGuy’s Newsletter • 906 implied HN points • 23 Feb 26
  1. Backfills are an unavoidable part of data work — you need them when source data is corrected, pipelines have bugs, or schemas and logic change.
  2. They’re hated because they can be expensive, slow, and risky at scale, can disrupt downstream users, and erode stakeholder trust when numbers shift unexpectedly.
  3. Design for safe backfills by building parameterized, rerunnable pipelines, adding strong data quality checks, communicating changes clearly, and using table-swaps or other strategies when partitions or immutable storage formats make in-place fixes risky.
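One way to read "parameterized, rerunnable" is a job that takes a date and overwrites that date's partition rather than appending to it. The sketch below assumes daily partitions and an in-memory `target` dictionary standing in for a real table; `SOURCE`, `run_partition`, and `backfill` are hypothetical names.

```python
from datetime import date, timedelta

# Hypothetical source events: (event_date, amount).
SOURCE = [
    (date(2026, 2, 1), 10),
    (date(2026, 2, 1), 5),
    (date(2026, 2, 2), 7),
]

# Target table keyed by partition date; each run replaces a whole partition.
target: dict[date, int] = {}


def run_partition(day: date) -> None:
    """Recompute one day's total and overwrite that partition."""
    total = sum(amount for event_day, amount in SOURCE if event_day == day)
    target[day] = total  # overwrite, never append, so reruns are safe


def backfill(start: date, end: date) -> None:
    """Rerun every daily partition in the inclusive range [start, end]."""
    day = start
    while day <= end:
        run_partition(day)
        day += timedelta(days=1)


backfill(date(2026, 2, 1), date(2026, 2, 2))
backfill(date(2026, 2, 1), date(2026, 2, 2))  # rerun leaves totals unchanged
```

Because each run replaces its partition, running the backfill twice produces the same totals as running it once.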
Ju Data Engineering Newsletter • 515 implied HN points • 17 Oct 24
  1. Iceberg separates storage from compute, making it easier to connect single-node engines to the data pipeline without needing extra steps.
  2. There are different approaches to integrating single-node engines, including running all processes in one worker or handling each transformation with separate workers.
  3. Partitioning data can improve efficiency by allowing independent processing of smaller chunks, which reduces the limitations of memory and speeds up data handling.
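The partitioning idea in the third takeaway can be illustrated with a small sketch in which each chunk is aggregated on its own and the partial results are merged. An in-memory list stands in for partitioned files here, and the `(region, amount)` schema and function names are assumptions made for the example.

```python
from collections import Counter
from typing import Iterator

Row = tuple[str, int]  # (region, amount); hypothetical schema


def split_into_partitions(rows: list[Row], size: int) -> Iterator[list[Row]]:
    """Yield fixed-size chunks so each one can be processed independently."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def aggregate(partition: list[Row]) -> Counter:
    """Sum amounts per region for a single partition."""
    partial = Counter()
    for region, amount in partition:
        partial[region] += amount
    return partial


rows = [("eu", 3), ("us", 4), ("eu", 1), ("us", 2), ("ap", 5)]

totals = Counter()
for partition in split_into_partitions(rows, size=2):
    totals.update(aggregate(partition))  # merge partial results
```

Each call to `aggregate` only needs one partition in memory, which is what lets a single-node engine work through data larger than its RAM.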
Jacob’s Tech Tavern • 1312 implied HN points • 17 Feb 26
  1. A single feature can balloon into a ludicrously elaborate pipeline that combines webscraping, long-running downloads, parsing and storage of large data, real-time analysis, and high-volume upload/polling.
  2. Most engineering work is routine, but rare peak challenges require orchestrating many moving parts and constant attention so they don’t overwhelm the team.
  3. Making a reliable system on top of unreliable third-party services takes sustained hardening and ongoing “whack-a-mole” maintenance to turn an MVP into production-grade software.
VuTrinh. • 1658 implied HN points • 24 Aug 24
  1. Parquet is a columnar file format: it stores data column by column rather than row by row. This makes it faster to read specific columns when you don't need everything at once.
  2. The structure of Parquet involves grouping data into row groups and column chunks. This helps balance the performance of reading and writing data, allowing users to manage large datasets efficiently.
  3. Parquet uses smart techniques like dictionary and run-length encoding to save space. These methods reduce the amount of data stored and speed up the reading process by minimizing the data that needs to be scanned.
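The two encodings named in the third takeaway can be shown with a toy example. This is a simplified sketch of the general techniques, not Parquet's actual on-disk layout; the function names and the sample `column` are made up for illustration.

```python
def dictionary_encode(values: list[str]) -> tuple[list[str], list[int]]:
    """Replace each value with a small integer code into a dictionary of distinct values."""
    dictionary: list[str] = []
    index: dict[str, int] = {}
    codes: list[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes: list[int]) -> list[tuple[int, int]]:
    """Collapse consecutive repeats into (code, run_length) pairs."""
    runs: list[tuple[int, int]] = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1] = (code, runs[-1][1] + 1)
        else:
            runs.append((code, 1))
    return runs


def decode(dictionary: list[str], runs: list[tuple[int, int]]) -> list[str]:
    """Expand the runs back into the original column values."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


column = ["US", "US", "US", "DE", "DE", "US"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
```

Six strings shrink to a two-entry dictionary plus three runs, and `decode(dictionary, runs)` restores the original column.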
VuTrinh. • 399 implied HN points • 17 Sep 24
  1. Metadata is really important because it helps organize and access data efficiently. It tells systems where files are and which ones can be ignored during processing.
  2. Google's BigQuery uses a unique system to manage metadata that allows for quick access and analysis of huge datasets. Instead of putting metadata with the data, it keeps them separate but organized in a smart way.
  3. The way BigQuery handles metadata improves performance by making sure that only the relevant information is accessed when running queries. This helps save time and resources, especially with very large data sets.
SeattleDataGuy’s Newsletter • 788 implied HN points • 09 Feb 26
  1. Data pipelines exist to create trust in your data by making it timely, accurate, consistent, recoverable, and scalable.
  2. They centralize and integrate siloed data so analysts, automations, and models can access well‑modeled, usable datasets.
  3. Build pipelines with clear business outcomes and ownership or they become costly technical liabilities; examples include reducing discounts, improving onboarding, and cutting support costs.
VuTrinh. • 859 implied HN points • 03 Sep 24
  1. Kubernetes is a powerful tool for managing containers, which are bundles of apps and their dependencies. It helps you run and scale many containers across different servers smoothly.
  2. Understanding how Kubernetes works is key. It compares the actual state of your application with the desired state to make adjustments, ensuring everything runs as expected.
  3. To start with Kubernetes, begin small and simple. Use local tools for practice, and learn step-by-step to avoid feeling overwhelmed by its many components.
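The desired-versus-actual comparison in the second takeaway can be sketched as a single reconciliation step. This is a conceptual illustration only and does not use the Kubernetes API; the `reconcile` function and the workload names are hypothetical.

```python
def reconcile(desired: dict[str, int], actual: dict[str, int]) -> list[str]:
    """Compare desired and actual replica counts and list the corrective actions."""
    actions: list[str] = []
    for name in sorted(desired.keys() | actual.keys()):
        want = desired.get(name, 0)
        have = actual.get(name, 0)
        if have < want:
            actions.append(f"start {want - have} x {name}")
        elif have > want:
            actions.append(f"stop {have - want} x {name}")
    return actions


desired_state = {"web": 3, "worker": 1}
actual_state = {"web": 1, "cron": 2}

actions = reconcile(desired_state, actual_state)
```

A real controller would run a step like this in a loop, applying the actions and re-reading the actual state until the two match.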
benn.substack • 1380 implied HN points • 23 Jan 26
  1. Writing and reading SQL demand different styles: shortcuts and shorthand speed up writing but make queries harder to understand, and teams often prioritize writing convenience over clarity.
  2. With AI generating much of the code, development has shifted to a "vibe and verify" model, but data work is hard to verify because queries and analyses are difficult to check by eye or prose alone.
  3. The solution is better representations for comprehension — diagrams, clearer formatting, or a language/app that turns any query into an accessible, annotated picture so humans can quickly verify what the computation actually did.
VuTrinh. • 139 implied HN points • 24 Sep 24
  1. Google's BigLake allows users to access and manage data across different storage solutions like BigQuery and object storage. This makes it easier to work with big data without needing to move it around.
  2. The Storage API enhances BigQuery by letting external tools like Apache Spark and Trino directly access its stored data, speeding up the data processing and analysis.
  3. BigLake tables offer strong security features and better performance for querying open-source data formats, making it a more robust option for businesses that need efficient data management.
SeattleDataGuy’s Newsletter • 741 implied HN points • 31 Jan 26
  1. Big cloud vendors will keep rebranding and repositioning their data products to appear 'AI-first', adding marketing noise and confusion about which tools to use.
  2. Almost all companies still rely on Excel, SFTP, and manual exports. Only a small share chase flashy AI while most need simple tools to convert spreadsheets into reliable data pipelines.
  3. The modern data stack will be shaken by acquisitions, price changes, and fragile pipelines, forcing many teams to rebuild infrastructure and turn AI proofs-of-concept into production-ready foundations.
VuTrinh. • 279 implied HN points • 14 Sep 24
  1. Uber evolved from simple data management with MySQL to a more complex system using Hadoop to handle huge amounts of data efficiently.
  2. They faced challenges with data reliability and latency, which slowed down their ability to make quick decisions.
  3. Uber introduced a system called Hudi that allowed for faster updates and better data management, helping them keep their data fresh and accurate.
VuTrinh. • 519 implied HN points • 27 Aug 24
  1. AutoMQ enables Kafka to run entirely on object storage, which improves efficiency and scalability. This design removes the need for tightly-coupled compute and storage, allowing more flexible resource management.
  2. AutoMQ uses a unique caching system to handle data, which helps maintain fast performance for both recent and historical data. It has separate caches for immediate and long-term data needs, enhancing read and write speeds.
  3. Reliability in AutoMQ is ensured through a Write Ahead Log system using AWS EBS, which helps recover data after crashes. This setup allows for fast failover and data persistence, so no messages get lost.
VuTrinh. • 799 implied HN points • 10 Aug 24
  1. Apache Iceberg is a table format that helps manage data in a data lake. It makes it easier to organize files and allows users to interact with data without worrying about how it's stored.
  2. Iceberg has a three-layer architecture: data, metadata, and catalog, which work together to track and manage the actual data and its details. This structure allows for efficient querying and data operations.
  3. One cool feature of Iceberg is its ability to time travel, meaning you can access previous versions of your data. This lets you see changes and retrieve earlier data as needed.
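Time travel follows from keeping every snapshot rather than mutating the table in place. The toy model below is an assumption-laden sketch, not Iceberg's metadata format: `ToyTable` stores each snapshot as the full tuple of data files visible at that version, and snapshot IDs are simply list positions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyTable:
    """Toy model: each snapshot is the full tuple of data files visible at that version."""
    snapshots: list[tuple[str, ...]] = field(default_factory=list)

    def commit(self, added_files: list[str]) -> int:
        """Record a new snapshot and return its ID (its position in the log)."""
        previous = self.snapshots[-1] if self.snapshots else ()
        self.snapshots.append(previous + tuple(added_files))
        return len(self.snapshots) - 1

    def files(self, snapshot_id: Optional[int] = None) -> tuple[str, ...]:
        """Return the files for a snapshot; default to the latest one."""
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]


table = ToyTable()
first = table.commit(["a.parquet"])
second = table.commit(["b.parquet"])

current_files = table.files()       # latest version
files_at_first = table.files(first)  # "time travel" to the first snapshot
```

Reading an older version is just a lookup of which files that snapshot referenced; no data is rewritten.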
SeattleDataGuy’s Newsletter • 718 implied HN points • 14 Jan 26
  1. A reliable pipeline system needs many core components—secure secrets and connection management, rich logging and monitoring, dependency tracking, execution routing, scheduling, data quality checks, pipeline definitions, and a usable UI—because missing any of these creates ongoing operational headaches.
  2. Operational practices like idempotency and easy backfilling, clear ownership, alerting/on-call routing, and environment isolation are critical so reruns don’t create duplicates and failures get handled quickly.
  3. Most teams should prefer existing tools unless they have a clear reason to build. If you do build, explicitly scope features—like compute routing or AI integrations—and plan for long‑term maintenance.
VuTrinh. • 339 implied HN points • 31 Aug 24
  1. Apache Iceberg organizes data into a data layer and a metadata layer, making it easier to manage large datasets. The data layer holds the actual records, while the metadata layer keeps track of those records and their changes.
  2. Iceberg's manifest files help improve read performance by storing statistics for multiple data files in one place. This means the reader can access all needed statistics without opening each individual data file.
  3. Hidden partitioning in Iceberg allows users to filter data without needing extra columns, saving space. It records transformations on columns instead, helping streamline queries and manage data efficiently.
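The read-performance point in the second takeaway comes down to skipping files whose value range cannot match the query. A minimal sketch, assuming a manifest that records a min and max per data file for one integer column; `DataFileStats` and `files_to_scan` are hypothetical names, not Iceberg classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileStats:
    """Per-file statistics as a manifest might record them (simplified)."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(manifest: list[DataFileStats], lower: int, upper: int) -> list[str]:
    """Keep only files whose [min, max] range overlaps the query range [lower, upper]."""
    return [
        stats.path
        for stats in manifest
        if stats.max_value >= lower and stats.min_value <= upper
    ]


manifest = [
    DataFileStats("a.parquet", 1, 100),
    DataFileStats("b.parquet", 101, 200),
    DataFileStats("c.parquet", 201, 300),
]

selected = files_to_scan(manifest, lower=150, upper=220)
```

The reader decides from the manifest alone that `a.parquet` can be skipped, without opening any data file.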
SeattleDataGuy’s Newsletter • 859 implied HN points • 05 Jan 26
  1. Data pipelines come in many shapes — from source standardization and amalgamation to enrichment, operational syncs, and even manual Excel-based processes — each built for different business needs.
  2. Common challenges are mapping and standardizing varied formats, keeping reliable IDs and timing for joins, and handling data quality and system-specific ingestion limits.
  3. Despite the variety, pipelines all aim to move and transform source data into usable outputs for analytics, operations, or ML, and they often follow the same extract-transform-load steps that can be automated and productionized.
Complexity is overrated • 85 implied HN points • 24 Feb 26
  1. Data should be viewed as a stream of events rather than just a static database state, and Kafka implements this by providing a distributed immutable commit log that decouples producers and consumers.
  2. Kafka is extremely versatile and gets used for many scenarios beyond its original use case, but teams often pigeonhole it or call it overkill for problems it can actually solve well.
  3. An expanding Kafka ecosystem (Kafka++) — integrating tools like Flink and Iceberg — makes real-time streaming data more useful for analytics, data engineering, and operational use cases, widening who can benefit from Kafka.
VuTrinh. • 399 implied HN points • 20 Aug 24
  1. Discord started with its own tool called Derived to manage data, but it found this system limited as it grew. They needed a better way to handle complex data tasks.
  2. They switched to using popular tools like Dagster and dbt. This helped them automate and better manage their data processes.
  3. With the new setup, Discord can now make changes quickly and safely, which improves how they analyze and use their vast amounts of data.
VuTrinh. • 519 implied HN points • 06 Aug 24
  1. Notion uses a flexible block system, letting users customize how they organize their notes and projects. Each block can be changed and moved around, making it easy to create what you need.
  2. To manage the huge amount of data, Notion shifted from a single database to a more complex setup with multiple shards and instances. This change helps them handle heavier user demand and analytics needs more efficiently.
  3. By creating an in-house data lake, Notion saved a lot of money and improved data processing speed. This new system allows them to quickly get data from their main database for analytics and support new features like AI.
Nicolas Bustamante • 435 implied HN points • 24 Jan 26
  1. Isolated sandboxes and an S3-first, filesystem-backed architecture are essential for safely running multi-step agent workflows and giving each user a private, replayable execution environment.
  2. Clean, normalized context is the product: chunked markdown narratives, structured CSV/tables, and rich JSON metadata are what let agents reliably reason over messy financial sources like SEC filings.
  3. Skills plus the surrounding experience are the moat: lightweight, editable markdown skills, rigorous evals, real-time streaming UX, long-running orchestration, and production monitoring make the product reliable and defensible as models improve.
SeattleDataGuy’s Newsletter • 1036 implied HN points • 09 Dec 25
  1. Using the 'exploration' approach in interviews helps candidates show their true understanding of data engineering. It starts with a broad view and zooms into details, making for engaging, productive conversations.
  2. Creating a human connection during interviews is important. Small personal introductions can ease candidates' nerves, allowing them to perform better when discussing technical topics.
  3. Assessing both breadth and depth of knowledge is key in interviews. Good candidates can explain how different data technologies work together and understand the reasoning behind their choices.
Data Science Weekly Newsletter • 139 implied HN points • 05 Sep 24
  1. AI prompt engineering is becoming more important, and experts share helpful tips on how to improve your skill in this area.
  2. Researchers in AI should focus on making an impact through their work by creating open-source resources and better benchmarks.
  3. Data quality is a common concern in many organizations, yet many leaders struggle to prioritize it properly and invest in solutions.
Data Science Weekly Newsletter • 179 implied HN points • 29 Aug 24
  1. Distributed systems are changing a lot. This affects how we operate and program these systems, making them more secure and easier to manage.
  2. Statistics are really important in everyday life, even if we don't see it. Talks this year aim to inspire students to understand and appreciate statistics better.
  3. Understanding how AI models work internally is a growing field. Many AI systems are complex, and researchers want to learn how they make decisions and produce outputs.
VuTrinh. • 279 implied HN points • 17 Aug 24
  1. Facebook's real-time data processing system needs to handle huge amounts of data quickly, with only a few seconds of wait time. This helps in keeping things running smoothly for users.
  2. Their system uses a message bus called Scribe to connect different parts, making it easier to manage data flow and recover from errors. This setup improves how they deal with issues when they arise.
  3. Different tools like Puma and Stylus allow developers to build applications in different ways, depending on their needs. This means teams can quickly create and improve their applications over time.
VuTrinh. • 299 implied HN points • 13 Aug 24
  1. LinkedIn uses Apache Kafka to manage a massive flow of information, handling around 7 trillion messages every day. They set up a complex system of clusters and brokers to ensure everything runs smoothly.
  2. To keep everything organized, LinkedIn has a tiered system where data is processed locally in each data center, then sent to an aggregate cluster. This helps them avoid issues from moving data across different locations.
  3. LinkedIn has an auditing tool to make sure all messages are tracked and nothing gets lost during transmission. This helps them quickly identify any problems and fix them efficiently.
VuTrinh. • 359 implied HN points • 30 Jul 24
  1. Netflix's data engineering stack uses tools like Apache Iceberg and Spark for building batch data pipelines. This helps them transform and manage large amounts of data efficiently.
  2. For real-time data processing, Netflix relies on Apache Flink and a tool called Keystone. This setup makes it easier to handle streaming data and send it where it needs to go.
  3. To ensure data quality and scheduling, Netflix has developed tools like the WAP pattern for auditing data and Maestro for managing workflows. These tools help keep the data process organized and reliable.
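WAP is commonly expanded as write-audit-publish: data lands in a staging area, checks run against it, and only passing batches become visible to consumers. The sketch below illustrates that flow under stated assumptions; it is not Netflix's tooling, and the `audits` dictionary and `write_audit_publish` function are hypothetical.

```python
from typing import Callable

Row = dict
Audit = Callable[[list[Row]], bool]


def write_audit_publish(batch: list[Row], published: list[Row], audits: dict[str, Audit]) -> list[str]:
    """Stage a batch, run audits on it, and publish only if every audit passes."""
    staged = list(batch)  # write: land the data where consumers cannot see it
    failed = [name for name, check in audits.items() if not check(staged)]  # audit
    if not failed:
        published.extend(staged)  # publish: expose the audited data
    return failed


# Hypothetical audits.
audits: dict[str, Audit] = {
    "non_empty": lambda rows: len(rows) > 0,
    "no_null_ids": lambda rows: all(row.get("id") is not None for row in rows),
}

published: list[Row] = []
ok = write_audit_publish([{"id": 1}, {"id": 2}], published, audits)
bad = write_audit_publish([{"id": None}], published, audits)
```

The second batch fails its audit, so `published` still contains only the first batch and downstream readers never see the bad rows.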
SeattleDataGuy’s Newsletter • 541 implied HN points • 12 Dec 25
  1. Databricks is working to be an all-in-one data platform. It started by attracting data scientists and is now courting analysts too, and it wants to be seen as a solution that can fit everyone's data needs.
  2. Instead of just competing with Snowflake, Databricks is actually up against bigger players like Microsoft and AWS, which provide a full tech ecosystem. Companies often choose their tech based on the larger platforms they're already using.
  3. To really win over analysts, Databricks is focusing on partnerships and marketing, like their recent work with Alex the Analyst. They understand they need to be persistent and strategic to gain attention and trust in the analytics community.
VuTrinh. • 299 implied HN points • 03 Aug 24
  1. LinkedIn's data infrastructure is organized into three main tiers: data, service, and display. This setup helps the system to scale easily without moving data around.
  2. Voldemort is LinkedIn's key-value store that efficiently handles high-traffic queries and allows easy scaling by adding new nodes without downtime.
  3. Databus is a change data capture system that keeps LinkedIn's databases synchronized across applications, allowing for quick updates and consistent data flow.
VuTrinh. • 539 implied HN points • 06 Jul 24
  1. Apache Kafka is a system for handling large amounts of data messages, making it easier for companies like LinkedIn to manage and analyze user activity and other important metrics.
  2. In Kafka, messages are organized into topics and divided into partitions, allowing for better performance and scalability. This way, different servers can handle parts of the data at once.
  3. Kafka uses a pull model for consumers, meaning they can request data as they need it. This helps prevent overwhelming the consumers with too much data at once.
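Partitioning by key and consumer-driven pulling can both be shown with a toy in-memory topic. This is a conceptual sketch rather than a Kafka client: `ToyTopic` is hypothetical, and `zlib.crc32` is a stand-in hash, not Kafka's own partitioner.

```python
import zlib


class ToyTopic:
    """In-memory topic: a list of append-only partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: list[list[str]] = [[] for _ in range(num_partitions)]

    def produce(self, key: str, value: str) -> int:
        """Append to the partition chosen by hashing the key; return its index."""
        partition = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition

    def poll(self, partition: int, offset: int, max_messages: int) -> list[str]:
        """Return at most max_messages starting at the consumer's offset."""
        return self.partitions[partition][offset:offset + max_messages]


topic = ToyTopic(num_partitions=4)
partition = topic.produce("user-1", "a")
topic.produce("user-1", "b")
topic.produce("user-1", "c")

# The consumer owns its offset and asks for data at its own pace.
offset = 0
batches = []
while True:
    batch = topic.poll(partition, offset, max_messages=2)
    if not batch:
        break
    batches.append(batch)
    offset += len(batch)
```

Because the consumer chooses `max_messages` and tracks its own `offset`, a slow consumer simply polls less often instead of being flooded.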
Recommender systems • 76 implied HN points • 23 Feb 26
  1. Bluesky builds Discover personalization from fixed post embeddings (BLIP2) plus broad topic labels and finer HDBSCAN clusters to track user interests, after an initial two‑tower retrieval approach didn’t work out.
  2. PinnerSage captures diverse short‑ and long‑term interests by clustering a user’s recent interactions into many medoids, scoring each cluster with a time‑decay importance, and using those medoids as weighted seeds for ANN candidate retrieval.
  3. Multiple per‑user medoids ease retrieval but complicate ranking, so the plan is to use PinnerSage for candidate generation and then adopt a transformer (PinnerFormer) to create a single user embedding for efficient, accurate ranking.
VuTrinh. • 339 implied HN points • 23 Jul 24
  1. AWS offers a variety of tools for data engineering like S3, Lambda, and Step Functions, which can help anyone build scalable projects. These tools are often underused compared to newer options but are still very effective.
  2. Services like SNS and SQS can help manage data flow and processing. SNS allows for publishing messages while SQS aids in handling high event volumes asynchronously.
  3. Using AWS for data engineering is often simpler than switching to modern tools. It's easier to add new AWS services to your existing workflow than to migrate to something completely new.
Nicolas Bustamante • 104 implied HN points • 11 Feb 26
  1. Context tokens are expensive and degrade performance as they accumulate, so treat context as a scarce resource and keep prompts stable and append-only; move dynamic pieces (like timestamps) to the end so you preserve KV cache hits.
  2. Architect agents to minimize tokens by storing tool outputs as files, using precise two-step tools that return metadata before full content, delegating work to cheaper subagents, reusing templates, batching or parallelizing tool calls, and caching common responses at the application level.
  3. Clean and compact data before sending it to the model, place critical information at the beginning or end to avoid the lost-in-the-middle problem, use summarization/compaction before hitting pricing cliffs, and set strict output token limits to control costly outputs.
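The advice to keep prompts append-only and push dynamic pieces to the end can be made concrete by measuring how much of two consecutive prompts is identical from the start. The sketch below uses characters as a rough proxy for tokens, and `build_prompt` and `shared_prefix_length` are hypothetical helpers.

```python
def build_prompt(system: str, history: list[str], timestamp: str, dynamic_first: bool = False) -> str:
    """Assemble a prompt, placing the changing timestamp first or last."""
    clock = f"Current time: {timestamp}"
    parts = [clock, system, *history] if dynamic_first else [system, *history, clock]
    return "\n".join(parts)


def shared_prefix_length(a: str, b: str) -> int:
    """Count the leading characters two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


system = "You are a data assistant."
turn1 = ["user: hi"]
turn2 = turn1 + ["assistant: hello"]  # append-only history

# Timestamp last: the system prompt and earlier history stay byte-identical.
stable_shared = shared_prefix_length(
    build_prompt(system, turn1, "10:00"),
    build_prompt(system, turn2, "10:01"),
)

# Timestamp first: the prompts diverge almost immediately.
front_shared = shared_prefix_length(
    build_prompt(system, turn1, "10:00", dynamic_first=True),
    build_prompt(system, turn2, "10:01", dynamic_first=True),
)
```

With the timestamp at the end, the shared prefix covers the whole system prompt and prior history, which is the portion a prefix cache could reuse; with it at the front, the shared prefix ends inside the timestamp.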
Data Science Weekly Newsletter • 139 implied HN points • 22 Aug 24
  1. When building web applications, using Postgres for data storage is a good default choice. It's reliable and widely used.
  2. A new study shows that agents can learn useful skills without rewards or guidance. They can explore and develop abilities just from observing a goal.
  3. A list of important books and resources in Bayesian statistics is being compiled. It's a way to recognize influential ideas in this field.
davidj.substack • 95 implied HN points • 06 Feb 26
  1. Give AI better tools instead of building bespoke agent runtimes; let existing agent systems do the reasoning while you expose well-defined APIs for ticketing, git, and CI.
  2. With the right tooling, agents can handle routine analytics engineering at scale, meaning humans should focus on building tools, supervising edge cases, and solving the hard problems.
  3. Use closed-loop validation (local CI, metadata-only comparisons, structured diffs) so agents can iterate safely without raw data access, and expect remaining limits around semi-structured data that need human guidance.
Minimal Modeling • 202 implied HN points • 12 Jan 26
  1. Model joins by attaching a nested dataset to each outer row and then flattening by duplicating the outer row for each inner row; if the inner set is empty you skip the outer row for INNER JOIN or replace it with a single NULL row for LEFT JOIN.
  2. The inner part of a query becomes very simple: INNER JOIN is just a filtered SELECT, GROUP BY is an aggregated filtered SELECT, and LEFT JOIN is a filtered SELECT plus a conditional UNION ALL NULL row, so no special-casing is needed.
  3. Splitting queries into an outer table and a per-row inner dataset gives a clear, teachable mental model and a single canonical flattening rule you can reuse to reason about more complex SQL patterns like correlated subqueries.
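The flattening rule described above is short enough to write out directly. This is a minimal sketch of the mental model, not the author's code: `flatten` and the sample `users` and `orders` data are hypothetical, and Python's `None` stands in for SQL `NULL`.

```python
from typing import Callable, Optional


def flatten(
    outer_rows: list[str],
    inner_for: Callable[[str], list[str]],
    left_join: bool = False,
) -> list[tuple[str, Optional[str]]]:
    """Duplicate each outer row once per inner row; handle empty inner sets per join type."""
    result: list[tuple[str, Optional[str]]] = []
    for outer in outer_rows:
        inner_rows: list[Optional[str]] = list(inner_for(outer))
        if not inner_rows and left_join:
            inner_rows = [None]  # LEFT JOIN: keep the outer row with a single NULL row
        for inner in inner_rows:  # INNER JOIN: an empty inner set skips the outer row
            result.append((outer, inner))
    return result


users = ["ann", "bob"]
orders = {"ann": ["o1", "o2"]}  # bob has no orders

inner_join = flatten(users, lambda user: orders.get(user, []))
left_join = flatten(users, lambda user: orders.get(user, []), left_join=True)
```

The only difference between the two join types is the single branch that substitutes a `None` row when the inner dataset is empty.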
SeattleDataGuy’s Newsletter • 447 implied HN points • 17 Nov 25
  1. Moving from senior to staff data engineer requires developing non-technical skills like communication and project management. It's important to help your teammates and have a holistic view of your work.
  2. Staff engineers need to be adaptable and handle more responsibilities beyond coding, such as mentoring and collaboration. They also need to maintain good relationships with different teams and stakeholders.
  3. A clear understanding of project goals and the ability to design scalable solutions are essential. This often involves diagramming ideas and determining what should be built in-house versus what can be delegated.
VuTrinh. • 259 implied HN points • 13 Jul 24
  1. Kafka uses the operating system's filesystem to store data, which helps it run faster by leveraging the page cache. This avoids the need to keep too much data in memory, making it simpler to manage.
  2. Kafka reads and writes data sequentially, which is more efficient than random access. This design improves performance, because sequential access reduces delays.
  3. Kafka groups messages together before sending them, which helps reduce the number of requests made to the system. This batching process improves performance by allowing larger, more efficient data transfers.
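The batching idea in the third takeaway can be sketched as a producer that buffers messages and sends them in groups. This is a simplified, size-only illustration, not the Kafka producer API; `BatchingProducer` and its `send` callback are hypothetical.

```python
from typing import Callable


class BatchingProducer:
    """Buffer messages and hand them to `send` in groups of `batch_size`."""

    def __init__(self, batch_size: int, send: Callable[[list[str]], None]) -> None:
        self.batch_size = batch_size
        self.send = send
        self.buffer: list[str] = []

    def produce(self, message: str) -> None:
        """Queue a message and flush once the batch is full."""
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as one request."""
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()


requests: list[list[str]] = []  # each entry represents one network request
producer = BatchingProducer(batch_size=3, send=requests.append)

for i in range(7):
    producer.produce(f"m{i}")
producer.flush()  # send the final partial batch
```

Seven messages go out in three requests instead of seven, which is the saving the takeaway describes.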