Data Streaming Journey

Data Streaming Journey explores data streaming, stream processing, challenges, and solutions in real-time data pipelines, with insights on technologies like Change Data Capture, Redpanda, and Apache Flink. It features interviews with industry innovators, reviews of data platforms, and discussions on design patterns for efficient data handling.

Data Streaming Techniques, Stream Processing, Data Ingestion Methods, Real-time Data Pipelines, Technology Reviews, Data Architecture, Design Patterns in Data Streaming, Industry Interviews, Event-Driven Architectures, State Management in Streaming Systems

The hottest Substack posts of Data Streaming Journey

And their main takeaways
79 implied HN points 28 Oct 24
  1. Kafka and similar tools are still relevant and necessary for effective data streaming today. They help handle large amounts of data quickly and reliably.
  2. Modern tools that build on or complement Kafka, like Materialize and Debezium, simplify working with operational data and make it easier to integrate with other tools.
  3. Even if you only want to move data from a database to a data warehouse, using a streaming platform can benefit larger enterprises by making data integration more efficient.
237 implied HN points 31 Jul 23
  1. Change Data Capture (CDC) is a powerful data ingestion technique, especially log-based CDC.
  2. CDC can lead to issues in data modeling if not handled properly.
  3. Consider using the Transactional Outbox pattern with CDC for reliable delivery of application or domain events.
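The Transactional Outbox pattern mentioned above can be sketched roughly as follows (a minimal illustration using SQLite; the table and column names are hypothetical, and in practice a log-based CDC tool such as Debezium would stream the outbox table to consumers):

```python
import json
import sqlite3

# Illustrative outbox sketch: the business write and the outbox event
# commit in ONE transaction, so CDC never sees an event without its data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event_type TEXT, payload TEXT);
""")

def place_order(order_id: int) -> None:
    with conn:  # single transaction: both rows commit, or neither does
        conn.execute("INSERT INTO orders (id, status) VALUES (?, ?)",
                     (order_id, "PLACED"))
        conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                     ("OrderPlaced", json.dumps({"order_id": order_id})))

place_order(42)
events = conn.execute("SELECT event_type, payload FROM outbox").fetchall()
print(events)  # [('OrderPlaced', '{"order_id": 42}')]
```

The key property is atomicity: the domain event only exists if the state change it describes was committed.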
138 implied HN points 12 Oct 23
  1. WarpStream eliminates local disks, making deployment easier and handling partition rebalancing automatically.
  2. The WarpStream team prioritizes concrete use cases when implementing features, running continuous workloads in staging and production environments.
  3. WarpStream emphasizes correctness testing, inspired by simulation testing concepts like Jepsen, to ensure data integrity and fault tolerance.
178 implied HN points 17 Jul 23
  1. Exactly-once delivery remains hard to achieve in practice, and getting it wrong can still result in data loss.
  2. Consider why you need exactly-once delivery and explore alternatives such as upserts and deduplication.
  3. Using upserts with data stores supporting this feature or implementing deduplication with a stateful streaming processor may be better options than focusing solely on exactly-once delivery.
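The deduplication alternative above can be illustrated with a minimal sketch (a stand-in for a stateful streaming processor's keyed state; the event shape and function names are made up for illustration):

```python
# Dedup sketch: drop events whose id was already seen, so an
# at-least-once source still yields effectively-once downstream effects.
def deduplicate(events, seen=None):
    seen = set() if seen is None else seen
    for event in events:
        if event["id"] in seen:
            continue  # duplicate delivery from the at-least-once source
        seen.add(event["id"])
        yield event

# The same event delivered twice is emitted only once.
incoming = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "a"}]
unique = list(deduplicate(incoming))
print([e["id"] for e in unique])  # [1, 2]
```

In a real pipeline the `seen` set would live in durable, keyed operator state (and be bounded with a TTL) rather than in process memory; upserts push the same idea into the data store by making redelivery idempotent on the primary key.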
99 implied HN points 13 Nov 23
  1. Flink Forward conference took place in Seattle with major industry practitioners like Netflix and Apple.
  2. Ververica announced its Cloud offering is now GA and introduced a faster Flink runtime assembly.
  3. The conference discussed unifying stream and batch processing, with a focus on dataset support for different query types.
79 implied HN points 12 Dec 23
  1. Materialize focuses on operational data warehouse positioning to solve problems faced by traditional data warehouses.
  2. Materialize offers an entirely SQL-based system for a familiar user experience, while also supporting hybrid analytical use cases.
  3. Materialize plans to support User-Defined Functions (UDFs) using WASM in the future for more extensibility.
59 implied HN points 29 Jan 24
  1. Building an Ingestion HTTP API for Kafka is a popular data ingestion technique.
  2. A custom Ingestion HTTP API offers more control over delivery semantics and data handling compared to off-the-shelf solutions.
  3. Consider the trade-offs between synchronous and asynchronous write models when implementing an Ingestion HTTP API.
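The synchronous-versus-asynchronous trade-off above can be sketched like this (a toy illustration, not the post's implementation; `FakeProducer` stands in for a real Kafka client, and the handler names are invented):

```python
# Sketch of the sync-vs-async write models behind a custom ingestion HTTP API.
class FakeProducer:
    def __init__(self):
        self.log = []
    def send(self, topic, value):
        self.log.append((topic, value))
        return "ack"  # a real Kafka client returns a future to await

def handle_sync(producer, payload):
    # Synchronous write: block until the broker acknowledges, then return 200.
    # Stronger delivery guarantee, higher request latency.
    ack = producer.send("ingest", payload)
    return 200 if ack == "ack" else 503

def handle_async(producer, buffer, payload):
    # Asynchronous write: buffer locally and return 202 immediately;
    # a background task would flush the buffer to the broker.
    # Lower latency, but the client's 202 is not a durability guarantee.
    buffer.append(payload)
    return 202

producer, buffer = FakeProducer(), []
print(handle_sync(producer, b"event-1"))           # 200
print(handle_async(producer, buffer, b"event-2"))  # 202
```

The choice determines what an HTTP success code actually promises the caller, which is exactly the control an off-the-shelf proxy may not give you.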
118 implied HN points 24 Jul 23
  1. Redpanda offers superior performance over Apache Kafka with higher throughput, lower latencies, and cost savings.
  2. Redpanda's Kafka API compatibility, Redpanda Console functionality, and being packaged as a single binary are standout features.
  3. Areas for improvement in Redpanda include Schema Registry management, metered pricing for serverless options, and consideration for making Tiered Storage free.
79 implied HN points 06 Oct 23
  1. The Current 2023 conference in San Jose, California had disappointing keynotes from Confluent.
  2. Streaming databases are gaining traction but may not cover all data streaming challenges.
  3. The discovery of Conduktor and the presentation on Restate were highlights of the conference.
39 implied HN points 15 Jan 24
  1. Modern data streaming frameworks usually focus on dataflow topologies with sources, transformations, and sinks.
  2. Each actor processes its messages serially, but running many actors in parallel yields high concurrency.
  3. Actors can be used in data streaming scenarios, with possibilities for handling messaging, state management, and emitting events.
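A toy actor can make the serial-processing point concrete (an illustrative sketch, not any particular framework's API; the class and method names are invented):

```python
from queue import Queue

# Toy actor: a mailbox drained one message at a time, private state,
# and emitted events. Concurrency comes from running many such actors
# in parallel, each still serial internally.
class CounterActor:
    def __init__(self):
        self.mailbox = Queue()
        self.count = 0     # private state, touched only by this actor
        self.emitted = []  # events this actor emits downstream

    def tell(self, msg):
        self.mailbox.put(msg)

    def run(self):
        # Serial processing: handle exactly one message at a time.
        while not self.mailbox.empty():
            msg = self.mailbox.get()
            self.count += 1
            self.emitted.append(("counted", msg, self.count))

actor = CounterActor()
for m in ["a", "b", "c"]:
    actor.tell(m)
actor.run()
print(actor.count)  # 3
```

Because no two messages for the same actor are processed concurrently, state updates need no locks, which maps naturally onto per-key processing in a streaming job.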
59 implied HN points 14 Aug 23
  1. Organizational dichotomy in streaming teams can lead to conflicts and challenges between infrastructure and stream processing teams.
  2. Infrastructure teams prioritize uptime and cost, while stream processing teams aim to innovate and deliver high-impact projects.
  3. Collaboration strategies include treating teams as partners, focusing on innovation, and considering reorganization when necessary.
59 implied HN points 07 Aug 23
  1. Changelog data streams have primary keys and operations like insert, update, or delete, borrowed from databases.
  2. Append-only data streams, like clickstream and telemetry data, are time-series and don't support updates or deletes.
  3. Stream-table duality is essential for modern data integration, allowing databases and stream-processing platforms to work together efficiently.
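Stream-table duality can be shown in a few lines: replaying a changelog stream keyed by primary key materializes the current table state (event shapes here are illustrative):

```python
# Replaying a changelog (insert / update / delete per primary key)
# reconstructs the table; the stream IS the table's history.
changelog = [
    {"op": "insert", "key": "u1", "value": {"name": "Ada"}},
    {"op": "insert", "key": "u2", "value": {"name": "Bob"}},
    {"op": "update", "key": "u1", "value": {"name": "Ada L."}},
    {"op": "delete", "key": "u2", "value": None},
]

def materialize(stream):
    table = {}
    for event in stream:
        if event["op"] == "delete":
            table.pop(event["key"], None)
        else:  # insert and update both upsert by primary key
            table[event["key"]] = event["value"]
    return table

print(materialize(changelog))  # {'u1': {'name': 'Ada L.'}}
```

An append-only stream, by contrast, has no primary key to upsert on: every record is a new fact, which is why updates and deletes don't apply.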
39 implied HN points 30 Oct 23
  1. Blackhole Sink Pattern is a simpler way to conduct blue-green deployments for streaming applications.
  2. Zero-downtime deployments can be achieved in streaming applications using the Blackhole Sink Pattern.
  3. Important considerations for the Blackhole Sink Pattern include managing other types of outputs and ensuring data sources allow concurrent consumers.
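The Blackhole Sink idea above can be sketched as follows (an illustrative toy, assuming the pattern as described; class names are invented): the new "green" deployment consumes the same source and builds its state, but its sink discards output until cutover, so only one deployment ever writes results.

```python
# Blue-green sketch: both pipelines run, but the green one's sink
# is a black hole until cutover, giving zero-downtime switchover.
class RealSink:
    def __init__(self):
        self.written = []
    def write(self, record):
        self.written.append(record)

class BlackholeSink:
    """Accepts records and drops them; swapped for the real sink at cutover."""
    def write(self, record):
        pass

def run_pipeline(source, sink):
    for record in source:
        sink.write(record.upper())  # stand-in transformation

source = ["a", "b"]
blue, green = RealSink(), BlackholeSink()
run_pipeline(source, blue)   # live deployment emits results
run_pipeline(source, green)  # new deployment warms up, emits nothing
print(blue.written)  # ['A', 'B']
```

As the takeaways note, this only works if the source allows concurrent consumers and every side effect of the green pipeline (not just the main sink) is suppressed until cutover.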
39 implied HN points 19 Oct 23
  1. Proton is a unified streaming and historical data processing engine that aims to make analytics workloads fast and efficient.
  2. Proton chose ClickHouse as its foundation due to its speed, extensibility, and the team's experience with C++.
  3. Proton supports SQL as the primary interface but also allows for extending capabilities through user-defined functions and embedded JavaScript.
19 implied HN points 18 Sep 23
  1. State and timers are essential building blocks for understanding stateful stream processing.
  2. State is like an in-memory variable but backed by distributed storage and comes in different types like keyed, global, key-value, list, map, and versioned/timestamped.
  3. Timers are crucial for controlling state size in streaming systems and can be used for defining windows and scheduling actions.
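The interplay of keyed state and timers can be sketched in a few lines (in the spirit of Flink's keyed state and timer service, but the class and method names here are invented for illustration):

```python
# Keyed state plus timers: each event (re)schedules a cleanup timer;
# when the watermark passes a timer, the key's state is dropped,
# keeping total state size bounded.
class KeyedCounter:
    def __init__(self, ttl):
        self.ttl = ttl
        self.state = {}   # keyed state: key -> count
        self.timers = {}  # key -> expiry timestamp

    def on_event(self, key, timestamp):
        self.state[key] = self.state.get(key, 0) + 1
        self.timers[key] = timestamp + self.ttl  # (re)schedule cleanup

    def on_watermark(self, watermark):
        # Fire expired timers: drop state for idle keys.
        for key in [k for k, t in self.timers.items() if t <= watermark]:
            del self.state[key]
            del self.timers[key]

op = KeyedCounter(ttl=10)
op.on_event("user-1", timestamp=100)
op.on_event("user-2", timestamp=105)
op.on_watermark(112)     # user-1's timer (110) fires; user-2's (115) does not
print(sorted(op.state))  # ['user-2']
```

Windows are essentially this same mechanism: state accumulates per key, and a timer at the window's end emits the result and clears the state.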
1 HN point 05 Sep 23
  1. The current data platform architectures are complex and fragmented, lacking a unified tool for data processing.
  2. There is a need for a more consolidated data platform that combines a streaming log, lakehouse, and OLAP engine into a single tool.
  3. The industry is trending towards cloud-native architectures and technologies like streaming databases and Apache Arrow for more efficient and scalable data processing.
0 implied HN points 27 Nov 23
  1. Consider whether to materialize data transformations or use direct message passing in streaming data projects.
  2. Balance cost, performance, and reusability when deciding whether to create new Kafka topics for your system.
  3. Choose technologies that support your desired data exchange approach to avoid limitations in system design.
0 implied HN points 21 Aug 23
  1. Message metadata is as important as core datasets in data streaming systems.
  2. Different approaches like including metadata in the payload, using a message envelope, or leveraging message headers can be utilized to define message metadata.
  3. Message metadata can be defined in various places, such as payload metadata, message envelope, message headers, topic names, and message schema, each with its own advantages and considerations.
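The three main placement options for metadata can be contrasted in a short sketch (field names and values here are made up for illustration):

```python
import json

# Three places to carry message metadata: inline in the payload,
# in a wrapping envelope, or in transport headers (Kafka-header style).
payload = {"order_id": 42, "amount": 9.99}
meta = {"schema_version": "2", "produced_at": "2023-08-21T00:00:00Z"}

# 1. Metadata embedded in the payload itself (pollutes the domain schema).
inline = {**payload, "_meta": meta}

# 2. Envelope: metadata wraps the untouched payload (consumers must unwrap).
envelope = {"metadata": meta, "data": payload}

# 3. Headers: metadata rides alongside the value; Kafka header values
#    are bytes on the wire, readable without deserializing the payload.
headers = [(k, str(v).encode()) for k, v in meta.items()]
value = json.dumps(payload).encode()

print(envelope["data"]["order_id"], dict(headers)["schema_version"])  # 42 b'2'
```

Headers let brokers, routers, and tooling inspect metadata cheaply, while the envelope keeps everything in one self-describing document; which trade-off wins depends on who needs to read the metadata and when.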