Data Streaming Journey

Data Streaming Journey explores data streaming, stream processing, challenges, and solutions in real-time data pipelines, with insights on technologies like Change Data Capture, Redpanda, and Apache Flink. It features interviews with industry innovators, reviews of data platforms, and discussions on design patterns for efficient data handling.

Data Streaming Techniques, Stream Processing, Data Ingestion Methods, Real-time Data Pipelines, Technology Reviews, Data Architecture, Design Patterns in Data Streaming, Industry Interviews, Event-Driven Architectures, State Management in Streaming Systems

The hottest Substack posts of Data Streaming Journey

And their main takeaways
79 implied HN points 28 Oct 24
  1. Kafka and similar tools are still relevant and necessary for effective data streaming today. They help handle large amounts of data quickly and reliably.
  2. Modern tools that build on or complement Kafka, like Materialize and Debezium, simplify working with operational data and make it easier to integrate with other tools.
  3. Even if you only want to move data from a database to a data warehouse, using a streaming platform can benefit larger enterprises by making data integration more efficient.
237 implied HN points 31 Jul 23
  1. Change Data Capture (CDC) is a powerful data ingestion technique, especially log-based CDC.
  2. CDC can lead to issues in data modeling if not handled properly.
  3. Consider using the Transactional Outbox pattern with CDC for reliable delivery of application or domain events.
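The Transactional Outbox pattern mentioned above can be sketched roughly as follows (a minimal illustration using SQLite; the table and column names are hypothetical, and in practice a log-based CDC tool such as Debezium would stream the outbox table to consumers):

```python
import json
import sqlite3

# Illustrative outbox sketch: the business write and the outbox event
# commit in ONE transaction, so CDC never sees an event without its data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event_type TEXT, payload TEXT);
""")

def place_order(order_id: int) -> None:
    with conn:  # single transaction: both rows commit, or neither does
        conn.execute("INSERT INTO orders (id, status) VALUES (?, ?)",
                     (order_id, "PLACED"))
        conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                     ("OrderPlaced", json.dumps({"order_id": order_id})))

place_order(42)
events = conn.execute("SELECT event_type, payload FROM outbox").fetchall()
print(events)  # [('OrderPlaced', '{"order_id": 42}')]
```

The key property is atomicity: the domain event only exists if the state change it describes was committed.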
138 implied HN points 12 Oct 23
  1. WarpStream eliminates local disks, making deployment easier and handling partition rebalancing automatically.
  2. The WarpStream team prioritizes concrete use cases when implementing features, running continuous workloads in staging and production environments.
  3. WarpStream emphasizes correctness testing, inspired by simulation testing concepts like Jepsen, to ensure data integrity and fault tolerance.
178 implied HN points 17 Jul 23
  1. Exactly-once delivery remains hard to achieve in practice, and getting it wrong can still result in data loss.
  2. Consider why you need exactly-once delivery and explore alternatives such as upserts and deduplication.
  3. Using upserts with data stores supporting this feature or implementing deduplication with a stateful streaming processor may be better options than focusing solely on exactly-once delivery.
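The deduplication alternative above can be illustrated with a minimal sketch (a stand-in for a stateful streaming processor's keyed state; the event shape and function names are made up for illustration):

```python
# Dedup sketch: drop events whose id was already seen, so an
# at-least-once source still yields effectively-once downstream effects.
def deduplicate(events, seen=None):
    seen = set() if seen is None else seen
    for event in events:
        if event["id"] in seen:
            continue  # duplicate delivery from the at-least-once source
        seen.add(event["id"])
        yield event

# The same event delivered twice is emitted only once.
incoming = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "a"}]
unique = list(deduplicate(incoming))
print([e["id"] for e in unique])  # [1, 2]
```

In a real pipeline the `seen` set would live in durable, keyed operator state (and be bounded with a TTL) rather than in process memory; upserts push the same idea into the data store by making redelivery idempotent on the primary key.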
99 implied HN points 13 Nov 23
  1. Flink Forward conference took place in Seattle with major industry practitioners like Netflix and Apple.
  2. Ververica announced its Cloud offering is now GA and introduced a faster Flink runtime assembly.
  3. The conference discussed unifying stream and batch processing, with a focus on dataset support for different query types.
79 implied HN points 12 Dec 23
  1. Materialize focuses on operational data warehouse positioning to solve problems faced by traditional data warehouses.
  2. Materialize offers an entirely SQL-based system for a familiar user experience, while also supporting hybrid analytical use cases.
  3. Materialize plans to support User-Defined Functions (UDFs) using WASM in the future for more extensibility.
59 implied HN points 29 Jan 24
  1. Building an Ingestion HTTP API for Kafka is a popular data ingestion technique.
  2. A custom Ingestion HTTP API offers more control over delivery semantics and data handling compared to off-the-shelf solutions.
  3. Consider the trade-offs between synchronous and asynchronous write models when implementing an Ingestion HTTP API.
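The synchronous-versus-asynchronous trade-off above can be sketched like this (a toy illustration, not the post's implementation; `FakeProducer` stands in for a real Kafka client, and the handler names are invented):

```python
# Sketch of the sync-vs-async write models behind a custom ingestion HTTP API.
class FakeProducer:
    def __init__(self):
        self.log = []
    def send(self, topic, value):
        self.log.append((topic, value))
        return "ack"  # a real Kafka client returns a future to await

def handle_sync(producer, payload):
    # Synchronous write: block until the broker acknowledges, then return 200.
    # Stronger delivery guarantee, higher request latency.
    ack = producer.send("ingest", payload)
    return 200 if ack == "ack" else 503

def handle_async(producer, buffer, payload):
    # Asynchronous write: buffer locally and return 202 immediately;
    # a background task would flush the buffer to the broker.
    # Lower latency, but the client's 202 is not a durability guarantee.
    buffer.append(payload)
    return 202

producer, buffer = FakeProducer(), []
print(handle_sync(producer, b"event-1"))           # 200
print(handle_async(producer, buffer, b"event-2"))  # 202
```

The choice determines what an HTTP success code actually promises the caller, which is exactly the control an off-the-shelf proxy may not give you.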
118 implied HN points 24 Jul 23
  1. Redpanda offers superior performance over Apache Kafka with higher throughput, lower latencies, and cost savings.
  2. Redpanda's Kafka API compatibility, Redpanda Console functionality, and being packaged as a single binary are standout features.
  3. Areas for improvement in Redpanda include Schema Registry management, metered pricing for serverless options, and consideration for making Tiered Storage free.
79 implied HN points 06 Oct 23
  1. The Current 2023 conference in San Jose, California had disappointing keynotes from Confluent.
  2. Streaming databases are gaining traction but may not cover all data streaming challenges.
  3. The discovery of Conduktor and the presentation on Restate were highlights of the conference.
39 implied HN points 15 Jan 24
  1. Modern data streaming frameworks usually focus on dataflow topologies with sources, transformations, and sinks.
  2. Each actor processes its messages serially, but running many actors in parallel yields high concurrency.
  3. Actors can be used in data streaming scenarios, with possibilities for handling messaging, state management, and emitting events.
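A toy actor can make the serial-processing point concrete (an illustrative sketch, not any particular framework's API; the class and method names are invented):

```python
from queue import Queue

# Toy actor: a mailbox drained one message at a time, private state,
# and emitted events. Concurrency comes from running many such actors
# in parallel, each still serial internally.
class CounterActor:
    def __init__(self):
        self.mailbox = Queue()
        self.count = 0     # private state, touched only by this actor
        self.emitted = []  # events this actor emits downstream

    def tell(self, msg):
        self.mailbox.put(msg)

    def run(self):
        # Serial processing: handle exactly one message at a time.
        while not self.mailbox.empty():
            msg = self.mailbox.get()
            self.count += 1
            self.emitted.append(("counted", msg, self.count))

actor = CounterActor()
for m in ["a", "b", "c"]:
    actor.tell(m)
actor.run()
print(actor.count)  # 3
```

Because no two messages for the same actor are processed concurrently, state updates need no locks, which maps naturally onto per-key processing in a streaming job.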
59 implied HN points 14 Aug 23
  1. Organizational dichotomy in streaming teams can lead to conflicts and challenges between infrastructure and stream processing teams.
  2. Infrastructure teams prioritize uptime and cost, while stream processing teams aim to innovate and deliver high-impact projects.
  3. Collaboration strategies include treating teams as partners, focusing on innovation, and considering reorganization when necessary.
59 implied HN points 07 Aug 23
  1. Changelog data streams have primary keys and operations like insert, update, or delete, borrowed from databases.
  2. Append-only data streams, like clickstream and telemetry data, are time-series and don't support updates or deletes.
  3. Stream-table duality is essential for modern data integration, allowing databases and stream-processing platforms to work together efficiently.
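Stream-table duality can be shown in a few lines: replaying a changelog stream keyed by primary key materializes the current table state (event shapes here are illustrative):

```python
# Replaying a changelog (insert / update / delete per primary key)
# reconstructs the table; the stream IS the table's history.
changelog = [
    {"op": "insert", "key": "u1", "value": {"name": "Ada"}},
    {"op": "insert", "key": "u2", "value": {"name": "Bob"}},
    {"op": "update", "key": "u1", "value": {"name": "Ada L."}},
    {"op": "delete", "key": "u2", "value": None},
]

def materialize(stream):
    table = {}
    for event in stream:
        if event["op"] == "delete":
            table.pop(event["key"], None)
        else:  # insert and update both upsert by primary key
            table[event["key"]] = event["value"]
    return table

print(materialize(changelog))  # {'u1': {'name': 'Ada L.'}}
```

An append-only stream, by contrast, has no primary key to upsert on: every record is a new fact, which is why updates and deletes don't apply.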
39 implied HN points 30 Oct 23
  1. Blackhole Sink Pattern is a simpler way to conduct blue-green deployments for streaming applications.
  2. Zero-downtime deployments can be achieved in streaming applications using the Blackhole Sink Pattern.
  3. Important considerations for the Blackhole Sink Pattern include managing other types of outputs and ensuring data sources allow concurrent consumers.
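The Blackhole Sink idea above can be sketched as follows (an illustrative toy, assuming the pattern as described; class names are invented): the new "green" deployment consumes the same source and builds its state, but its sink discards output until cutover, so only one deployment ever writes results.

```python
# Blue-green sketch: both pipelines run, but the green one's sink
# is a black hole until cutover, giving zero-downtime switchover.
class RealSink:
    def __init__(self):
        self.written = []
    def write(self, record):
        self.written.append(record)

class BlackholeSink:
    """Accepts records and drops them; swapped for the real sink at cutover."""
    def write(self, record):
        pass

def run_pipeline(source, sink):
    for record in source:
        sink.write(record.upper())  # stand-in transformation

source = ["a", "b"]
blue, green = RealSink(), BlackholeSink()
run_pipeline(source, blue)   # live deployment emits results
run_pipeline(source, green)  # new deployment warms up, emits nothing
print(blue.written)  # ['A', 'B']
```

As the takeaways note, this only works if the source allows concurrent consumers and every side effect of the green pipeline (not just the main sink) is suppressed until cutover.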
39 implied HN points 19 Oct 23
  1. Proton is a unified streaming and historical data processing engine that aims to make analytics workloads fast and efficient.
  2. Proton chose ClickHouse as its foundation due to its speed, extensibility, and the team's experience with C++.
  3. Proton supports SQL as the primary interface but also allows for extending capabilities through user-defined functions and embedded JavaScript.
19 implied HN points 18 Sep 23
  1. State and timers are essential building blocks for understanding stateful stream processing.
  2. State is like an in-memory variable but backed by distributed storage and comes in different types like keyed, global, key-value, list, map, and versioned/timestamped.
  3. Timers are crucial for controlling state size in streaming systems and can be used for defining windows and scheduling actions.
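The interplay of keyed state and timers can be sketched in a few lines (in the spirit of Flink's keyed state and timer service, but the class and method names here are invented for illustration):

```python
# Keyed state plus timers: each event (re)schedules a cleanup timer;
# when the watermark passes a timer, the key's state is dropped,
# keeping total state size bounded.
class KeyedCounter:
    def __init__(self, ttl):
        self.ttl = ttl
        self.state = {}   # keyed state: key -> count
        self.timers = {}  # key -> expiry timestamp

    def on_event(self, key, timestamp):
        self.state[key] = self.state.get(key, 0) + 1
        self.timers[key] = timestamp + self.ttl  # (re)schedule cleanup

    def on_watermark(self, watermark):
        # Fire expired timers: drop state for idle keys.
        for key in [k for k, t in self.timers.items() if t <= watermark]:
            del self.state[key]
            del self.timers[key]

op = KeyedCounter(ttl=10)
op.on_event("user-1", timestamp=100)
op.on_event("user-2", timestamp=105)
op.on_watermark(112)     # user-1's timer (110) fires; user-2's (115) does not
print(sorted(op.state))  # ['user-2']
```

Windows are essentially this same mechanism: state accumulates per key, and a timer at the window's end emits the result and clears the state.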
1 HN point 05 Sep 23
  1. The current data platform architectures are complex and fragmented, lacking a unified tool for data processing.
  2. There is a need for a more consolidated data platform that combines a streaming log, lakehouse, and OLAP engine into a single tool.
  3. The industry is trending towards cloud-native architectures and technologies like streaming databases and Apache Arrow for more efficient and scalable data processing.
0 implied HN points 27 Nov 23
  1. Consider whether to materialize data transformations or use direct message passing in streaming data projects.
  2. Balance cost, performance, and reusability when deciding whether to create new Kafka topics for your system.
  3. Choose technologies that support your desired data exchange approach to avoid limitations in system design.
0 implied HN points 21 Aug 23
  1. Message metadata is as important as core datasets in data streaming systems.
  2. Different approaches like including metadata in the payload, using a message envelope, or leveraging message headers can be utilized to define message metadata.
  3. Message metadata can be defined in various places, such as payload metadata, message envelope, message headers, topic names, and message schema, each with its own advantages and considerations.
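The three main placement options for metadata can be contrasted in a short sketch (field names and values here are made up for illustration):

```python
import json

# Three places to carry message metadata: inline in the payload,
# in a wrapping envelope, or in transport headers (Kafka-header style).
payload = {"order_id": 42, "amount": 9.99}
meta = {"schema_version": "2", "produced_at": "2023-08-21T00:00:00Z"}

# 1. Metadata embedded in the payload itself (pollutes the domain schema).
inline = {**payload, "_meta": meta}

# 2. Envelope: metadata wraps the untouched payload (consumers must unwrap).
envelope = {"metadata": meta, "data": payload}

# 3. Headers: metadata rides alongside the value; Kafka header values
#    are bytes on the wire, readable without deserializing the payload.
headers = [(k, str(v).encode()) for k, v in meta.items()]
value = json.dumps(payload).encode()

print(envelope["data"]["order_id"], dict(headers)["schema_version"])  # 42 b'2'
```

Headers let brokers, routers, and tooling inspect metadata cheaply, while the envelope keeps everything in one self-describing document; which trade-off wins depends on who needs to read the metadata and when.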