The hottest Real-Time Processing Substack posts right now

And their main takeaways
Impertinent 59 implied HN points 27 Oct 24
  1. AI models should learn to think carefully before speaking. This helps them provide better responses and avoid mistakes.
  2. Sometimes, AI doesn't need to say anything at all to be helpful. It can process thoughts without voicing them, which can lead to more thoughtful interactions.
  3. In real-time voice systems, it's important to manage what the AI says. Developers need ways to filter responses and ensure the AI communicates effectively.
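A minimal sketch of that kind of response gate, with every name hypothetical: the model's draft reply is screened before it reaches text-to-speech, and is silently dropped when voicing it would add nothing.

```python
# Hypothetical gate for a real-time voice agent: decide whether a draft
# reply should be voiced at all before handing it to text-to-speech.
BLOCKED_PHRASES = {"as an ai language model", "i cannot answer"}

def should_speak(draft: str) -> bool:
    text = draft.strip().lower()
    if not text:
        return False  # nothing useful to say: stay silent
    return not any(phrase in text for phrase in BLOCKED_PHRASES)

def handle_model_output(draft: str, tts) -> None:
    if should_speak(draft):
        tts.speak(draft)  # voice the response
    # Otherwise the model's "thought" is processed but never voiced.
```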
VuTrinh. 279 implied HN points 17 Aug 24
  1. Facebook's real-time data processing systems must handle huge volumes of data with only a few seconds of latency. This keeps things running smoothly for users.
  2. Their systems are connected through a message bus called Scribe, which decouples the components and makes it easier to manage data flow and recover from errors when they arise.
  3. Tools like Puma and Stylus let developers build applications at different levels of abstraction: Puma offers a SQL-like interface for quick development, while Stylus is a lower-level C++ framework for more complex processing. Teams can pick the right tool and iterate on their applications quickly.
VuTrinh. 199 implied HN points 20 Jul 24
  1. Kafka producers are responsible for sending messages to servers. They prepare the messages, choose where to send them, and then actually send them to the Kafka brokers.
  2. There are different ways to send messages: fire-and-forget, synchronous, and asynchronous. Each method has its pros and cons, depending on whether you want speed or reliability.
  3. Producers control message acknowledgment with the 'acks' parameter, which determines when a message counts as successfully sent. The options trade safety for speed: acks=0 (no acknowledgment), acks=1 (the partition leader confirms), and acks=all (every in-sync replica confirms).
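Those three send styles and the acks setting map directly onto the kafka-python producer API. A minimal sketch, assuming a local broker and a topic named events (both placeholders):

```python
from kafka import KafkaProducer

# acks=0 is fire-and-forget at the broker level, acks=1 waits for the leader,
# acks="all" waits for every in-sync replica (safest, slowest).
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")

# Fire-and-forget: send and never check the outcome.
producer.send("events", b"order-created")

# Synchronous: block until the broker acknowledges, or raise on failure.
metadata = producer.send("events", b"order-paid").get(timeout=10)
print("stored in partition", metadata.partition, "at offset", metadata.offset)

# Asynchronous: register callbacks instead of blocking.
future = producer.send("events", b"order-shipped")
future.add_callback(lambda md: print("stored at offset", md.offset))
future.add_errback(lambda exc: print("delivery failed:", exc))

producer.flush()  # make sure buffered messages actually go out before exit
```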
VuTrinh. 319 implied HN points 08 Jun 24
  1. LinkedIn processes around 4 trillion events every day, using Apache Beam to unify their streaming and batch data processing. This helps them run pipelines more efficiently and save development time.
  2. By switching to Apache Beam, LinkedIn significantly improved their performance metrics. For example, one pipeline's processing time went from over 7 hours to just 25 minutes.
  3. Their anti-abuse systems became much faster with Beam, reducing the time taken to identify abusive actions from a day to just 5 minutes. This increase in efficiency greatly enhances user safety and experience.
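Beam's unified model is what makes this possible: the same transform code runs in batch or streaming mode depending only on the source it is attached to. A minimal Python sketch (paths and the Pub/Sub topic are placeholders, not LinkedIn's pipelines; the streaming branch would also need the --streaming pipeline option):

```python
import apache_beam as beam

STREAMING = False  # flip the source; the transform logic stays identical

with beam.Pipeline() as p:
    if STREAMING:
        # Unbounded source: process events as they arrive.
        raw = p | beam.io.ReadFromPubSub(topic="projects/demo/topics/events")
    else:
        # Bounded source: process a historical dump from object storage.
        raw = p | beam.io.ReadFromText("gs://demo-bucket/events/*.csv")

    (
        raw
        # Pub/Sub yields bytes, text files yield str; normalize to str.
        | "Decode" >> beam.Map(lambda m: m.decode("utf-8") if isinstance(m, bytes) else m)
        # Each line looks like "user_id,action"; count events per user.
        | "KeyByUser" >> beam.Map(lambda line: (line.split(",")[0], 1))
        | "CountPerUser" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```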
VuTrinh. 339 implied HN points 25 May 24
  1. Twitter processes an incredible 400 billion events daily, using a mix of technologies for handling large data flows. They built special tools to ensure they can keep up with all this information in real-time.
  2. After facing challenges with their old setup, Twitter switched to a new architecture that simplified operations. This new system allows them to handle data much faster and more efficiently.
  3. With the new system, Twitter achieved lower latency and fewer errors in data processing. This means they can get more accurate results and better manage their resources than before.
VuTrinh. 659 implied HN points 23 Mar 24
  1. Uber handles huge amounts of data by processing real-time information from drivers, riders, and restaurants. This helps them make quick decisions, like adjusting prices based on demand.
  2. They use a mix of open-source tools, like Apache Kafka for data streaming and Apache Flink for processing (see the sketch after this list), which lets them scale their operations smoothly as the business grows.
  3. Uber's infrastructure prioritizes data consistency, high availability, and quick response times, so they need systems that stay reliable even under heavy data load.
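In the same spirit, here is a toy PyFlink sketch of the demand-counting idea behind surge pricing. The data and names are made up, and a real Uber-style job would consume from Kafka rather than an in-memory collection:

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

# Stand-in for a Kafka stream of ride requests: (zone, count) pairs.
ride_requests = env.from_collection([
    ("downtown", 1), ("airport", 1), ("downtown", 1), ("downtown", 1),
])

# Running count of demand per zone, the raw signal behind price adjustments.
demand = ride_requests.key_by(lambda e: e[0]) \
                      .reduce(lambda a, b: (a[0], a[1] + b[1]))

demand.print()
env.execute("zone_demand_count")
```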
VuTrinh. 119 implied HN points 16 Jul 24
  1. Meta uses a complex data warehouse to manage millions of tables and keeps data only as long as it's needed. Data is organized into namespaces for efficient querying.
  2. They built tools like iData for data discovery and Scuba for real-time analytics. These tools help engineers find and analyze data quickly.
  3. Data engineers at Meta develop pipelines mainly with SQL and Python, using internal tools for orchestration and monitoring to ensure everything runs smoothly.
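Meta's internal tools aren't public, but the general shape of a SQL-plus-Python pipeline step is familiar. A generic sketch using sqlite3 as a stand-in warehouse engine, with hypothetical table and column names:

```python
import sqlite3

def build_daily_active_users(conn: sqlite3.Connection, ds: str) -> None:
    """Recompute one date partition of a derived table from raw events."""
    with conn:
        # Idempotent: clear the partition, then rebuild it with plain SQL.
        conn.execute("DELETE FROM daily_active_users WHERE ds = ?", (ds,))
        conn.execute(
            """
            INSERT INTO daily_active_users (ds, user_count)
            SELECT ?, COUNT(DISTINCT user_id) FROM events WHERE ds = ?
            """,
            (ds, ds),
        )

conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
conn.execute("CREATE TABLE events (ds TEXT, user_id TEXT)")
conn.execute("CREATE TABLE daily_active_users (ds TEXT, user_count INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("2024-07-16", "u1"), ("2024-07-16", "u2"), ("2024-07-16", "u1")])
build_daily_active_users(conn, "2024-07-16")
print(conn.execute("SELECT * FROM daily_active_users").fetchall())
```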
SUP! Hubert’s Substack 50 implied HN points 22 Nov 24
  1. Shift-left analytics means doing analysis early in the data process. This helps in getting insights faster and making quick decisions.
  2. It focuses on checking data quality right away, so only reliable data flows downstream (see the sketch after this list). This leads to more accurate insights and avoids problems caused by bad data.
  3. Collaboration between teams is encouraged in this approach. By working together from the start, everyone can ensure their analyses are useful and aligned with business goals.
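What "checking quality right away" can look like in practice: a small Python sketch that validates records at ingestion time, before anything downstream sees them. The field names and rules are hypothetical:

```python
def validate(record: dict) -> list[str]:
    """Return a list of quality problems; empty means the record is clean."""
    problems = []
    if not record.get("user_id"):
        problems.append("missing user_id")
    if record.get("amount", 0) < 0:
        problems.append("negative amount")
    return problems

def ingest(records: list[dict]) -> list[dict]:
    clean, quarantined = [], []
    for r in records:
        (quarantined if validate(r) else clean).append(r)
    # Only clean records flow downstream; bad ones are set aside for review.
    print(f"accepted {len(clean)}, quarantined {len(quarantined)}")
    return clean

ingest([{"user_id": "u1", "amount": 9.5}, {"user_id": "", "amount": -2}])
```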
Practical Data Engineering Substack 59 implied HN points 01 Oct 23
  1. You can improve data accuracy by running two pipelines side by side: an incremental one that picks up recent changes quickly, and a periodic full-load one that reloads the entire dataset. Together they keep the data reliable over time.
  2. It's essential to schedule pipelines based on your business's needs, like how often you need fresh data. You can favor faster incremental updates or less frequent full reloads depending on how critical the data is.
  3. Tools like Apache Airflow can organize these pipelines efficiently: generating tasks dynamically from a list of tables makes it much easier to handle many of them, as sketched below.
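The dynamic-generation trick looks roughly like this in Airflow: one loop stamps out a reload task per table instead of hand-writing each one. The table list and load logic are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

TABLES = ["orders", "customers", "payments"]  # hypothetical table list

def full_reload(table_name: str) -> None:
    print(f"reloading {table_name} from source")  # real load logic goes here

with DAG(
    dag_id="nightly_full_reload",
    start_date=datetime(2023, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for table in TABLES:
        # One task per table, generated dynamically from the list above.
        PythonOperator(
            task_id=f"reload_{table}",
            python_callable=full_reload,
            op_kwargs={"table_name": table},
        )
```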
The Orchestra Data Leadership Newsletter 0 implied HN points 19 Oct 23
  1. The evolution of data engineering tools can be likened to limits in mathematics: batch processes tend toward 'streaming' use cases in the limit, and Lakehouses play a central role in that transition.
  2. Databricks, developed by the creators of Apache Spark, excels at loading data from Data Lakes, handling schemas, and treating data sources as streams, which makes it a valuable tool for data processing.
  3. While Databricks offers advanced capabilities in data ingestion, transformation, and machine learning operations, specific real-time use cases may still call for custom infrastructure, so tools like Databricks deserve a nuanced evaluation.
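"Treating data sources as streams" has a concrete Spark expression: readStream turns files landing in a lake path into an unbounded input. A minimal PySpark sketch with placeholder paths and schema (plain parquet stands in for Delta so the example runs outside Databricks):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake_as_stream").getOrCreate()

# Every new file arriving under this path becomes streaming input.
events = (
    spark.readStream
    .format("json")
    .schema("user_id STRING, amount DOUBLE, ts TIMESTAMP")
    .load("/lake/raw/events/")
)

# Continuously append the stream to a target path, with a checkpoint
# so the query can recover its progress after a restart.
query = (
    events.writeStream
    .format("parquet")
    .option("checkpointLocation", "/lake/_checkpoints/events/")
    .outputMode("append")
    .start("/lake/bronze/events/")
)
query.awaitTermination()
```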
DataSketch’s Substack 0 implied HN points 13 Feb 24
  1. Databases are key for storing and managing data, supporting both everyday transactions and complex analysis. Using them effectively helps data engineers connect different platforms and applications.
  2. Different data transfer methods, like REST and RPC, help systems communicate efficiently: REST works like a well-organized library you browse, while RPC is more like a quick, direct phone call. Choosing the right method depends on the speed and precision the task needs.
  3. Message-passing systems allow for flexible and real-time data processing, making them great for applications like IoT or e-commerce. They help ensure communications between services happen smoothly and reliably.
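A self-contained sketch of the message-passing idea using only Python's standard library: a producer publishes IoT-style readings onto a queue and a consumer processes them independently, so neither side blocks the other. The device names are made up, and the in-process queue stands in for a real broker:

```python
import queue
import threading

bus: "queue.Queue[dict]" = queue.Queue()  # stand-in for a real message broker

def sensor_producer() -> None:
    for reading in ({"device": "thermo-1", "temp_c": 21.5},
                    {"device": "thermo-2", "temp_c": 22.1}):
        bus.put(reading)   # publish and move on; no waiting for consumers
    bus.put(None)          # sentinel: no more messages

def consumer() -> None:
    while (msg := bus.get()) is not None:
        print(f"processing {msg['device']}: {msg['temp_c']}°C")
        bus.task_done()

t = threading.Thread(target=consumer)
t.start()
sensor_producer()
t.join()
```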
Bytewax 0 implied HN points 12 Oct 23
  1. Polling HTTP endpoints is crucial for real-time data retrieval in industries like e-commerce and finance.
  2. Bytewax provides a periodic-input mechanism for efficiently polling and streaming data in real time from HTTP endpoints.
  3. By combining Python scripts with the Bytewax library, developers can build complete pipelines for real-time data processing, as sketched below.
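That periodic-input mechanism corresponds to SimplePollingSource in recent Bytewax releases (the endpoint URL is a placeholder, and the connector names reflect the current API rather than the 2023 version the post used):

```python
from datetime import timedelta

import requests
import bytewax.operators as op
from bytewax.connectors.stdio import StdOutSink
from bytewax.dataflow import Dataflow
from bytewax.inputs import SimplePollingSource

class HTTPPollingSource(SimplePollingSource):
    """Poll an HTTP endpoint on a fixed interval, emitting each JSON payload."""
    def next_item(self):
        resp = requests.get("https://example.com/api/prices")  # placeholder URL
        resp.raise_for_status()
        return resp.json()

flow = Dataflow("http_poller")
# Re-poll every 10 seconds; each response becomes one item in the stream.
prices = op.input("poll", flow, HTTPPollingSource(timedelta(seconds=10)))
op.output("print", flow, prices, StdOutSink())
# Run with: python -m bytewax.run this_module:flow
```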