The hottest Distributed Systems Substack posts right now

And their main takeaways

The Overview Of Apache Spark

VuTrinh. • 879 implied HN points • 07 Sep 24

Apache Spark is a powerful tool for processing large amounts of data quickly. It does this by using many computers to work on the data at the same time.
A Spark application has different parts, like a driver that directs processing and executors that do the work. This helps organize tasks and manage workloads efficiently.
The main data unit in Spark is the RDD, which stands for Resilient Distributed Dataset. RDDs are important because they make data processing flexible and let Spark recover lost data if something goes wrong.

Scaling Distributed Systems with the Scatter-Gather Pattern

Engineering At Scale • 60 implied HN points • 15 Feb 25

🕹 Technology Distributed Systems Microservices Cloud Computing Software Design System Architecture

The Scatter-Gather pattern helps speed up data retrieval by splitting requests to multiple servers at once, rather than one after the other. This makes systems respond faster, especially when lots of data is needed.
Using this pattern can improve system efficiency by preventing wasted time waiting for responses from each service. This means the system can handle more requests at once.
However, implementing Scatter-Gather can be tricky. It requires careful handling of errors and managing different data sources to ensure the information is accurate and reliable.
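The pattern described above can be sketched with Python's asyncio. This is a minimal illustration, not code from the post: `query_backend`, the backend names, the delays, and the timeout are all hypothetical stand-ins for real service calls.

```python
import asyncio

# Hypothetical backend call: sleeps to mimic network latency, then returns data.
async def query_backend(name: str, delay: float, fail: bool = False) -> str:
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} unavailable")
    return f"result from {name}"

async def scatter_gather(backends, timeout: float = 1.0):
    # Scatter: start every request at once instead of one after the other.
    tasks = [asyncio.create_task(query_backend(*b)) for b in backends]
    # Gather: wait for all of them, bounded by one overall timeout.
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()  # give up on stragglers
    results, errors = [], []
    for task in tasks:  # iterate in request order for stable output
        if task in pending:
            errors.append("timeout")
        elif task.exception() is not None:
            errors.append(str(task.exception()))
        else:
            results.append(task.result())
    return results, errors

# Assumed example data: two healthy backends and one that fails.
BACKENDS = [("a", 0.05), ("b", 0.05), ("c", 0.05, True)]
results, errors = asyncio.run(scatter_gather(BACKENDS))
print(results, errors)
```

Total latency is roughly that of the slowest backend rather than the sum of all three, and the failing backend is reported separately instead of sinking the whole request.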

Queues offer more than orderly processing

System Design Classroom • 679 implied HN points • 02 Jul 24

🕹 Technology System Design Architecture Microservices Distributed Systems Scalability

Queues help different parts of a system work independently. This means you can change one part without affecting the others, making updates easier.
They improve a system's ability to handle more users at once. You can add more servers to take in requests without needing to instantly boost how fast they are processed.
Queues also keep things running smoothly during busy times. They act like a waiting area, holding tasks so no work gets lost even if things get too hectic.
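The "waiting area" idea can be sketched with Python's standard `queue` module. This is a rough, single-process illustration; the burst size, worker count, and the doubling "work" are assumptions chosen for the example.

```python
import queue
import threading

tasks = queue.Queue()      # the buffer between producer and consumers
processed = []
lock = threading.Lock()

def worker():
    while True:
        item = tasks.get()
        if item is None:   # sentinel: no more work for this worker
            tasks.task_done()
            break
        with lock:
            processed.append(item * 2)   # stand-in for real work
        tasks.task_done()

# Producer side: a burst of 100 requests is accepted immediately,
# before any consumer has started.
for i in range(100):
    tasks.put(i)

# Consumer side: workers drain the backlog at their own pace.
workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for _ in workers:
    tasks.put(None)
tasks.join()
for w in workers:
    w.join()
```

The producer never waits for a worker, and the number of workers can change without touching the producer code.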

Diving Deep into LinkedIn's Data Infrastructure: My 6-Hour Learning & Key Takeaways

VuTrinh. • 299 implied HN points • 03 Aug 24

🕹 Technology Data Engineering Software Architecture Databases Distributed Systems Cloud Computing

LinkedIn's data infrastructure is organized into three main tiers: data, service, and display. This setup helps the system to scale easily without moving data around.
Voldemort is LinkedIn's key-value store that efficiently handles high-traffic queries and allows easy scaling by adding new nodes without downtime.
Databus is a change data capture system that keeps LinkedIn's databases synchronized across applications, allowing for quick updates and consistent data flow.

Apache Kafka - Overview

VuTrinh. • 539 implied HN points • 06 Jul 24

🕹 Technology Data Engineering Software Development Systems Architecture Distributed Systems

Apache Kafka is a system for handling large volumes of messages, making it easier for companies like LinkedIn to manage and analyze user activity and other important metrics.
In Kafka, messages are organized into topics and divided into partitions, allowing for better performance and scalability. This way, different servers can handle parts of the data at once.
Kafka uses a pull model for consumers, meaning they can request data as they need it. This helps prevent overwhelming the consumers with too much data at once.
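A toy in-memory model can make topics, partitions, and the pull model concrete. This is not the Kafka client API: `ToyTopic`, `append`, and `poll` are hypothetical names, and the CRC32 hash is a stand-in for whatever partitioner a real producer uses.

```python
import zlib

class ToyTopic:
    """In-memory stand-in for a partitioned topic (illustration only)."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: str) -> int:
        # The same key always maps to the same partition,
        # so order is preserved per key.
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append(value)
        return p

    def poll(self, partition: int, offset: int, max_records: int):
        # Pull model: the consumer decides where to read from
        # and how many records it is ready to handle.
        return self.partitions[partition][offset:offset + max_records]

topic = ToyTopic(3)
p = topic.append("user-1", "click")
topic.append("user-1", "scroll")
topic.append("user-1", "purchase")

offset = 0
batch = topic.poll(p, offset, max_records=2)   # ask for at most 2 records
offset += len(batch)                           # consumer tracks its own position
rest = topic.poll(p, offset, max_records=2)
```

Because the consumer holds the offset and sets `max_records`, a slow consumer simply asks less often instead of being flooded.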

Apache Kafka - Important Designs

VuTrinh. • 259 implied HN points • 13 Jul 24

🕹 Technology Data Engineering Software Design Systems Architecture Distributed Systems Programming

Kafka uses the operating system's filesystem to store data, which helps it run faster by leveraging the page cache. This avoids the need to keep too much data in memory, making it simpler to manage.
Kafka reads and writes data sequentially, which is more efficient than random access. This design improves performance, as accessing data in sequence reduces delays.
Kafka groups messages together before sending them, which helps reduce the number of requests made to the system. This batching process improves performance by allowing larger, more efficient data transfers.

Apache Kafka - Producer

VuTrinh. • 199 implied HN points • 20 Jul 24

🕹 Technology Data Engineering Software Development Cloud Computing Distributed Systems Real-Time Processing

Kafka producers are responsible for sending messages to servers. They prepare the messages, choose where to send them, and then actually send them to the Kafka brokers.
There are different ways to send messages: fire-and-forget, synchronous, and asynchronous. Each method has its pros and cons, depending on whether you want speed or reliability.
Producers can control message acknowledgment with the 'acks' parameter to determine when a message is considered successfully sent. This parameter affects data safety, with options that range from no acknowledgment to full confirmation from all replicas.
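The effect of the `acks` setting can be modeled with a small function. This is a simplified model of the semantics, not producer code: `is_acknowledged` and its parameters are hypothetical, and only the setting values `0`, `1`, and `all` come from Kafka itself.

```python
def is_acknowledged(acks: str, leader_wrote: bool,
                    in_sync_followers: int, followers_confirmed: int) -> bool:
    """Toy model: when does the producer treat a send as successful?"""
    if acks == "0":
        # No acknowledgment: the producer does not wait for the broker at all.
        return True
    if acks == "1":
        # The leader alone must have written the record.
        return leader_wrote
    if acks == "all":
        # The leader and every in-sync follower must have the record.
        return leader_wrote and followers_confirmed >= in_sync_followers
    raise ValueError(f"unknown acks value: {acks!r}")

# Assumed scenario: the leader has written, but only 1 of 2 followers has confirmed.
fastest = is_acknowledged("0", leader_wrote=False, in_sync_followers=2, followers_confirmed=0)
leader_only = is_acknowledged("1", leader_wrote=True, in_sync_followers=2, followers_confirmed=1)
safest = is_acknowledged("all", leader_wrote=True, in_sync_followers=2, followers_confirmed=1)
```

In this scenario `acks=0` and `acks=1` already count the send as done, while `acks=all` is still waiting, which is the trade between speed and data safety.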

Setting the Record Straight: Debunking Anatoly's Misconceptions about Security in Distributed Ledger Networks

Dan Hughes • 339 implied HN points • 08 Jun 24

🔮 Crypto Blockchain Security Consensus Distributed Systems

The honest majority assumption is key for blockchain security. It means that most participants must act honestly to keep the network safe from attacks.
Full nodes rely on validator nodes to check the validity of transactions. If most validators are dishonest, full nodes cannot prevent issues like double spending.
Economic security is important for discouraging attacks on a network. High stakes for validators make it less likely for them to act maliciously, as the potential losses from being caught far outweigh any gains.

How does Uber handle petabytes of Spark shuffle data every day?

VuTrinh. • 159 implied HN points • 22 Jun 24

🕹 Technology Data Engineering Big Data Cloud Computing Software Development Distributed Systems

Uber uses a Remote Shuffle Service (RSS) to handle large amounts of Spark shuffle data more efficiently. This means data is sent to a remote server instead of being saved on local disks during processing.
By changing how data is transferred, the new system helps reduce failures and extend the lifespan of hardware. Servers can now handle more jobs without crashing, and SSDs last longer.
RSS also streamlines the process for the reduce tasks, as they now only need to pull data from one server instead of multiple ones. This saves time and resources, making everything run smoother.

Calm down about Service Weaver

Cloud Irregular • 1478 implied HN points • 03 Mar 23

🕹 Technology Distributed Systems Cloud Computing Programming Languages Software Development Microservices

Service Weaver is not a magic solution like some past middleware frameworks
Distributed systems are complex and need careful consideration, especially in the cloud
Service Weaver offers potential for Kubernetes deployments with a Golang-first focus

Finding the CAP theorem of agents

Sunday Letters • 59 implied HN points • 02 Jun 24

🕹 Technology Software AI Distributed Systems Engineering Programming

The CAP theorem shows that a distributed system can guarantee at most two of three things: consistency, availability, and partition tolerance. This means when things go wrong, you have to choose which one you're willing to sacrifice.
In AI programming, there's a similar tension between using complex AI models and the need for reliable, deterministic code. Balancing these two aspects is a challenge, much like the early challenges with web applications.
As technology evolves, the understanding and frameworks around these issues may improve. Just like how programmers now design around the CAP theorem, we might see better solutions and choices for AI challenges in the future.

A glimpse of Apache Pinot, the real-time OLAP system from LinkedIn

VuTrinh. • 99 implied HN points • 30 Mar 24

🕹 Technology Data Engineering Distributed Systems

Apache Pinot is a real-time OLAP system developed by LinkedIn that allows for fast analytics on large sets of data. It can handle tens of thousands of analytical queries per second while providing near-instant results.
The architecture is divided into key components like controllers, brokers, and servers which work together to process queries and manage data efficiently. Pinot is designed to quickly ingest and query fresh data from various sources, ensuring low latency.
Pinot supports various indexing strategies, like star-tree indexes, to optimize complex queries. This enables faster query responses by pre-aggregating data, making it easier to analyze large volumes of information.

Global Incremental ID

Hung's Notes • 79 implied HN points • 13 Dec 23

🕹 Technology Data Structures Distributed Systems Software Engineering Database Management

Global Incremental IDs are important for preventing ID collisions in distributed systems, especially during tasks like data backup and event ordering.
UUID and Snowflake ID are two common types of global IDs, each with its own advantages and disadvantages. For instance, UUIDs are larger (128 bits) but widely used, while Snowflake IDs are smaller (64 bits) but more complex to generate.
Different systems, like Sonyflake and Tinyid, offer specialized methods for generating IDs, helping to ensure performance and avoiding database bottlenecks.
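A Snowflake-style generator can be sketched in a few lines. It assumes the commonly described layout of 41 bits of millisecond timestamp, 10 bits of machine ID, and 12 bits of per-millisecond sequence. `EPOCH_MS` and the class name are hypothetical choices for this example, not values from the post.

```python
import threading
import time

EPOCH_MS = 1_700_000_000_000  # hypothetical custom epoch (any fixed past instant)

class SnowflakeGenerator:
    def __init__(self, machine_id: int):
        if not 0 <= machine_id < 1024:          # must fit in 10 bits
            raise ValueError("machine_id must be in [0, 1023]")
        self.machine_id = machine_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF   # 12-bit counter
                if self.sequence == 0:
                    # Counter exhausted for this millisecond: wait for the next one.
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(machine_id=7)
ids = [gen.next_id() for _ in range(5000)]
```

Each machine generates IDs without talking to a database, and IDs from one generator sort by creation time. The extra complexity compared with a UUID is visible here: machine IDs must be assigned, and the clock must not run backwards.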

Scalable OLTP in the Cloud: What's the BIG DEAL?

Scattered Thoughts on Distributed Systems • 105 implied HN points • 20 Dec 23

🕹 Technology Database Systems Cloud Computing Distributed Systems

Isolation semantics in the DB and application are crucial for the scalability of OLTP systems.
Common database and application patterns may unnecessarily limit scalability.
Rethinking how databases and applications are built can lead to more scalable OLTP systems.

SRE book notes: Wrapping Up

Bit Maybe Wise • 39 implied HN points • 13 Apr 23

🕹 Technology Engineering Software System reliability Automation Distributed Systems

The SRE book includes useful additional materials like availability tables and best practices.
The book covers a wide range of topics related to Site Reliability Engineering.
It emphasizes key practices such as embracing risk, eliminating toil, and effective troubleshooting.

Merging Can Be Made Eventually Consistent

Bram’s Thoughts • 19 implied HN points • 18 Dec 23

🕹 Technology Version Control Conflicts UX Distributed Systems

In distributed version control, there's a way to ensure consistent merging regardless of the order merges are done.
File states can be represented as a set of line positions with generation counts to determine the winning state during merging.
Handling conflicts in merging requires presenting changes in the order they'll appear to everyone, not based on 'local' or 'remote' changes.
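One way to read the generation-count idea is as a merge that takes, for each line position, the highest generation seen. The sketch below is an interpretation under that assumption, not the author's algorithm; the line identifiers and the "higher generation wins" rule are hypothetical.

```python
def merge(a: dict, b: dict) -> dict:
    # For every line position known to either side, keep the higher generation.
    return {line: max(a.get(line, 0), b.get(line, 0)) for line in a.keys() | b.keys()}

base = {"line-1": 1, "line-2": 1}
alice = {"line-1": 2, "line-2": 1}               # Alice changed line-1
bob = {"line-1": 1, "line-2": 1, "line-3": 1}    # Bob added line-3

one_order = merge(merge(base, alice), bob)
other_order = merge(bob, merge(alice, base))
```

Because taking a maximum is commutative and associative, both merge orders give the same state, which is the order-independence the first takeaway describes.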

AI Pitfalls and Promises: Surprising New Findings in Cognitive Models

ppdispatch • 2 implied HN points • 01 Nov 24

🕹 Technology AI Research Machine Learning Data Analysis Distributed Systems

Chain-of-thought prompting might actually make some tasks harder for AI, especially in visual tasks where less thinking works better.
The DAWN framework allows AI agents to work together globally in a secure way, which can lead to improved collaboration.
New mesomorphic networks are great for understanding tabular data and give clearer explanations, making them useful for various applications.

A Deep Dive into the Underlying Architecture of Groq's LPU

Confessions of a Code Addict • 4 HN points • 01 Mar 24

🕹 Technology Hardware Compiler Distributed Systems Scalability Reliability

Groq's LPU showcases an innovative design departing from traditional architectures, focusing on deterministic execution for enhanced performance.
The TSP architecture achieves determinism through a simplified hardware design, enabling precise scheduling by compilers for predictable performance.
Groq's approach to creating a distributed multi-TSP system eliminates non-determinism typical in networked systems, with the compiler efficiently managing data movement.

A Gentle Introduction to Kafka API

Tributary Data • 1 HN point • 16 Apr 24

🕹 Technology Programming Data Management Distributed Systems

Kafka started at LinkedIn and later evolved into Apache Kafka, maintaining its core functionalities. Various vendors offer their versions of Kafka but ensure the Kafka API remains consistent for compatibility.
Apache Kafka acts as a distributed commit log storing messages in fault-tolerant ways, while the Kafka API is the interface used to interact with Kafka for reading, writing, and administrative operations.
Kafka's structure involves brokers forming clusters, messages with keys and values, topics grouping messages, partitions dividing topics, and replication for fault tolerance. Understanding these architectural components is vital for working effectively with Kafka.

4 essential reads for devs this week

HackerPulse Dispatch • 2 implied HN points • 12 Mar 24

🕹 Technology Programming Software Development Collaboration Distributed Systems Coding

Visualize code complexity with 'dep-tree': Tool to map file dependencies and improve project structure
C++ programming safety balance: Efficiency vs. security, the challenge of writing safe code in C++
RFC significance: Structured approach for proposing features, enhancing software quality and developer collaboration

[ChatGPT-generated] Demystifying Distributed Systems: Unveiling the Inner Workings

PseudoFreedom • 5 implied HN points • 26 May 23

🕹 Technology Software Engineering Distributed Systems Architecture Challenges Benefits

Distributed systems use interconnected computers to work as one unit, enhancing performance and scalability.
Challenges in distributed systems include network communication, data consistency, and fault tolerance.
Benefits of distributed systems include scalability, high availability, and improved performance through collective computing.

My Notes on Google's "TrueTime"

Excited Technology Rambles • 1 HN point • 04 Jun 23

🕹 Technology Distributed Systems Database Management Software Engineering

Clock synchronization is a challenging problem in distributed systems.
TrueTime intervals help prevent incorrect transaction ordering.
Having intervals provides a bound for resolving transaction conflicts.
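The interval idea can be illustrated with a toy model. This is not the Spanner or TrueTime API: `TTInterval`, `interval_now`, and the uncertainty value `epsilon` are hypothetical names and numbers for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TTInterval:
    earliest: float   # true time is no earlier than this
    latest: float     # true time is no later than this

def interval_now(local_clock: float, epsilon: float) -> TTInterval:
    # The local clock may be off by up to epsilon in either direction.
    return TTInterval(local_clock - epsilon, local_clock + epsilon)

def definitely_before(a: TTInterval, b: TTInterval) -> bool:
    # Only safe to order two events when their intervals do not overlap.
    return a.latest < b.earliest

t1 = interval_now(100.0, epsilon=2.0)   # [98, 102]
t2 = interval_now(103.0, epsilon=2.0)   # [101, 105], overlaps t1
t3 = interval_now(105.0, epsilon=2.0)   # [103, 107], clear of t1

ambiguous = not definitely_before(t1, t2) and not definitely_before(t2, t1)
ordered = definitely_before(t1, t3)
```

Comparing raw clock readings would order `t1` before `t2` with false confidence. With intervals, the system can tell that the order is uncertain and wait or resolve the conflict another way.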

Exploring the Eight Fallacies of Distributed Computing

Engineering At Scale • 1 HN point • 01 Jul 23

🕹 Technology Distributed Systems Networks Security Latency

Network reliability is not guaranteed, so build systems with resilience to handle failures.
Latency in data transmission is influenced by factors like distance and database optimization.
Consider security, system topology changes, and interoperability when designing distributed systems.

What's a distributed system?

Brick by Brick • 0 implied HN points • 05 Mar 24

🕹 Technology Distributed Systems Consistency Fault Tolerance Concurrency

A distributed system is a collection of components on multiple computers that appears as a single, unified system to users. Distributed systems are commonly used for databases and file systems.
Key characteristics of distributed systems include concurrency, scalability, fault tolerance, and decentralization, enabling efficient operation across multiple machines.
In distributed systems, concepts like fault tolerance, recovery & durability, the CAP theorem, and quorums & consensus are crucial for maintaining reliability, consistency, and coordination among nodes.

GroupBy #10: Netflix's Psyberg, Parquet format, SQL is not Designed for Analytics

VuTrinh. • 0 implied HN points • 21 Nov 23

🕹 Technology Data Engineering Distributed Systems

Netflix's Psyberg is a new way of processing data that helps manage membership information better. It uses innovative methods to make data processing more efficient.
The Parquet format is great for storing data because it organizes information in a smart way. It can improve how quickly and easily data is accessed and processed.
SQL isn't the best tool for doing analytics because it was designed a long time ago. There are newer tools that fit analytics needs much better.

Mastering Apache Spark Performance: A Data Engineer's Guide to Optimization

DataSketch’s Substack • 0 implied HN points • 14 Oct 24

🕹 Technology Data Engineering Big Data Performance Tuning Cloud Computing Distributed Systems

Properly configuring resources in Spark is really important. Make sure you adjust settings like memory and cores to fit your cluster's total resources.
Good data partitioning helps Spark job performance a lot. For example, repartitioning your data based on a relevant column can lead to faster processing times.
Using broadcast joins can save time and reduce workload. When joining smaller tables, broadcasting can make the process much quicker.
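The idea behind a broadcast join can be shown in plain Python, without Spark. In PySpark itself the usual way to request one is the `broadcast` hint from `pyspark.sql.functions`. The function below is only a model of the mechanism, with hypothetical table contents, and it assumes keys in the small table are unique.

```python
def broadcast_hash_join(large_partitions, small_table, key):
    # Build a lookup table from the small side once; in a cluster this copy
    # would be shipped ("broadcast") to every worker.
    lookup = {row[key]: row for row in small_table}
    # Each partition of the large side is joined locally, with no shuffle.
    for partition in large_partitions:
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:
                yield {**row, **match}

countries = [
    {"code": "VN", "name": "Vietnam"},
    {"code": "DE", "name": "Germany"},
]
orders = [  # two partitions of the large table
    [{"id": 1, "code": "VN"}, {"id": 2, "code": "FR"}],
    [{"id": 3, "code": "DE"}],
]

joined = list(broadcast_hash_join(orders, countries, "code"))
```

The large table never moves between partitions. Only the small table is copied, which is why the technique pays off when one side is small enough to fit in each worker's memory.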

The Future of Cloud Computing

Splattern • 0 implied HN points • 23 Dec 23

🕹 Technology Cloud Computing Innovation Startups Distributed Systems Business strategy

Big tech cloud companies like AWS, Azure, and Google Cloud don't really foster innovation. They were built on existing technology, and their focus is more on business strategies than improving their tech.
These companies have lost many of their original experienced employees. This means current workers might not have the skills needed to innovate in a fast-moving tech world.
Startups are emerging with new models that can offer better pricing and solutions for cloud computing. This could threaten the big tech clouds and change the landscape of cloud services.

Range Partitioning: Zero to One

aspiring.dev • 0 implied HN points • 17 Mar 24

🕹 Technology Databases Distributed Systems Data Structures Software Engineering Cloud Computing

Range partitioning splits data into key ranges to improve performance and scalability. This method helps databases manage heavy loads by distributing data efficiently.
Unlike hash partitioning, range partitioning allows for easier scaling. You can adjust the number of ranges as needed without the hassle of rewriting data.
While range partitioning is powerful, it can be tricky to implement and may struggle with very sequential workloads. Planning is necessary to avoid creating performance hotspots.
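A range map can be sketched with a sorted list of lower bounds and a binary search. This is a minimal illustration; `RangeMap`, the node names, and the split points are hypothetical and not taken from the post.

```python
import bisect

class RangeMap:
    def __init__(self):
        self.starts = [""]          # sorted inclusive lower bounds; "" covers all keys
        self.owners = ["node-0"]

    def owner(self, key: str) -> str:
        # Find the last range whose lower bound is <= key.
        i = bisect.bisect_right(self.starts, key) - 1
        return self.owners[i]

    def split(self, at: str, new_owner: str) -> None:
        # Add a boundary; only keys >= `at` within one range change owner.
        if at in self.starts:
            raise ValueError(f"range already starts at {at!r}")
        i = bisect.bisect_right(self.starts, at)
        self.starts.insert(i, at)
        self.owners.insert(i, new_owner)

ranges = RangeMap()
ranges.split("m", "node-1")
ranges.split("t", "node-2")

placement = [ranges.owner(k) for k in ("apple", "m", "zebra")]
# Sequential keys all land in the same (last) range: a hotspot.
hot = {ranges.owner(f"z{i:03d}") for i in range(100)}
```

Adding a range touches only the keys on one side of the new boundary, which is the scaling advantage over hash partitioning. The `hot` set shows the weakness: a strictly increasing key sequence keeps hitting a single node.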