The hottest Data Systems Substack posts right now

And their main takeaways
SeattleDataGuy’s Newsletter 847 implied HN points 14 Dec 24
  1. Working in big tech offers many advantages like better tools and a strong focus on data. This environment makes it easier to get work done quickly and efficiently.
  2. Many companies outside big tech struggle with data because it's not their main focus. They often use a mix of different tools that don't work well together, leading to confusion.
  3. Without a strong data leader, companies may find it hard to prioritize data spending. If data isn't tied to profits, it's tougher to justify investing time and money into it.
clkao@substack 99 implied HN points 26 Aug 24
  1. The move to the Bay Area was inspired by a feeling of belonging and the need for a supportive environment for their startup, Recce.
  2. Recce aims to improve the code review process for data-centric software development, addressing new challenges in correctness and testing.
  3. The writer appreciates the help from friends during the move and looks forward to sharing more about their experiences in this new chapter.
System Design Classroom 359 implied HN points 28 Apr 24
  1. The CAP theorem says a distributed system can guarantee at most two of consistency, availability, and partition tolerance at once. Since network partitions can't be ruled out in practice, the real trade-off is usually between consistency and availability.
  2. The PACELC theorem expands on CAP by also covering normal operation, when there is no partition: even then, a system must choose between lower latency and stronger consistency.
  3. Real-world examples, like a multiplayer game leaderboard, show how these trade-offs play out: you can have fast reads that may return stale scores, or consistent scores that take longer to update (sketched in code below).
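
A minimal Python sketch of that leaderboard trade-off (the Replica and Leaderboard classes are hypothetical, and the fixed replication lag is a simulation, not any real database's behavior): a fast read asks one replica and may see stale data, while a consistent read pays a coordination delay.

```python
import random
import time

class Replica:
    """One copy of the leaderboard; a write becomes visible here only
    after this replica's simulated replication lag has elapsed."""
    def __init__(self, lag_s):
        self.lag_s = lag_s
        self.scores = {}                  # player -> (score, write_time)

    def apply(self, player, score, write_time):
        self.scores[player] = (score, write_time)

    def read(self, player, now):
        score, write_time = self.scores.get(player, (0, 0.0))
        if now - write_time < self.lag_s:
            return 0                      # write not yet replicated here
        return score

class Leaderboard:
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, player, score):
        t = time.monotonic()
        for r in self.replicas:
            r.apply(player, score, t)

    def read_fast(self, player):
        """Low latency: ask any one replica, risk staleness (PACELC: EL)."""
        return random.choice(self.replicas).read(player, time.monotonic())

    def read_consistent(self, player):
        """Consistency: wait out the slowest replica first (PACELC: EC)."""
        time.sleep(max(r.lag_s for r in self.replicas))
        now = time.monotonic()
        return max(r.read(player, now) for r in self.replicas)

board = Leaderboard([Replica(0.0), Replica(0.5)])
board.write("alice", 100)
print(board.read_fast("alice"))           # may print 0 (stale) or 100
print(board.read_consistent("alice"))     # prints 100, but ~0.5 s slower
```
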
VuTrinh. 119 implied HN points 11 May 24
  1. Google File System (GFS) is designed to handle huge files and many concurrent clients. Instead of overwriting data in place, workloads mostly append new data to files.
  2. The system uses a single master server to manage file metadata, so there is one place that knows where everything is stored. Clients communicate directly with chunk servers for faster data access.
  3. GFS prioritizes reliability by storing multiple copies of data on different chunk servers. It constantly checks for errors and can quickly restore lost or corrupted data from healthy replicas.
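
A toy sketch of that architecture (the class names are invented, and real GFS pipelines data between replicas rather than having the client push each copy): the master holds only metadata, while chunk data lives on the chunk servers.

```python
import random

REPLICATION = 3              # GFS keeps 3 copies of each 64 MB chunk by default

class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.chunks = {}     # chunk_id -> bytes

class Master:
    """Metadata only: which chunks make up a file, and where each chunk lives."""
    def __init__(self, chunkservers):
        self.chunkservers = chunkservers
        self.file_chunks = {}    # filename -> [chunk_id, ...]
        self.locations = {}      # chunk_id -> [ChunkServer, ...]
        self.next_id = 0

    def allocate_chunk(self, filename):
        chunk_id, self.next_id = self.next_id, self.next_id + 1
        replicas = random.sample(self.chunkservers, REPLICATION)
        self.file_chunks.setdefault(filename, []).append(chunk_id)
        self.locations[chunk_id] = replicas
        return chunk_id, replicas

def append(master, filename, data):
    """Record append: ask the master for a chunk, then write every replica.
    Only metadata flows through the master; data goes straight to chunk servers."""
    chunk_id, replicas = master.allocate_chunk(filename)
    for server in replicas:
        server.chunks[chunk_id] = data
    return chunk_id

servers = [ChunkServer(f"cs{i}") for i in range(5)]
master = Master(servers)
cid = append(master, "/logs/events", b"event-1\n")
print(f"chunk {cid} lives on {[s.name for s in master.locations[cid]]}")
```
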
SwirlAI Newsletter 294 implied HN points 18 Mar 23
  1. Learning to decompose a data system is crucial for reasoning about and understanding large infrastructure.
  2. Decomposing a data system makes it easier to scale, identify bottlenecks, and optimize end-to-end event processing latency.
  3. The layers of a data system include data ingestion, transformation, and serving, each with specific functions and technologies (see the sketch after this list).
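
A minimal sketch of that decomposition (the event fields and store are made up): each layer is its own function, so each can be scaled, swapped, or measured independently, including end-to-end latency.

```python
import json
import time

def ingest(raw_lines):
    """Ingestion layer: parse raw input into events, stamping arrival time."""
    for line in raw_lines:
        event = json.loads(line)
        event["ingested_at"] = time.monotonic()
        yield event

def transform(events):
    """Transformation layer: clean and enrich each event."""
    for event in events:
        event["amount_usd"] = round(event["amount_cents"] / 100, 2)
        yield event

def serve(events, store):
    """Serving layer: write to the store that queries read from,
    recording per-event end-to-end latency along the way."""
    for event in events:
        store[event["id"]] = event
        latency_ms = (time.monotonic() - event["ingested_at"]) * 1e3
        print(f"event {event['id']} served in {latency_ms:.2f} ms")

store = {}
raw = ['{"id": 1, "amount_cents": 250}', '{"id": 2, "amount_cents": 990}']
serve(transform(ingest(raw)), store)   # layers compose, but stay separable
```
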
SwirlAI Newsletter 255 implied HN points 07 May 23
  1. Watermarks in Stream Processing help handle event lateness and decide when to treat data as 'late data' (see the watermark sketch after this list).
  2. In SQL Query execution, the order is FROM and JOIN, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT.
  3. To optimize SQL Queries, reduce dataset sizes for joins and use subqueries for pre-filtering.
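
A small sketch of the watermark idea from the first takeaway (the WatermarkTracker class and its policy are a simplification of what stream processors actually do): the watermark trails the largest event time seen by an allowed-lateness margin, and anything older than the watermark counts as late data.

```python
class WatermarkTracker:
    """Toy watermark: max event time observed minus an allowed lateness.
    Events whose timestamp falls below the watermark are treated as late."""
    def __init__(self, allowed_lateness_s):
        self.allowed_lateness_s = allowed_lateness_s
        self.max_event_time = float("-inf")

    def observe(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)

    @property
    def watermark(self):
        return self.max_event_time - self.allowed_lateness_s

    def is_late(self, event_time):
        return event_time < self.watermark

tracker = WatermarkTracker(allowed_lateness_s=5)
for t in [100, 103, 110]:        # event-time timestamps, in seconds
    tracker.observe(t)
print(tracker.watermark)         # 105
print(tracker.is_late(102))      # True: older than the watermark, so 'late data'
print(tracker.is_late(107))      # False: still within allowed lateness
```
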
Technology Made Simple 139 implied HN points 04 Dec 23
  1. Single Tenant Architecture provides each customer their own independent database and software instance, offering security and customization like living in a detached house.
  2. Multi-Tenant Architecture is akin to an apartment building where multiple tenants share common infrastructure, allowing for economies of scale but potentially limiting customization.
  3. Single-Tenant architecture is known for high user engagement, control, and stability, and it simplifies security and compliance; Multi-Tenant architecture favors cost efficiency and quick onboarding for better scalability (both models are sketched below).
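
One compact way to see the difference, using Python's built-in sqlite3 as a stand-in for any database (the schema and tenant names are invented): multi-tenancy shares one database and scopes every query by tenant_id, while single-tenancy isolates each customer in their own database.

```python
import sqlite3

# Multi-tenant ("apartment building"): one shared database; every row
# carries a tenant_id, and every query must filter on it.
shared = sqlite3.connect(":memory:")
shared.execute("CREATE TABLE orders (tenant_id TEXT, item TEXT)")
shared.execute("INSERT INTO orders VALUES ('acme', 'widget'), ('globex', 'gear')")

def tenant_orders(tenant_id):
    rows = shared.execute(
        "SELECT item FROM orders WHERE tenant_id = ?", (tenant_id,))
    return [item for (item,) in rows]

# Single-tenant ("detached house"): each customer gets an isolated database;
# isolation comes from the boundary itself, not from a WHERE clause.
tenant_dbs = {name: sqlite3.connect(":memory:") for name in ("acme", "globex")}
for db in tenant_dbs.values():
    db.execute("CREATE TABLE orders (item TEXT)")
tenant_dbs["acme"].execute("INSERT INTO orders VALUES ('widget')")

print(tenant_orders("acme"))                                       # ['widget']
print(tenant_dbs["acme"].execute("SELECT item FROM orders").fetchall())
```
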
Metal Machine Music by Ben Tarnoff 59 implied HN points 31 Oct 19
  1. AI ethics initiatives are aiming to establish responsible rules for AI system development but can lack democratic input from those impacted by the technology.
  2. Democratizing AI means treating decisions about values as political decisions, which requires mechanisms for collective decision-making to ensure fairness and transparency in algorithmic processes.
  3. Kristen Nygaard, a Norwegian computer scientist, was instrumental in developing object-oriented programming and also worked to empower workers in their workplaces through understanding and influencing technology.
Data Products 2 HN points 23 Jun 23
  1. The difference between OLTP and OLAP systems can cause miscommunication among data producers and consumers.
  2. OLTP systems focus on serving end users quickly with specific product features, while OLAP systems handle complex analytics by scanning large amounts of data (contrasted in the sketch below).
  3. Empathy and communication between OLTP and OLAP teams are crucial to building scalable data products.
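
The difference in query shape is easy to show. A hedged sketch with Python's sqlite3 and an invented orders table: the OLTP query fetches one row by key for a product feature, while the OLAP query scans the whole table to aggregate.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, amount REAL, day TEXT)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(i, i % 50, i * 0.5, f"2023-06-{i % 30 + 1:02d}") for i in range(1, 1001)])

# OLTP-shaped: a narrow, indexed lookup serving one user right now.
latest = db.execute(
    "SELECT * FROM orders WHERE user_id = ? ORDER BY id DESC LIMIT 1", (7,)
).fetchone()

# OLAP-shaped: a full scan aggregating everything, serving an analyst.
daily = db.execute(
    "SELECT day, SUM(amount) FROM orders GROUP BY day ORDER BY day"
).fetchall()

print(latest)      # one row, found via the primary-key/index path
print(daily[:3])   # first few rows of a table-wide aggregation
```
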
Matthew’s Substack 0 implied HN points 31 Jul 24
  1. Data Availability (DA) is crucial for ensuring that transaction data is accessible and secure, especially as blockchain technology grows. New solutions are needed to handle increased demand without high costs.
  2. There are two main types of DA solutions: Ordered DA, which includes consensus and provides stronger security, and DACs (Data Availability Committees), which focus on scalability and lower costs but offer less security.
  3. Choosing the right DA solution depends on factors like transaction value, data cost, and security needs. Different use cases, like finance or gaming, may prefer different DA features.
VuTrinh. 0 implied HN points 06 Feb 24
  1. Designing data systems requires resilience and scalability, which means they should handle growth and failures efficiently.
  2. Data modeling is more than just making diagrams; it's about understanding the entire system and how data flows within it.
  3. Using tools like DuckDB in the browser can open up new possibilities for data processing, making it more accessible and flexible.
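
The post points at DuckDB running in the browser (DuckDB-WASM); the same embedded, zero-server model is available from Python through the duckdb package, which is enough to show the idea (the query and data here are invented):

```python
import duckdb   # pip install duckdb -- an in-process database, no server needed

# Aggregate over an inline table: no setup, no connection string, no daemon.
rows = duckdb.sql("""
    SELECT status, COUNT(*) AS n
    FROM (VALUES ('ok'), ('ok'), ('error')) AS t(status)
    GROUP BY status
    ORDER BY n DESC
""").fetchall()
print(rows)     # [('ok', 2), ('error', 1)]
```
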
DataSketch’s Substack 0 implied HN points 21 Feb 24
  1. Data replication creates multiple copies of data to ensure it is always available and resilient against failures. This means if one server goes down, others can still keep running smoothly.
  2. There are different strategies for data replication, like master-slave and multi-master setups. Each has its own trade-offs, especially in how it handles read and write operations (a master-slave sketch follows this list).
  3. Monitoring and tuning your replication setup is essential. By keeping an eye on performance and any issues, businesses can make sure their data systems run efficiently and reliably.
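
A toy master-slave (primary/replica) sketch (the Node and MasterReplica classes are invented, and the copy loop is synchronous where real systems replicate asynchronously): writes go through the master, reads are served by replicas, and a version counter stands in for replication-lag monitoring.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.version = 0     # number of writes this node has applied

class MasterReplica:
    """Toy master-slave replication: the master takes all writes and ships
    them to replicas; replicas serve the reads."""
    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas

    def write(self, key, value):
        self.master.data[key] = value
        self.master.version += 1
        for r in self.replicas:          # asynchronous in real systems
            r.data[key] = value
            r.version += 1

    def read(self, key, i=0):
        return self.replicas[i].data.get(key)

    def replication_lag(self):
        """Monitoring hook: how many writes each replica trails the master by."""
        return {r.name: self.master.version - r.version for r in self.replicas}

cluster = MasterReplica(Node("master"), [Node("r1"), Node("r2")])
cluster.write("user:1", {"name": "Ada"})
print(cluster.read("user:1"))      # {'name': 'Ada'}, served by replica r1
print(cluster.replication_lag())   # {'r1': 0, 'r2': 0} in this synchronous toy
```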