The hottest data architecture Substack posts right now

And their main takeaways
davidj.substack 59 implied HN points 13 Jan 25
  1. The gold layer in a medallion architecture has drawbacks: aggregating data into fixed tables loses detail and leaves users little flexibility, so important data can be missing and changes are hard to make.
  2. A universal semantic layer is a better fit, letting users ask for the data they need in plain terms without writing complicated queries, which makes data easier and more accessible for everyone.
  3. Replacing the gold layer with a semantic layer improves efficiency and user experience: it avoids the gold layer's rigid structure and adapts to user needs on demand (a minimal sketch of the idea follows this list).
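  To make the contrast concrete, here is a minimal sketch of the semantic-layer idea, not any particular product's API: metrics and dimensions are defined once, and each request is compiled into a query on demand instead of being pre-aggregated into fixed gold tables. All table, column, and function names below are hypothetical.

    # Hypothetical metric and dimension definitions, maintained in one place.
    METRICS = {
        "revenue": "SUM(order_total)",
        "order_count": "COUNT(*)",
    }
    DIMENSIONS = {
        "order_month": "DATE_TRUNC('month', ordered_at)",
        "customer_region": "customer_region",
    }

    def compile_query(metric: str, dimension: str, table: str = "silver.orders") -> str:
        """Compile a named metric/dimension request into SQL against the silver model."""
        return (
            f"SELECT {DIMENSIONS[dimension]} AS {dimension}, "
            f"{METRICS[metric]} AS {metric} "
            f"FROM {table} GROUP BY 1"
        )

    # Consumers ask for 'revenue by customer_region' rather than writing SQL.
    print(compile_query("revenue", "customer_region"))
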
davidj.substack 179 implied HN points 25 Nov 24
  1. Medallion architecture is not just a data modeling technique but a high-level structure for organizing data processes; it helps visualize how data flows through a project.
  2. The architecture has three layers: Bronze cleans and prepares incoming data, Silver builds a structured data model, and Gold makes the data easy to access and use (illustrated in the sketch after this list).
  3. The names Bronze, Silver, and Gold appeal to non-technical audiences but describe the layers poorly; renaming them would better reflect what each layer actually does.
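  A toy pandas sketch of that division of labor, using made-up columns and following the post's framing (Bronze cleans and prepares, Silver models, Gold serves consumers):

    import pandas as pd

    # Raw input as it arrives: duplicates, everything stored as strings.
    raw = pd.DataFrame([
        {"order_id": "1", "ordered_at": "2024-01-03", "amount": "120.5", "status": "complete"},
        {"order_id": "1", "ordered_at": "2024-01-03", "amount": "120.5", "status": "complete"},
        {"order_id": "2", "ordered_at": "2024-01-04", "amount": "80.0", "status": "cancelled"},
    ])

    # Bronze: clean and prepare - drop duplicates, fix types.
    bronze = (
        raw.drop_duplicates("order_id")
           .assign(ordered_at=lambda d: pd.to_datetime(d["ordered_at"]),
                   amount=lambda d: d["amount"].astype(float))
    )

    # Silver: a structured, reusable model (completed orders with a date key).
    silver = bronze[bronze["status"] == "complete"].copy()
    silver["order_date"] = silver["ordered_at"].dt.date

    # Gold: shaped for easy consumption, e.g. revenue per day.
    gold = (
        silver.groupby("order_date", as_index=False)["amount"].sum()
              .rename(columns={"amount": "daily_revenue"})
    )
    print(gold)
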
VuTrinh. 339 implied HN points 25 May 24
  1. Twitter processes roughly 400 billion events per day, combining several technologies, including purpose-built tools, to keep up with that volume in real time.
  2. After running into limits with its old setup, Twitter moved to a new architecture that simplified operations and lets it process data faster and more efficiently.
  3. The new system delivers lower latency and fewer processing errors, giving more accurate results and better use of resources than before.
VuTrinh. 399 implied HN points 20 Apr 24
  1. Lakehouse architecture combines the strengths of data lakes and data warehouses. It aims to solve the problems that arise from keeping these two systems separate.
  2. The approach brings warehouse-style data management, including ACID transactions and efficient querying of large datasets, to data stored in the lake, enabling analytics on raw data without complex data movement.
  3. Built on table formats such as Delta Lake, a lakehouse handles both structured and unstructured data efficiently, making it a promising fit for modern data needs (a minimal Delta Lake sketch follows this list).
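  A minimal sketch of the pattern with Delta Lake (named in the summary), assuming pyspark and delta-spark are installed; the storage path and columns are made up:

    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip

    builder = (
        SparkSession.builder.appName("lakehouse-sketch")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Write raw events as a Delta table: plain files on lake storage plus a
    # transaction log that provides ACID guarantees over them.
    events = spark.createDataFrame(
        [("u1", "click", "2024-01-03"), ("u2", "view", "2024-01-03")],
        ["user_id", "event_type", "event_date"],
    )
    events.write.format("delta").mode("overwrite").save("/tmp/lakehouse/events")

    # Query it back like a warehouse table, straight from the lake files.
    table = spark.read.format("delta").load("/tmp/lakehouse/events")
    table.groupBy("event_type").count().show()
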
VuTrinh. 139 implied HN points 15 Jun 24
  1. Apache Druid is built to handle real-time analytics on large datasets, making it faster and more efficient than Hadoop for certain tasks.
  2. Druid splits work across several node types, such as real-time, historical, broker, and coordinator nodes, to ingest data, serve queries, and keep the cluster coordinated.
  3. This architecture allows quick data retrieval while maintaining high availability and performance, making Druid a strong choice for fast, interactive data exploration (see the query sketch after this list).
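  A sketch of how a client interacts with that layout: queries go to the broker (here through the router's SQL endpoint on a local test cluster), which fans the work out to historical and real-time nodes and merges the results. The endpoint address and the tutorial 'wikipedia' datasource are assumptions.

    import requests

    # Assumed: a local Druid router with the tutorial 'wikipedia' datasource loaded.
    DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

    query = """
    SELECT channel, COUNT(*) AS edits
    FROM wikipedia
    GROUP BY channel
    ORDER BY edits DESC
    LIMIT 5
    """

    # The broker/router answers the SQL request; the segment scans happen on
    # historical and real-time nodes behind it.
    resp = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=30)
    resp.raise_for_status()
    for row in resp.json():
        print(row)
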
VuTrinh. 59 implied HN points 14 May 24
  1. Netflix has a strong data engineering stack that supports both batch and streaming data pipelines. It focuses on building flexible and efficient data architectures.
  2. Atlassian has revamped its data platform to include a new deployment capability inspired by technologies like Kubernetes. This helps streamline their data management processes.
  3. Migrating off dbt Cloud offers useful lessons about data development workflows; teams should weigh the alternatives and learn from others' migration journeys.
The Orchestra Data Leadership Newsletter 79 implied HN points 17 Feb 24
  1. The choice between microservices and monolithic architectures in data impacts the tools and solutions you choose.
  2. Microservices allow for distributed infrastructure, specialization, and easier scaling in data architecture.
  3. A microservices approach assumes high interoperability between tools, workable governance, and acceptable data egress and storage costs; those assumptions need checking before committing.
Technology Made Simple 39 implied HN points 13 Feb 23
  1. Netflix uses Open Connect Appliances, caching servers placed inside ISPs' networks, to keep content close to viewers and improve streaming quality.
  2. A stateless service architecture lets any server step in if another fails, keeping the service uninterrupted.
  3. Netflix's redundancy strategy includes storing data across multiple availability zones, 'n+1' redundancy, and graceful degradation to preserve limited functionality during failures (a small illustration of graceful degradation follows this list).
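  A small illustration of graceful degradation (not Netflix's actual code, just the shape of the technique): if the primary dependency fails, serve a generic but still useful response instead of an error.

    import random

    POPULAR_FALLBACK = ["popular-1", "popular-2", "popular-3"]

    def personalized_recommendations(user_id: str) -> list[str]:
        """Primary path: depends on a downstream service that may be unavailable."""
        if random.random() < 0.3:  # simulate a failed dependency
            raise ConnectionError("recommendation service unavailable")
        return [f"title-for-{user_id}-{i}" for i in range(3)]

    def recommendations_with_degradation(user_id: str) -> list[str]:
        """Degrade gracefully: fall back to a popular list rather than failing."""
        try:
            return personalized_recommendations(user_id)
        except ConnectionError:
            return POPULAR_FALLBACK

    print(recommendations_with_degradation("u42"))
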
The Orchestra Data Leadership Newsletter 1 HN point 29 May 24
  1. Understanding the total cost of ownership is crucial when choosing between open-source and managed data architectures.
  2. Leveraging open-source software can offer cost benefits, but it also comes with risks like lack of support and high maintenance requirements.
  3. Managed data architecture tools like Rivery and Orchestra can reduce total cost of ownership, scale more easily, and simplify day-to-day data operations.
Data Products 2 HN points 23 Jun 23
  1. The difference between OLTP and OLAP systems can cause miscommunication among data producers and consumers.
  2. OLTP systems serve end users quickly with narrow, record-level reads and writes for product features, while OLAP systems answer analytical questions by scanning large amounts of data (see the contrast sketched after this list).
  3. Empathy and communication between OLTP and OLAP teams are crucial to building scalable data products.
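  The contrast in access patterns, sketched with an in-memory SQLite table and made-up columns: the OLTP path is a narrow point lookup, the OLAP path scans and aggregates the whole table.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
        [(i % 100, i * 1.5) for i in range(1, 1001)],
    )

    # OLTP-style access: fetch one specific row quickly to serve a product feature.
    one_order = conn.execute("SELECT * FROM orders WHERE id = ?", (42,)).fetchone()

    # OLAP-style access: scan everything to answer an analytical question.
    top_customers = conn.execute(
        "SELECT customer_id, SUM(amount) AS total FROM orders "
        "GROUP BY customer_id ORDER BY total DESC LIMIT 5"
    ).fetchall()

    print(one_order)
    print(top_customers)
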
The Orchestra Data Leadership Newsletter 0 implied HN points 15 Oct 23
  1. Knowing when to shift left on security is crucial to preventing data breaches and maintaining a secure network infrastructure.
  2. Re-evaluating the usefulness and uptake of self-service analytics tools can help in optimizing resources and avoiding unnecessary costs.
  3. Carefully analyzing cloud warehouse costs and data movement can lead to cost savings and efficient data management.
SUP! Hubert’s Substack 0 implied HN points 06 Mar 24
  1. The data mesh concept assigns ownership of data to the domain that produced it, which simplifies data sharing among domains.
  2. In a centralized data mesh, infrastructure and self-service tooling are centralized, which suits teams early in their data mesh journey.
  3. A peer-to-peer data mesh gives domains complete autonomy, but without a central catalog, finding data products can be challenging.