The hottest Big Data Substack posts right now

And their main takeaways
Category: Top Technology Topics
Ju Data Engineering Newsletter 396 implied HN points 28 Oct 24
  1. Improving the user interface is crucial for getting more teams to use Iceberg, especially teams that do their data work in Python.
  2. PyIceberg, the Python implementation of Iceberg, is evolving quickly and already supports a range of catalog and file-system types.
  3. While PyIceberg makes it easy to read and write data (see the sketch below), it still has limitations compared to using Iceberg with Spark, such as handling deletes and managing table metadata.
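A minimal PyIceberg sketch of that read/write path, assuming a catalog configured via `.pyiceberg.yaml` (or environment variables), PyIceberg ≥ 0.6, and a hypothetical `db.events` table:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")           # catalog settings come from .pyiceberg.yaml / env vars
table = catalog.load_table("db.events")     # hypothetical namespace.table

# Read: scan the table and materialize it as a PyArrow table.
arrow_table = table.scan().to_arrow()

# Write: append a small batch of rows (the schema must match the table's).
new_rows = pa.table({"id": [1, 2], "value": ["a", "b"]})
table.append(new_rows)
```

Deletes and heavier metadata maintenance are where the Spark integration still has the edge, as the post notes.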
Ju Data Engineering Newsletter 515 implied HN points 17 Oct 24
  1. The use of Iceberg allows for separate storage and compute, making it easier to connect single-node engines to the data pipeline without needing extra steps.
  2. There are different approaches to integrating single-node engines, including running all processes in one worker or handling each transformation with separate workers.
  3. Partitioning data improves efficiency by letting smaller chunks be processed independently, which eases memory constraints and speeds up data handling (see the sketch below).
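A sketch of that single-node pattern, assuming the same PyIceberg setup as above and a hypothetical `db.orders` table partitioned by `order_date`; the filter is pushed into the scan so only one partition's files are read before handing the chunk to a single-node engine:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("db.orders")     # hypothetical table partitioned by order_date

# Push the partition filter into the scan so only one chunk of files is read,
# then process that chunk locally (pandas here, but any single-node engine works).
df = table.scan(row_filter="order_date = '2024-10-01'").to_pandas()
daily_revenue = df.groupby("customer_id")["amount"].sum()
```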
Data People Etc. 231 implied HN points 11 Feb 25
  1. Data is more powerful when it has a purpose. It should tell a clear story; otherwise it's just clutter.
  2. Building a strong data system is like creating a world. A good structure connects different pieces and helps everyone understand the bigger picture.
  3. Data engineering is important because it helps manage and present large amounts of information, making sure everything works smoothly and accurately.
VuTrinh. 659 implied HN points 10 Sep 24
  1. Apache Spark uses a system called Catalyst to plan and optimize how data is processed. This system helps make sure that queries run as efficiently as possible.
  2. Spark 3 added a feature called Adaptive Query Execution (AQE). It lets the engine revise its plan while a query is running, based on runtime statistics about the data (see the configuration sketch below).
  3. Airbnb uses this AQE feature to improve how they handle large amounts of data. This lets them dynamically adjust the way data is processed, which leads to better performance.
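A configuration sketch for enabling AQE in PySpark; the config keys are the standard Spark 3.x settings, while the paths and the join itself are illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")                     # turn AQE on (default since Spark 3.2)
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions at runtime
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions during joins
    .getOrCreate()
)

orders = spark.read.parquet("s3://bucket/orders/")   # hypothetical inputs
users = spark.read.parquet("s3://bucket/users/")
orders.join(users, "user_id").groupBy("country").count().show()
```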
VuTrinh. 399 implied HN points 17 Sep 24
  1. Metadata is really important because it helps organize and access data efficiently. It tells systems where files are and which ones can be ignored during processing.
  2. Google's BigQuery uses a unique system to manage metadata that allows for quick access and analysis of huge datasets. Instead of putting metadata with the data, it keeps them separate but organized in a smart way.
  3. The way BigQuery handles metadata improves performance by making sure that only the relevant information is accessed when running queries. This helps save time and resources, especially with very large data sets.
VuTrinh. 279 implied HN points 14 Sep 24
  1. Uber evolved from simple data management with MySQL to a more complex system using Hadoop to handle huge amounts of data efficiently.
  2. They faced challenges with data reliability and latency, which slowed down their ability to make quick decisions.
  3. Uber introduced a system called Hudi that allowed for faster updates and better data management, helping them keep their data fresh and accurate.
VuTrinh. 799 implied HN points 10 Aug 24
  1. Apache Iceberg is a table format that helps manage data in a data lake. It makes it easier to organize files and allows users to interact with data without worrying about how it's stored.
  2. Iceberg has a three-layer architecture: data, metadata, and catalog, which work together to track and manage the actual data and its details. This structure allows for efficient querying and data operations.
  3. One standout feature of Iceberg is time travel: you can access previous versions of your data to see what changed or retrieve earlier data as needed (see the sketch below).
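A sketch of time travel via Spark SQL (Spark 3.3+ syntax); the table name, timestamp, and snapshot id are placeholders:

```python
# Assumes `spark` is a SparkSession configured with the Iceberg runtime
# and a catalog named `demo`.

# Current state of the table.
spark.sql("SELECT count(*) FROM demo.db.events").show()

# The same table as of an earlier point in time...
spark.sql(
    "SELECT count(*) FROM demo.db.events TIMESTAMP AS OF '2024-08-01 00:00:00'"
).show()

# ...or as of a specific snapshot id taken from the table's history metadata.
spark.sql("SELECT count(*) FROM demo.db.events VERSION AS OF 1234567890123456789").show()
```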
VuTrinh. 339 implied HN points 31 Aug 24
  1. Apache Iceberg organizes data into a data layer and a metadata layer, making it easier to manage large datasets. The data layer holds the actual records, while the metadata layer keeps track of those records and their changes.
  2. Iceberg's manifest files help improve read performance by storing statistics for multiple data files in one place. This means the reader can access all needed statistics without opening each individual data file.
  3. Hidden partitioning in Iceberg lets users filter data without maintaining extra partition columns, saving space. The table records transforms on existing columns instead, which streamlines queries and data management (see the sketch below).
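A hidden-partitioning DDL sketch under the same Spark-plus-Iceberg assumptions as the previous example; `days(ts)` and `bucket(16, id)` are standard Iceberg partition transforms, and the table and columns are illustrative:

```python
# Assumes `spark` is a SparkSession with the Iceberg runtime and a `demo` catalog.
spark.sql("""
    CREATE TABLE demo.db.events (
        id      BIGINT,
        ts      TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts), bucket(16, id))
""")

# Readers just filter on ts; Iceberg maps the predicate onto the hidden day
# partitions and uses manifest statistics to prune files.
spark.sql("SELECT count(*) FROM demo.db.events WHERE ts >= '2024-08-01'").show()
```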
Astral Codex Ten 16656 implied HN points 13 Feb 24
  1. Sam Altman aims for $7 trillion for AI development, highlighting the drastic increase in costs and resources needed for each new generation of AI models.
  2. The cost of AI models like GPT-6 could potentially be a hindrance to their creation, but the promise of significant innovation and industry revolution may justify the investments.
  3. The approach to funding and scaling AI development can impact the pace of progress and the safety considerations surrounding the advancement of artificial intelligence.
VuTrinh. 279 implied HN points 17 Aug 24
  1. Facebook's real-time data processing system needs to handle huge amounts of data quickly, with only a few seconds of wait time. This helps in keeping things running smoothly for users.
  2. Their system uses a message bus called Scribe to connect different parts, making it easier to manage data flow and recover from errors. This setup improves how they deal with issues when they arise.
  3. Different tools like Puma and Stylus allow developers to build applications in different ways, depending on their needs. This means teams can quickly create and improve their applications over time.
VuTrinh. 219 implied HN points 02 Jul 24
  1. PayPal operates a massive Kafka system with over 85 clusters and handles around 1.3 trillion messages daily. They manage data growth by using multiple geographical data centers for efficiency.
  2. To improve user experience and security, PayPal developed tools like the Kafka Config Service for easier broker management and added access control lists to restrict who can connect to their Kafka clusters.
  3. PayPal focuses on automation and monitoring, implementing systems to quickly patch vulnerabilities and manage topics, while also optimizing metrics to quickly identify issues with their Kafka platform.
davidj.substack 179 implied HN points 25 Nov 24
  1. Medallion architecture is not just about data modeling but represents a high-level structure for organizing data processes. It helps in visualizing data flow in a project.
  2. The architecture has three main layers: Bronze deals with cleaning and preparing data, Silver creates a structured data model, and Gold makes data easy to access and use (a minimal sketch of the three layers follows below).
  3. The terms Bronze, Silver, and Gold may sound appealing to non-technical users, but renaming the layers to describe what they actually do would reflect their roles in data handling more accurately.
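A minimal sketch of the three layers as plain functions, with pandas standing in for whatever engine a real project would use; all names and the toy transformations are illustrative:

```python
import pandas as pd

def bronze(raw_path: str) -> pd.DataFrame:
    """Land the raw data mostly as-is, with only light cleanup."""
    df = pd.read_json(raw_path, lines=True)
    return df.drop_duplicates()

def silver(bronze_df: pd.DataFrame) -> pd.DataFrame:
    """Conform types and shape the cleaned data into a modeled table."""
    bronze_df = bronze_df.copy()
    bronze_df["event_time"] = pd.to_datetime(bronze_df["event_time"])
    return bronze_df[["user_id", "event_time", "event_type"]]

def gold(silver_df: pd.DataFrame) -> pd.DataFrame:
    """Publish an analysis-ready aggregate for end users."""
    return (silver_df.groupby(["user_id", "event_type"])
                     .size()
                     .reset_index(name="events"))
```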
VuTrinh. 339 implied HN points 25 May 24
  1. Twitter processes an incredible 400 billion events daily, using a mix of technologies for handling large data flows. They built special tools to ensure they can keep up with all this information in real-time.
  2. After facing challenges with their old setup, Twitter switched to a new architecture that simplified operations. This new system allows them to handle data much faster and more efficiently.
  3. With the new system, Twitter achieved lower latency and fewer errors in data processing. This means they can get more accurate results and better manage their resources than before.
VuTrinh. 659 implied HN points 23 Mar 24
  1. Uber handles huge amounts of data by processing real-time information from drivers, riders, and restaurants. This helps them make quick decisions, like adjusting prices based on demand.
  2. They use a mix of open-source tools like Apache Kafka for data streaming and Apache Flink for processing, which allow them to scale their operations smoothly as the business grows.
  3. Uber values data consistency, high availability, and quick response times in their infrastructure. This means they need reliable systems that work well even when they're overloaded with data.
VuTrinh. 139 implied HN points 09 Jul 24
  1. Uber recently introduced Kafka Tiered Storage, which allows storage and compute resources to work separately. This means you can add storage without needing to upgrade processing power.
  2. The tiered storage system has two parts: local storage for fast access and remote storage for long-term data. This setup helps manage data efficiently and keeps the local storage less cluttered.
  3. Older data is read directly from remote storage when needed, while the lean local tier keeps applications that consume recent messages fast (see the config sketch below).
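The post covers Uber's in-house implementation; the equivalent knobs in open-source Kafka (KIP-405, available since Kafka 3.6) look like the sketch below, written against confluent-kafka's AdminClient. It assumes the brokers already run with `remote.log.storage.system.enable=true` and a remote storage plugin; the topic name and retention values are illustrative:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "payments.events",                      # hypothetical topic
    num_partitions=12,
    replication_factor=3,
    config={
        "remote.storage.enable": "true",    # tier closed segments to remote storage
        "local.retention.ms": "86400000",   # keep ~1 day on broker disks
        "retention.ms": "2592000000",       # keep ~30 days in total (local + remote)
    },
)
futures = admin.create_topics([topic])      # returns futures; call .result() to confirm creation
```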
VuTrinh. 159 implied HN points 22 Jun 24
  1. Uber uses a Remote Shuffle Service (RSS) to handle large amounts of Spark shuffle data more efficiently. This means data is sent to a remote server instead of being saved on local disks during processing.
  2. By changing how data is transferred, the new system helps reduce failures and improve the lifespan of hardware. Now, servers can handle more jobs without crashing and SSDs last longer.
  3. RSS also streamlines the process for the reduce tasks, as they now only need to pull data from one server instead of multiple ones. This saves time and resources, making everything run smoother.
VuTrinh. 259 implied HN points 18 May 24
  1. Hadoop Distributed File System (HDFS) is great for managing large amounts of data across many servers. It ensures data is stored reliably and can be accessed quickly.
  2. HDFS uses a NameNode that keeps track of where data is stored and multiple DataNodes that hold actual data copies. This design helps with data management and availability.
  3. Replication is key in HDFS, as it keeps multiple copies of data across different nodes to prevent loss. This makes HDFS robust even if some servers fail.
VuTrinh. 139 implied HN points 15 Jun 24
  1. Apache Druid is built to handle real-time analytics on large datasets, making it faster and more efficient than Hadoop for certain tasks.
  2. Druid uses a variety of node types—like real-time, historical, broker, and coordinator nodes—to manage data, process queries, and ensure everything runs smoothly.
  3. The architecture allows for quick data retrieval while maintaining high availability and performance, making it a strong choice for applications that need fast, interactive data exploration.
Data Science Weekly Newsletter 139 implied HN points 03 May 24
  1. Reusing data analysis work can save time and help teams focus on building new capabilities instead of just repeating old ones.
  2. Open-source models can be a better choice than proprietary ones for developing AI applications, making them cheaper and faster.
  3. Causal machine learning helps predict treatment outcomes by personalizing clinical decisions based on individual patient data.
SwirlAI Newsletter 412 implied HN points 18 Jun 23
  1. Vector Databases are essential for working with Vector Embeddings in Machine Learning applications.
  2. Partitioning and Bucketing are important concepts in Spark for efficient data storage and processing.
  3. Vector Databases have various real-life applications, from natural language processing to recommendation systems.
SwirlAI Newsletter 373 implied HN points 15 Apr 23
  1. Partitioning and bucketing are two key data distribution techniques in Spark.
  2. Partitioning improves performance by letting queries skip reading the entire dataset when only part of it is needed.
  3. Bucketing is useful for co-locating data and avoiding shuffles in operations like joins and groupBys (see the sketch below).
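A PySpark sketch contrasting the two techniques; paths, table names, and columns are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-vs-bucket").getOrCreate()
df = spark.read.parquet("s3://bucket/raw/events/")   # hypothetical input

# Partitioning: one directory per event_date, so filters on event_date
# skip whole directories instead of scanning the full dataset.
df.write.mode("overwrite").partitionBy("event_date").parquet("s3://bucket/events_partitioned/")

# Bucketing: pre-hash rows by user_id into a fixed number of buckets so later
# joins/groupBys on user_id can avoid a full shuffle. Bucketed output must be
# written as a table.
(df.write
   .mode("overwrite")
   .bucketBy(64, "user_id")
   .sortBy("user_id")
   .saveAsTable("events_bucketed"))
```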
VuTrinh. 99 implied HN points 06 Apr 24
  1. Databricks created the Photon engine to simplify data management by combining the benefits of data lakes and data warehouses into one system called the Lakehouse. This makes it easier and cheaper for companies to manage their data all in one place.
  2. Photon is designed to handle various types of raw data and is built as a native, vectorized engine rather than relying on Spark's traditional JVM-based execution. This lets it work through different kinds of data faster without getting bogged down.
  3. To ensure that existing customers using Apache Spark can easily switch to Photon, the new engine is integrated with Spark’s system. This allows users to continue running their current processes while benefiting from the speed of Photon.
Do Not Research 279 implied HN points 06 Nov 23
  1. Data centers are often like religious monuments, housing IT infrastructure and managing vast amounts of data that power modern life.
  2. Big data is considered almost mythical, with beliefs and values attributed to its insights and power, leading to comparisons with religion.
  3. Data centers have significant ecological impacts, consuming vast amounts of electricity and resources, leading to concerns over energy waste and pollution, with proposals for lunar data centers creating new environmental challenges.
SwirlAI Newsletter 314 implied HN points 06 Aug 23
  1. Choose the right file format for data storage in Spark; columnar formats like Parquet or ORC suit OLAP use cases.
  2. Understand and utilize encoding techniques like Run Length Encoding and Dictionary Encoding in Parquet for efficient data storage.
  3. Optimize Spark executor memory allocation and maximize the number of executors to improve application performance (a configuration sketch follows below).
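A configuration sketch of those suggestions in PySpark; the memory sizes and executor counts are illustrative, not tuned recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-tuning")
    .config("spark.executor.memory", "8g")      # memory per executor
    .config("spark.executor.cores", "4")        # cores per executor
    .config("spark.executor.instances", "10")   # number of executors
    .getOrCreate()
)

df = spark.read.json("s3://bucket/raw/clicks/")  # hypothetical row-oriented input

# Columnar formats like Parquet apply run-length and dictionary encoding to
# low-cardinality columns and support column pruning for OLAP-style reads.
df.write.mode("overwrite").parquet("s3://bucket/clicks_parquet/")
```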
VuTrinh. 159 implied HN points 20 Jan 24
  1. BigQuery brought SQL back after Google had largely moved away from it, making data analysis fast and easy: users can analyze huge datasets quickly without complex coding.
  2. It separates storage and compute resources, allowing for better performance and flexibility. This means you can scale them independently, which is very efficient.
  3. Dremel's serverless architecture means there are no servers to manage: you just write SQL, and everything else is handled automatically (see the sketch below).
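A sketch of that workflow with the google-cloud-bigquery client: no cluster to provision, just SQL against a public sample dataset (default GCP credentials and project are assumed):

```python
from google.cloud import bigquery

client = bigquery.Client()   # picks up default GCP credentials and project

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
for row in client.query(sql).result():   # Dremel plans, schedules, and executes the query
    print(row.name, row.total)
```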
Data Science Weekly Newsletter 279 implied HN points 11 Aug 23
  1. Large Language Models (LLMs) can take over some data tasks, but they won't replace all data jobs. Many tasks still need human insight and specialized skills.
  2. Understanding machine learning theory takes a long time, but in the industry, practical implementation is often more important. It's crucial to balance theory and hands-on skills.
  3. The new field of mechanistic interpretability is growing. Researchers are looking at how models learn and generalize, aiming to make sense of how AI works.
VuTrinh. 79 implied HN points 16 Mar 24
  1. Amazon Redshift is designed as a massively parallel processing data warehouse in the cloud, making it effective for handling large data sets efficiently. It changes how data is stored and queried compared to traditional systems.
  2. The system uses a unique compilation service that generates specific code for queries, which helps speed up processing by caching compiled code. This means Redshift can reuse code for similar queries, reducing wait times.
  3. Redshift also uses machine learning techniques to optimize operations, such as predicting resource needs and automatically adjusting performance settings. This allows it to scale effectively and maintain high performance during heavy workloads.
VuTrinh. 119 implied HN points 06 Jan 24
  1. BigQuery uses a processing engine called Dremel, which takes inspiration from how MapReduce handles data. It improves how data is shuffled between workers for faster processing.
  2. Traditional approaches have issues like resource fragmentation and unpredictable scaling when dealing with huge data. Dremel solves this by managing shuffle storage separately from the worker, which helps in scaling and resource management.
  3. By separating the shuffle layer, Dremel reduces latency, improves fault tolerance, and allows for more flexible worker allocation during execution. This makes it easier to handle larger data sets efficiently.
VuTrinh. 79 implied HN points 24 Feb 24
  1. BigQuery processes SQL queries by planning, optimizing, and executing them. It starts by validating the query and creating an efficient execution plan.
  2. The query execution uses a dynamic tree structure that adjusts based on data characteristics. This helps to manage different types of queries more effectively.
  3. Key components of BigQuery include the Query Master for planning, the Scheduler for assigning resources, and Worker Shards that carry out the actual computations.
VuTrinh. 59 implied HN points 26 Mar 24
  1. Tableflow allows you to easily turn Apache Kafka topics into Iceberg tables, which could change how streaming data is managed.
  2. Kafka's new tiered storage feature helps separate compute and storage, making it easier to manage resources and keep systems running smoothly.
  3. Data governance is important but can be lackluster if it doesn't show clear business benefits, making us rethink its role in today's data landscape.
VuTrinh. 39 implied HN points 27 Apr 24
  1. Google Cloud Dataflow is a service that helps process both streaming and batch data. It aims to ensure correct results quickly and cost-effectively, useful for businesses needing real-time insights.
  2. The Dataflow model separates the logical data processing from the engine that runs it. This allows users to choose how they want to process their data while still using the same fundamental tools.
  3. Windowing and triggers are important features of the Dataflow model. They control how data is grouped over event time and when results are emitted, allowing events that arrive at different times to be handled gracefully (see the Beam sketch below).
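A sketch of windowing and triggers in the Apache Beam Python SDK, the open-source expression of the Dataflow model; the Pub/Sub topic, window size, lateness bound, and trigger choices are illustrative:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "Read"   >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Parse"  >> beam.Map(lambda msg: (msg.decode("utf-8"), 1))
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                      # 1-minute event-time windows
            trigger=trigger.AfterWatermark(               # fire when the watermark passes the window...
                late=trigger.AfterCount(1)),              # ...and again for each late element
            allowed_lateness=600,                         # accept data up to 10 minutes late
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
        )
        | "Count"  >> beam.CombinePerKey(sum)
        | "Print"  >> beam.Map(print)
    )
```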
Data Science Weekly Newsletter 219 implied HN points 09 Jun 23
  1. Data modeling in data science is complex and often messy, making it hard to get reliable answers. This issue highlights the need for better practices and understanding in this area.
  2. There are ongoing discussions about the realities of working in data science. Sharing these experiences can help others prepare for the challenges they may face.
  3. Generative AI is a big topic right now, and there are frameworks being developed to help organizations strategize its use effectively. Exploring these can guide businesses in adopting AI responsibly.
Data Science Weekly Newsletter 279 implied HN points 02 Feb 23
  1. The newsletter is now hosted on Substack and remains free for everyone. A paid option is available for more features and interactions.
  2. Data teams need to build trust with stakeholders to effectively measure their value and justify their budgets. Having good relationships is more important than just metrics.
  3. Understanding MLOps is crucial for the industry. It involves not only the tools but also the culture and practices around machine learning operations.
Data Science Weekly Newsletter 239 implied HN points 09 Feb 23
  1. Big Data is changing, and it's not as big a deal as we thought. Hardware is getting better faster than data sizes are growing.
  2. Research in AI can be learned just like a sport. It's about practicing skills like designing experiments and writing papers.
  3. Data Analytics can really help businesses understand their performance and make smarter decisions. It’s all about using data to solve problems and anticipate future issues.
VuTrinh. 59 implied HN points 13 Jan 24
  1. BigQuery uses a method called definition and repetition levels to store nested and repeated data efficiently. This allows specific parts of a record to be read without touching other, related data.
  2. In columnar storage, data is organized by column, which can improve performance for analytical queries because only the needed columns are loaded (see the sketch below).
  3. Using this method might increase file sizes due to redundancy, but it helps reduce the input/output operations needed when accessing nested fields.
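BigQuery's own storage format isn't open, but Parquet uses the same Dremel-style definition/repetition levels for nested data, so the column-pruning benefit is easy to see with PyArrow; the file and column names are illustrative:

```python
import pyarrow.parquet as pq

# Only the listed columns' pages are read from the file; every other column,
# including large nested ones, is never touched.
events = pq.read_table("events.parquet", columns=["user_id", "event_type"])
print(events.schema)
```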
Natto Thoughts 99 implied HN points 12 May 23
  1. Qihoo 360 is developing an AI tool called 360GPT that could potentially enhance China's cyber defense capabilities.
  2. Zhou Hongyi, the founder of Qihoo 360, is actively embracing AI technology to strengthen cybersecurity in China and prepare for cyber warfare.
  3. There are tensions between the US and China in the cyber realm, with Qihoo 360 openly calling out US hacking activities and emphasizing the need for national preparedness in cyber warfare.
Practical Data Engineering Substack 2 HN points 15 Aug 24
  1. Open Table Formats have changed how we store and manage data, making it easier to work with different systems and tools without being locked into one software.
  2. The transition from traditional databases to open table formats has increased flexibility and allowed for better collaboration across various platforms, especially in data lakes.
  3. Despite their advantages, old formats like Hive still face issues like slow performance and over-partitioning, which can make data management challenging as companies grow.