SUP! Hubert’s Substack

SUP! Hubert’s Substack focuses on the evolving realm of real-time data solutions, exploring technologies like streaming databases, Apache Pinot for big data challenges, and advanced analytics techniques. It delves into stream processing platforms, vector databases, data lineage, near-zero ETL in data meshes, and the distinction between various data planes. The substack blends technical insights with industry trends to guide data engineers through the complexities of modern data management and analytics.

Real-Time Data Solutions · Streaming Databases · Big Data Analytics · Stream Processing Platforms · Vector Databases · Data Lineage · Data Mesh Architectures · Materialized Views · Real-Time Analytics · Data Management Trends

The hottest Substack posts of SUP! Hubert’s Substack

And their main takeaways
71 implied HN points 05 Jan 24
  1. Taking on the 1 Billion Row Challenge with Apache Pinot means ingesting the data and analyzing it with SQL.
  2. Ingesting a billion records into Pinot and running aggregations over them can yield sub-second results.
  3. Pinot's star-tree index can further optimize analytical queries for sub-second latency.
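The challenge's aggregation is a single GROUP BY: min/mean/max temperature per weather station. A minimal Python sketch of what that query computes (in Pinot the same result comes from one SQL statement over the ingested table; the sample rows here are illustrative):

```python
from collections import defaultdict

# Toy version of the 1 Billion Row Challenge aggregation: for each
# station, compute (min, mean, max) temperature over all readings.
rows = [
    ("Hamburg", 12.0), ("Bulawayo", 8.9), ("Hamburg", 34.2),
    ("Palembang", 38.8), ("Hamburg", 23.1), ("Bulawayo", 13.5),
]

def aggregate(rows):
    acc = defaultdict(list)
    for station, temp in rows:
        acc[station].append(temp)
    return {
        s: (min(t), round(sum(t) / len(t), 1), max(t))
        for s, t in sorted(acc.items())
    }

print(aggregate(rows))
# {'Bulawayo': (8.9, 11.2, 13.5), 'Hamburg': (12.0, 23.1, 34.2), 'Palembang': (38.8, 38.8, 38.8)}
```

A star-tree index pre-aggregates along chosen dimensions so Pinot can answer such queries without scanning every row.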
37 HN points 31 Jan 24
  1. Vectors are arrays of numbers representing unstructured data like text or images.
  2. Vector databases are designed to efficiently handle and store vector data.
  3. pg_vector is an open-source extension that adds vector database capabilities to Postgres.
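The core primitive behind any vector database is a similarity metric over embeddings plus a nearest-neighbour lookup. A self-contained sketch of that idea (pg_vector exposes cosine distance as an operator; the brute-force scan and toy vectors here are stand-ins for a real index):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query, vectors):
    """Return the id of the stored vector most similar to the query."""
    return max(vectors, key=lambda vid: cosine_similarity(query, vectors[vid]))

docs = {"cat": [1.0, 0.2, 0.0], "car": [0.0, 0.9, 0.4]}
print(nearest([0.9, 0.1, 0.0], docs))  # "cat" is the closest match
```

Real vector databases replace the linear scan with approximate indexes (e.g. HNSW or IVF) so lookups stay fast at millions of vectors.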
122 implied HN points 27 Apr 23
  1. The real-time ecosystem is rapidly expanding with various solutions.
  2. Different members are finding their specific roles within real-time solutions.
  3. This post provides an overview of open source and vendor real-time solutions in an end-to-end analytical use case.
50 implied HN points 15 May 23
  1. Real-time streaming platforms are distributed publish-subscribe systems for events.
  2. The real-time ecosystem is rapidly expanding with various solutions catering to different needs.
  3. Understanding the difference between streaming platforms and stream processing platforms is crucial.
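The distinction can be sketched in a few lines: a streaming platform is the durable log producers publish to and consumers subscribe from, while a stream processor is the component that computes over that log. A toy illustration with a stdlib queue standing in for the topic:

```python
import queue

# The "streaming platform" side: a pub/sub log of events.
log = queue.Queue()
for event in [3, 1, 4, 1, 5]:
    log.put(event)        # producer publishes to the topic
log.put(None)             # end-of-stream sentinel, for this sketch only

# The "stream processing" side: a continuous computation over the stream.
running_sum = 0
while (event := log.get()) is not None:
    running_sum += event
print(running_sum)  # 14
```

In practice the platform (Kafka, Pulsar, Redpanda) persists and replays events, and the processor (Flink, Kafka Streams) handles state, windows, and fault tolerance.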
20 implied HN points 03 Oct 23
  1. The Current/Kafka Summit reflects the rapid evolution of the real-time and streaming space each year.
  2. Over 2,100 attendees were physically present at Current 23, with more than 3,500 attending virtually.
  3. The ecosystem around Current 23 is expanding, welcoming more products, solutions, and vendors.
40 implied HN points 18 Mar 23
  1. Data lineage is important for understanding the origins and processing of data products.
  2. Data lineage is crucial for providing assurance that data is properly cleansed, enriched, and secured.
  3. Building a complete picture of data lineage from source to sink often requires stitching together multiple lineage graphs.
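Stitching lineage graphs amounts to merging per-tool edge lists and walking the combined graph back from a sink to its sources. A hypothetical sketch (the node names and one-parent-per-node assumption are illustrative):

```python
# Each tool emits its own lineage fragment as (upstream, downstream) edges.
ingest_lineage = [("orders_db", "kafka.orders")]
warehouse_lineage = [("kafka.orders", "staging.orders"),
                     ("staging.orders", "mart.revenue")]

def stitch(*graphs):
    """Merge per-tool lineage fragments into one edge list."""
    return [edge for graph in graphs for edge in graph]

def upstream_of(node, edges):
    """All transitive sources feeding `node` (assumes one parent per node)."""
    parents = {dst: src for src, dst in edges}
    chain = []
    while node in parents:
        node = parents[node]
        chain.append(node)
    return chain

edges = stitch(ingest_lineage, warehouse_lineage)
print(upstream_of("mart.revenue", edges))
# ['staging.orders', 'kafka.orders', 'orders_db']
```

Standards like OpenLineage exist precisely so these fragments share a common edge format and can be merged this way.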
30 implied HN points 23 Jun 23
  1. This post is Part 5 of the series on the real-time streaming ecosystem.
  2. It covers serving real-time data to end users.
  3. The series walks through an end-to-end analytical use case.
20 implied HN points 21 Aug 23
  1. The concept of 'data divide' distinguishes operational data plane from analytical data plane.
  2. Zhamak Dehghani describes these architectural data planes in the dynamic data management landscape.
  3. Understanding the distinction between operational and analytical data planes is key in today's data management.
30 implied HN points 01 Jun 23
  1. This post is part of a series on the real-time streaming ecosystem.
  2. The post covers stream processors in the real-time use case.
  3. There is a recommendation to read previous posts to better understand the topic.
40 implied HN points 08 Feb 23
  1. Confluent needs to redefine itself with a new identity that includes Flink while staying connected to Kafka.
  2. The title of being 'the Flink company' is now open for others to seize.
  3. Competitors should consider forming partnerships to challenge Confluent in the market.
30 implied HN points 08 May 23
  1. The post covers connectors, CDC, ELT, and rETL solutions.
  2. It is part 2 of a series on the real-time streaming ecosystem.
  3. The real-time ecosystem is rapidly growing, with many tools available in the market.
30 implied HN points 13 Feb 23
  1. Zero-ETL integrates transactional and analytical services for real-time analytics.
  2. ETL extracts data, transforms it, then loads it into an analytical database.
  3. ELT extracts and loads the data first, then transforms it inside the OLAP database using SQL.
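The ELT pattern can be shown end to end with the stdlib `sqlite3` standing in for the OLAP store (table and column names here are made up for illustration): raw rows land unchanged, then SQL inside the database does the transformation.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (customer TEXT, amount_cents INTEGER)")

# Extract + Load: raw data lands as-is, no pre-transformation.
raw = [("alice", 1250), ("bob", 300), ("alice", 450)]
con.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw)

# Transform: SQL inside the analytical store produces the modeled table.
con.execute("""
    CREATE TABLE order_totals AS
    SELECT customer, SUM(amount_cents) / 100.0 AS total_dollars
    FROM raw_orders GROUP BY customer
""")
print(con.execute("SELECT * FROM order_totals ORDER BY customer").fetchall())
# [('alice', 17.0), ('bob', 3.0)]
```

Classic ETL would run the `SUM`/unit conversion in a separate engine before loading; zero-ETL aims to remove the pipeline entirely by replicating the transactional store into the analytical one.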
1 HN point 04 Mar 24
  1. RAG (Retrieval-Augmented Generation) improves large language model responses by grounding the model's answers in retrieved supporting documents.
  2. For real-time applications like AI chatbots using RAG, ensuring the freshness and accuracy of the data supplied to the models through continuous updates is crucial.
  3. Utilizing vector indexes in platforms like Apache Pinot can help optimize similarity searches for tasks like finding relevant documents to enhance AI responses.
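The retrieval step of RAG reduces to: embed the question, find the most similar stored document, and prepend it to the prompt. A minimal sketch; real systems use a learned embedding model and a vector index (in Pinot, pg_vector, etc.), so the keyword-overlap "similarity" and the two documents below are purely illustrative:

```python
docs = {
    "pinot": "Apache Pinot supports vector indexes for similarity search.",
    "kafka": "Apache Kafka is a distributed event streaming platform.",
}

def score(question, doc):
    # Toy similarity: count shared lowercase words (stand-in for
    # cosine similarity between real embeddings).
    return len(set(question.lower().split()) & set(doc.lower().split()))

def retrieve(question):
    """Return the stored document most similar to the question."""
    return max(docs.values(), key=lambda d: score(question, d))

def build_prompt(question):
    # Augment the prompt with retrieved context before calling the LLM.
    return f"Context: {retrieve(question)}\nQuestion: {question}"

print(build_prompt("Does Pinot support vector similarity search?"))
```

Keeping `docs` continuously updated is the "freshness" point in the takeaways: the LLM only sees what retrieval hands it.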
20 implied HN points 13 Apr 23
  1. Materialized views enable real-time analytics.
  2. Real-time analytics involve processing information as events occur.
  3. Consider subscribing to learn more about materialized views and analytics.
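A materialized view precomputes a query result and refreshes it as events arrive, so reads never re-scan history. A sketch of the incremental-maintenance idea (the per-user click count is a made-up example):

```python
# Materialized result of "SELECT user, COUNT(*) ... GROUP BY user",
# maintained incrementally instead of recomputed per read.
view = {}

def on_event(user):
    """Refresh the view as each event occurs; reads stay O(1)."""
    view[user] = view.get(user, 0) + 1

for user in ["a", "b", "a", "a"]:
    on_event(user)

print(view)  # {'a': 3, 'b': 1}
```

Streaming databases apply this same idea to arbitrary SQL: the view's result is updated per incoming event rather than on demand.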
20 implied HN points 07 Mar 23
  1. Messaging brokers like Apache Kafka require smarter clients
  2. HTAP databases support both OLTP and OLAP workloads with no ETL
  3. Absorbing technologies leads to the trend of smarter data stores
1 HN point 20 Dec 23
  1. One Big Table (OBT) stores all data in a single table, making it easy to manage and query.
  2. Star Schema in data modeling involves identifying core domain entities and their relationships.
  3. Operational storage can keep dimensions in an OLTP database and event data in an event store for quicker access.
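The two modelling styles can be contrasted with stdlib `sqlite3` (table names and sample data are invented for illustration): the star schema joins a fact table to dimension tables at query time, while OBT denormalizes everything into one wide table up front.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Star schema: fact table referencing a dimension table.
    CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, qty INTEGER);
    INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO fact_sales VALUES (1, 3), (2, 5), (1, 2);

    -- OBT: the same data flattened into a single table, no join needed.
    CREATE TABLE obt_sales AS
    SELECT p.name AS product, f.qty
    FROM fact_sales f JOIN dim_product p ON p.id = f.product_id;
""")

star = con.execute("""
    SELECT p.name, SUM(f.qty) FROM fact_sales f
    JOIN dim_product p ON p.id = f.product_id
    GROUP BY p.name ORDER BY p.name
""").fetchall()
obt = con.execute(
    "SELECT product, SUM(qty) FROM obt_sales GROUP BY product ORDER BY product"
).fetchall()
print(star == obt, star)  # both forms yield the same totals
```

The trade-off: OBT queries skip joins but duplicate dimension values; the star schema stays normalized but pays join cost at read time.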
4 HN points 19 Feb 23
  1. Stream Processing, Real-time OLAP, and Streaming Database are key concepts in handling real-time data.
  2. Choosing the right tool for the job is crucial in navigating the various products and vendors in the real-time data space.
  3. Interest in real-time, streaming data has sparked many questions about processing and serving real-time data efficiently.
0 implied HN points 06 Mar 24
  1. Data mesh concept involves reassigning data ownership to the domain that captured the data, simplifying data sharing among domains.
  2. In a centralized data mesh, infrastructure and self-services are centralized, making it suitable for teams early in their data mesh journey.
  3. Peer-To-Peer Data Mesh provides complete autonomy to domains, but finding data products without a centralized location can be challenging.