SUP! Hubert’s Substack

SUP! Hubert’s Substack focuses on the evolving realm of real-time data solutions, exploring technologies like streaming databases, Apache Pinot for big data challenges, and advanced analytics techniques. It delves into stream processing platforms, vector databases, data lineage, near-zero ETL in data meshes, and the distinction between various data planes. The substack blends technical insights with industry trends to guide data engineers through the complexities of modern data management and analytics.

Real-Time Data Solutions · Streaming Databases · Big Data Analytics · Stream Processing Platforms · Vector Databases · Data Lineage · Data Mesh Architectures · Materialized Views · Real-Time Analytics · Data Management Trends

The hottest Substack posts of SUP! Hubert’s Substack

And their main takeaways
71 implied HN points 05 Jan 24
  1. Taking on the 1 Billion Row Challenge with Apache Pinot means ingesting the data and analyzing it with SQL.
  2. Ingesting a billion records into Pinot and running aggregations over them can yield sub-second results.
  3. Pinot's star-tree index can further optimize analytical queries for sub-second latency.
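The challenge's aggregation is a single GROUP BY: min/mean/max temperature per weather station. A minimal Python sketch of what that query computes (in Pinot the same result comes from one SQL statement over the ingested table; the sample rows here are illustrative):

```python
from collections import defaultdict

# Toy version of the 1 Billion Row Challenge aggregation: for each
# station, compute (min, mean, max) temperature over all readings.
rows = [
    ("Hamburg", 12.0), ("Bulawayo", 8.9), ("Hamburg", 34.2),
    ("Palembang", 38.8), ("Hamburg", 23.1), ("Bulawayo", 13.5),
]

def aggregate(rows):
    acc = defaultdict(list)
    for station, temp in rows:
        acc[station].append(temp)
    return {
        s: (min(t), round(sum(t) / len(t), 1), max(t))
        for s, t in sorted(acc.items())
    }

print(aggregate(rows))
# {'Bulawayo': (8.9, 11.2, 13.5), 'Hamburg': (12.0, 23.1, 34.2), 'Palembang': (38.8, 38.8, 38.8)}
```

A star-tree index pre-aggregates along chosen dimensions so Pinot can answer such queries without scanning every row.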
37 HN points 31 Jan 24
  1. Vectors are arrays of numbers representing unstructured data like text or images.
  2. Vector databases are designed to efficiently handle and store vector data.
  3. pg_vector is an open-source extension that adds vector database capabilities to Postgres.
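The core primitive behind any vector database is a similarity metric over embeddings plus a nearest-neighbour lookup. A self-contained sketch of that idea (pg_vector exposes cosine distance as an operator; the brute-force scan and toy vectors here are stand-ins for a real index):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query, vectors):
    """Return the id of the stored vector most similar to the query."""
    return max(vectors, key=lambda vid: cosine_similarity(query, vectors[vid]))

docs = {"cat": [1.0, 0.2, 0.0], "car": [0.0, 0.9, 0.4]}
print(nearest([0.9, 0.1, 0.0], docs))  # "cat" is the closest match
```

Real vector databases replace the linear scan with approximate indexes (e.g. HNSW or IVF) so lookups stay fast at millions of vectors.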
122 implied HN points 27 Apr 23
  1. The real-time ecosystem is rapidly expanding with various solutions.
  2. Different members are finding their specific roles within real-time solutions.
  3. This post provides an overview of open source and vendor real-time solutions in an end-to-end analytical use case.
50 implied HN points 15 May 23
  1. Real-time streaming platforms are distributed publish-subscribe systems for events.
  2. The real-time ecosystem is rapidly expanding with various solutions catering to different needs.
  3. Understanding the difference between streaming platforms and stream processing platforms is crucial.
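The distinction can be sketched in a few lines: a streaming platform is the durable log producers publish to and consumers subscribe from, while a stream processor is the component that computes over that log. A toy illustration with a stdlib queue standing in for the topic:

```python
import queue

# The "streaming platform" side: a pub/sub log of events.
log = queue.Queue()
for event in [3, 1, 4, 1, 5]:
    log.put(event)        # producer publishes to the topic
log.put(None)             # end-of-stream sentinel, for this sketch only

# The "stream processing" side: a continuous computation over the stream.
running_sum = 0
while (event := log.get()) is not None:
    running_sum += event
print(running_sum)  # 14
```

In practice the platform (Kafka, Pulsar, Redpanda) persists and replays events, and the processor (Flink, Kafka Streams) handles state, windows, and fault tolerance.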
20 implied HN points 03 Oct 23
  1. The Current/Kafka Summit reflects the rapid evolution of the real-time and streaming space each year.
  2. Over 2,100 attendees were physically present at Current 23, with more than 3,500 attending virtually.
  3. The ecosystem around Current 23 is expanding, welcoming more products, solutions, and vendors.
40 implied HN points 18 Mar 23
  1. Data lineage is important for understanding the origins and processing of data products.
  2. Data lineage is crucial for providing assurance that data is properly cleansed, enriched, and secured.
  3. Building a complete picture of data lineage from source to sink often requires stitching together multiple lineage graphs.
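Stitching lineage graphs amounts to merging per-tool edge lists and walking the combined graph back from a sink to its sources. A hypothetical sketch (the node names and one-parent-per-node assumption are illustrative):

```python
# Each tool emits its own lineage fragment as (upstream, downstream) edges.
ingest_lineage = [("orders_db", "kafka.orders")]
warehouse_lineage = [("kafka.orders", "staging.orders"),
                     ("staging.orders", "mart.revenue")]

def stitch(*graphs):
    """Merge per-tool lineage fragments into one edge list."""
    return [edge for graph in graphs for edge in graph]

def upstream_of(node, edges):
    """All transitive sources feeding `node` (assumes one parent per node)."""
    parents = {dst: src for src, dst in edges}
    chain = []
    while node in parents:
        node = parents[node]
        chain.append(node)
    return chain

edges = stitch(ingest_lineage, warehouse_lineage)
print(upstream_of("mart.revenue", edges))
# ['staging.orders', 'kafka.orders', 'orders_db']
```

Standards like OpenLineage exist precisely so these fragments share a common edge format and can be merged this way.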
30 implied HN points 23 Jun 23
  1. This post is Part 5 of the series on the real-time streaming ecosystem.
  2. It covers serving real-time data to end users.
  3. The series walks through an end-to-end analytical use case.
20 implied HN points 21 Aug 23
  1. The concept of 'data divide' distinguishes operational data plane from analytical data plane.
  2. Zhamak Dehghani describes these architectural data planes in the dynamic data management landscape.
  3. Understanding the distinction between operational and analytical data planes is key in today's data management.
30 implied HN points 01 Jun 23
  1. This post is part of a series on the real-time streaming ecosystem.
  2. The post covers stream processors in the real-time use case.
  3. There is a recommendation to read previous posts to better understand the topic.
40 implied HN points 08 Feb 23
  1. Confluent needs to redefine itself with a new identity that includes Flink while staying connected to Kafka.
  2. The title of being 'the Flink company' is now open for others to seize.
  3. Competitors should consider forming partnerships to challenge Confluent in the market.
30 implied HN points 08 May 23
  1. The post covers connectors, CDC, ELT, and rETL solutions.
  2. It is part 2 of a series on the real-time streaming ecosystem.
  3. The real-time ecosystem is rapidly growing, with many tools available in the market.
30 implied HN points 13 Feb 23
  1. Zero-ETL integrates transactional and analytical services for real-time analytics.
  2. ETL extracts data, transforms it, then loads it into an analytical database.
  3. ELT extracts and loads the data first, then transforms it inside the OLAP database using SQL.
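The ELT pattern can be shown end to end with the stdlib `sqlite3` standing in for the OLAP store (table and column names here are made up for illustration): raw rows land unchanged, then SQL inside the database does the transformation.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (customer TEXT, amount_cents INTEGER)")

# Extract + Load: raw data lands as-is, no pre-transformation.
raw = [("alice", 1250), ("bob", 300), ("alice", 450)]
con.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw)

# Transform: SQL inside the analytical store produces the modeled table.
con.execute("""
    CREATE TABLE order_totals AS
    SELECT customer, SUM(amount_cents) / 100.0 AS total_dollars
    FROM raw_orders GROUP BY customer
""")
print(con.execute("SELECT * FROM order_totals ORDER BY customer").fetchall())
# [('alice', 17.0), ('bob', 3.0)]
```

Classic ETL would run the `SUM`/unit conversion in a separate engine before loading; zero-ETL aims to remove the pipeline entirely by replicating the transactional store into the analytical one.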
1 HN point 04 Mar 24
  1. RAG (Retrieval-Augmented Generation) improves large language model responses by grounding the model's answers in retrieved supporting documents.
  2. For real-time applications like AI chatbots using RAG, ensuring the freshness and accuracy of the data supplied to the models through continuous updates is crucial.
  3. Utilizing vector indexes in platforms like Apache Pinot can help optimize similarity searches for tasks like finding relevant documents to enhance AI responses.
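The retrieval step of RAG reduces to: embed the question, find the most similar stored document, and prepend it to the prompt. A minimal sketch; real systems use a learned embedding model and a vector index (in Pinot, pg_vector, etc.), so the keyword-overlap "similarity" and the two documents below are purely illustrative:

```python
docs = {
    "pinot": "Apache Pinot supports vector indexes for similarity search.",
    "kafka": "Apache Kafka is a distributed event streaming platform.",
}

def score(question, doc):
    # Toy similarity: count shared lowercase words (stand-in for
    # cosine similarity between real embeddings).
    return len(set(question.lower().split()) & set(doc.lower().split()))

def retrieve(question):
    """Return the stored document most similar to the question."""
    return max(docs.values(), key=lambda d: score(question, d))

def build_prompt(question):
    # Augment the prompt with retrieved context before calling the LLM.
    return f"Context: {retrieve(question)}\nQuestion: {question}"

print(build_prompt("Does Pinot support vector similarity search?"))
```

Keeping `docs` continuously updated is the "freshness" point in the takeaways: the LLM only sees what retrieval hands it.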
20 implied HN points 13 Apr 23
  1. Materialized views enable real-time analytics.
  2. Real-time analytics involve processing information as events occur.
  3. Consider subscribing to learn more about materialized views and analytics.
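A materialized view precomputes a query result and refreshes it as events arrive, so reads never re-scan history. A sketch of the incremental-maintenance idea (the per-user click count is a made-up example):

```python
# Materialized result of "SELECT user, COUNT(*) ... GROUP BY user",
# maintained incrementally instead of recomputed per read.
view = {}

def on_event(user):
    """Refresh the view as each event occurs; reads stay O(1)."""
    view[user] = view.get(user, 0) + 1

for user in ["a", "b", "a", "a"]:
    on_event(user)

print(view)  # {'a': 3, 'b': 1}
```

Streaming databases apply this same idea to arbitrary SQL: the view's result is updated per incoming event rather than on demand.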
20 implied HN points 07 Mar 23
  1. Messaging brokers like Apache Kafka require smarter clients
  2. HTAP databases support both OLTP and OLAP workloads with no ETL
  3. Absorbing technologies leads to the trend of smarter data stores
1 HN point 20 Dec 23
  1. One Big Table (OBT) stores all data in a single table, making it easy to manage and query.
  2. Star Schema in data modeling involves identifying core domain entities and their relationships.
  3. Operational storage can keep dimensions in an OLTP database and event data in an event store for quicker access.
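The two modelling styles can be contrasted with stdlib `sqlite3` (table names and sample data are invented for illustration): the star schema joins a fact table to dimension tables at query time, while OBT denormalizes everything into one wide table up front.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Star schema: fact table referencing a dimension table.
    CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, qty INTEGER);
    INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO fact_sales VALUES (1, 3), (2, 5), (1, 2);

    -- OBT: the same data flattened into a single table, no join needed.
    CREATE TABLE obt_sales AS
    SELECT p.name AS product, f.qty
    FROM fact_sales f JOIN dim_product p ON p.id = f.product_id;
""")

star = con.execute("""
    SELECT p.name, SUM(f.qty) FROM fact_sales f
    JOIN dim_product p ON p.id = f.product_id
    GROUP BY p.name ORDER BY p.name
""").fetchall()
obt = con.execute(
    "SELECT product, SUM(qty) FROM obt_sales GROUP BY product ORDER BY product"
).fetchall()
print(star == obt, star)  # both forms yield the same totals
```

The trade-off: OBT queries skip joins but duplicate dimension values; the star schema stays normalized but pays join cost at read time.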
4 HN points 19 Feb 23
  1. Stream Processing, Real-time OLAP, and Streaming Database are key concepts in handling real-time data.
  2. Choosing the right tool for the job is crucial in navigating the various products and vendors in the real-time data space.
  3. Interest in real-time, streaming data has sparked many questions about processing and serving real-time data efficiently.
0 implied HN points 06 Mar 24
  1. Data mesh concept involves reassigning data ownership to the domain that captured the data, simplifying data sharing among domains.
  2. In a centralized data mesh, infrastructure and self-services are centralized, making it suitable for teams early in their data mesh journey.
  3. Peer-To-Peer Data Mesh provides complete autonomy to domains, but finding data products without a centralized location can be challenging.