The hottest Data Warehousing Substack posts right now

And their main takeaways
Category: Top Technology Topics
davidj.substack 47 implied HN points 09 Dec 24
  1. There are three types of incremental models in SQLMesh: incremental by partition, by unique key, and by time range. Each type has its own method for handling how data updates are processed (see the sketch after this list).
  2. Incremental models can efficiently replace old data with new data, and SQLMesh offers better state management than tools like dbt. This allows for smoother updates without the need for a full refresh.
  3. Understanding how to set up these models can save time and resources. Properly configuring them allows for collaboration and clarity in data management, which is especially useful in larger teams.
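As a rough illustration of the time-range kind: the model only scans the interval currently being run or backfilled. This is a minimal sketch; the table and column names (db.events, raw.events, event_date) are invented for the example, so treat it as the shape of a SQLMesh model, not a drop-in definition.

```python
from pathlib import Path

# A minimal SQLMesh incremental-by-time-range model, written out as the SQL
# file SQLMesh expects under models/. All names here are hypothetical.
MODEL_SQL = """
MODEL (
  name db.events,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date
  )
);

SELECT id, event_date, payload
FROM raw.events
-- SQLMesh substitutes the interval being processed for these macros, so each
-- run (or backfill) touches only its own slice of time.
WHERE event_date BETWEEN @start_ds AND @end_ds
"""

Path("models").mkdir(exist_ok=True)
Path("models/events.sql").write_text(MODEL_SQL)
```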
timo's substack 314 implied HN points 05 Jun 23
  1. Product analytics tools like Amplitude, Mixpanel, and Heap are evolving to offer new features like marketing attribution and user experience analytics.
  2. New players in the market like Kubit are focusing on providing product analytics directly on cloud data warehouses.
  3. The future of analytics is moving towards event analytics, opening up new possibilities and challenges for businesses.
VuTrinh. 1 HN point 21 Sep 24
  1. ClickHouse built its internal data warehouse to better understand customer usage and improve its services. They collected data from multiple sources to gain valuable insights.
  2. They use tools like Airflow for scheduling and Superset for data visualization, making their data processing efficient (a minimal DAG sketch follows this list). This setup allows them to handle large volumes of data daily.
  3. Over time, ClickHouse evolved its system by adding dbt for data transformation and improving user experiences with better SQL query tools. They also incorporated real-time data to enhance their reporting.
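The post describes the architecture rather than code, but a scheduled ingest job of this kind typically reduces to a small Airflow DAG. A minimal sketch, assuming Airflow 2.4+ for the `schedule` parameter; the DAG, task, and function names are invented:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_usage_data(ds: str, **context) -> None:
    # Placeholder: pull one day of usage events from the source systems and
    # load them into the warehouse. A real pipeline would use hooks/operators
    # for each specific source instead of this stub.
    print(f"ingesting usage data for {ds}")


with DAG(
    dag_id="daily_usage_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # 'schedule_interval' on Airflow < 2.4
    catchup=False,
) as dag:
    PythonOperator(
        task_id="ingest_usage_data",
        python_callable=ingest_usage_data,
    )
```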
The Orchestra Data Leadership Newsletter 59 implied HN points 20 Mar 24
  1. Apache Iceberg introduces the Bring Your Own Storage (BYOS) concept, which is gaining popularity for efficient and reliable data management in distributed environments.
  2. Key features of Apache Iceberg include Atomic Transactions, Schema Evolution, Partitioning and Sorting, Time Travel (see the sketch after this list), Incremental Data Updates, Metadata Management, and compatibility with various data processing frameworks.
  3. Platforms like Snowflake are shifting towards supporting Iceberg due to its benefits in handling data efficiently and enabling a Bring Your Own Storage pattern.
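As a concrete taste of time travel, here is a minimal PySpark sketch. It assumes the Iceberg Spark runtime jar is on the classpath, Spark 3.3+ for the `TIMESTAMP AS OF` syntax, and a local Hadoop catalog; the catalog, table, and timestamp are invented:

```python
from pyspark.sql import SparkSession

# Configure a local, file-based Iceberg catalog (per the Iceberg docs).
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Current state of the table.
spark.sql("SELECT count(*) FROM local.db.events").show()

# Time travel: query the table as it existed at an earlier point in time.
spark.sql(
    "SELECT count(*) FROM local.db.events TIMESTAMP AS OF '2024-03-01 00:00:00'"
).show()

# Iceberg's metadata tables expose the snapshot history behind time travel.
spark.sql("SELECT snapshot_id, committed_at FROM local.db.events.snapshots").show()
```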
davidj.substack 167 implied HN points 19 Jul 23
  1. The Modern Data Stack (MDS) community has grown significantly over the years with various meetups and events.
  2. Using tools like Snowflake, dbt, and Looker in the Modern Data Stack improves data capabilities and productivity.
  3. Although some criticize the Modern Data Stack for its imperfections, it has greatly enhanced data handling and analytics for many organizations.
The Orchestra Data Leadership Newsletter 39 implied HN points 12 Jan 24
  1. Building data stacks for businesses involves using core software like Snowflake and Databricks, focusing on delivering business value efficiently.
  2. The recommended tools include DIY cloud solutions for streaming, Snowflake for transformations, and BigQuery or Snowflake for storage/warehouse needs.
  3. Using a comprehensive tool like Orchestra can facilitate end-to-end data pipeline management without requiring a large data team, while keeping costs down.
Practical Data Engineering Substack 59 implied HN points 18 Sep 23
  1. Data extraction from relational databases is important for building data pipelines. It's key to choose the right method based on factors like the type and size of the data.
  2. There are several techniques to detect changes in the data, including using timestamps and database triggers (a timestamp-watermark sketch follows this list). These help identify which new or changed records need to be extracted.
  3. Different data extraction patterns exist, like periodic full reloads or incremental updates. Each method has its pros and cons, and the choice depends on the specific data needs and conditions.
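A minimal sketch of the timestamp technique: keep a watermark of the last `updated_at` seen and extract only newer rows. It uses the standard-library sqlite3 module so it runs as-is; the table and column names are invented:

```python
import sqlite3

# Toy source table with an update timestamp per row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    INSERT INTO orders VALUES
        (1, 10.0, '2024-01-01 09:00:00'),
        (2, 25.0, '2024-01-02 12:30:00'),
        (3, 40.0, '2024-01-03 08:15:00');
""")

last_watermark = "2024-01-01 23:59:59"  # normally persisted between runs

# Incremental extraction: only rows changed since the watermark.
rows = conn.execute(
    "SELECT id, amount, updated_at FROM orders "
    "WHERE updated_at > ? ORDER BY updated_at",
    (last_watermark,),
).fetchall()

for row in rows:
    print(row)  # only orders 2 and 3 are extracted

# Advance the watermark to the newest timestamp seen, ready for the next run.
if rows:
    last_watermark = rows[-1][2]
```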
davidj.substack 47 implied HN points 23 Feb 24
  1. Real-time data streaming from databases like MySQL to data warehouses such as Snowflake can significantly reduce analytics latency, making data processing faster and more efficient (a CDC-event sketch follows this list).
  2. Streamkap offers a cost-effective streaming ETL solution, promising to be both faster and cheaper than batch-based tools like Fivetran.
  3. Implementing Streamkap in data architectures can lead to substantial improvements, such as reducing data update lag to under five minutes and delivering real-time analytics value to customers.
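Streaming ETL of this kind is typically built on change data capture (CDC). As a rough illustration (not Streamkap's actual API), here is how a Debezium-style change event from MySQL might be applied downstream; the event shape follows Debezium's documented envelope, and the target "table" is an in-memory stand-in:

```python
import json

# A simplified Debezium-style change event: "op" is c(reate)/u(pdate)/
# d(elete)/r(ead, i.e. snapshot), with "before"/"after" row images.
# Field names are illustrative.
event_json = """
{
  "payload": {
    "op": "u",
    "before": {"id": 42, "status": "pending"},
    "after":  {"id": 42, "status": "shipped"}
  }
}
"""


def apply_change(event: dict, table: dict) -> None:
    """Apply one CDC event to a 'table' keyed by primary key."""
    payload = event["payload"]
    if payload["op"] in ("c", "u", "r"):
        row = payload["after"]
        table[row["id"]] = row                    # upsert by primary key
    elif payload["op"] == "d":
        table.pop(payload["before"]["id"], None)  # delete by primary key


target = {42: {"id": 42, "status": "pending"}}
apply_change(json.loads(event_json), target)
print(target)  # {42: {'id': 42, 'status': 'shipped'}}
```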
Brick by Brick 9 implied HN points 01 Mar 24
  1. Snowflake's stock dropped significantly after the announcement of CEO Frank Slootman's retirement, with a key concern being the impact of Apache Iceberg on moving data out of Snowflake.
  2. Apache Iceberg makes it efficient to migrate data out of Snowflake to other systems for processing, threatening Snowflake revenue on both storage and compute.
  3. The paradigm shift towards technologies like Iceberg takes time in enterprise settings but can have a significant impact, highlighting the importance of capturing the compute dollars in data processing.
Data Plumbers 2 HN points 01 Apr 24
  1. Microsoft Fabric Mirroring continuously replicates operational databases into Microsoft Fabric, giving organizations near-real-time access to that data for analytics.
  2. Mirroring enables universal access to various databases, real-time data replication, and granular control over data ingestion into Microsoft Fabric's Data Warehousing experience.
  3. With Mirroring, organizations can get zero-ETL insights, keep the replicated data in Fabric's OneLake storage, and shorten the path from data to action.
nonamevc 3 HN points 05 Feb 24
  1. There are multiple types of emails involved in PLG B2B SaaS, including transactional, product-related, and marketing campaigns.
  2. Challenges in managing email in PLG B2B SaaS include the need for orchestration, tedious content management, and high data volume.
  3. Companies like Inflection and Humanic offer new perspectives and solutions to address the complexities of managing email in PLG B2B SaaS.
nonamevc 6 HN points 22 Mar 23
  1. Consider the timing and readiness of your organization before implementing new tools in the B2B analytics stack.
  2. In the founding stage, focus on qualitative data, understanding customer needs, and building a customer profile.
  3. During the growth stage, invest in sophisticated analytics tools, like data warehouses and experimentation platforms, to effectively manage growing data.
Thái | Hacker | Kỹ sư tin tặc 39 implied HN points 21 Oct 18
  1. Different business strategies: Amazon focused on being first to market with Redshift, while Snowflake prioritized a new architecture separating compute and storage for a more scalable system.
  2. Cloud computing advantages: Cloud services like Redshift and Snowflake offer flexibility, cost-effectiveness, and scalability compared to traditional on-premise data warehouses.
  3. Market competition: Amazon leads the cloud market with Microsoft following closely, while Google is catching up with strong computing infrastructure despite starting later.
The Orchestra Data Leadership Newsletter 0 implied HN points 08 Oct 23
  1. Understanding the architectural structure of data lakes is crucial for data leaders to make informed decisions on data storage.
  2. File formats play a significant role in data storage efficiency, querying capabilities, and overall costs in a data lake architecture (see the Parquet sketch after this list).
  3. Choosing between data lake providers or data warehouses can be complex due to the influence of underlying technologies, like object stores and file formats.
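To make the file-format point concrete: columnar formats like Parquet let an engine read only the columns (and row groups) a query needs. A minimal sketch with pyarrow; the table contents and path are invented:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to Parquet with compression.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["DE", "US", "US", "FR"],
    "spend":   [9.5, 120.0, 33.3, 4.0],
})
pq.write_table(table, "/tmp/events.parquet", compression="zstd")

# Column pruning: only 'country' and 'spend' are read back from disk,
# which is where columnar formats save I/O (and, in the cloud, cost).
subset = pq.read_table("/tmp/events.parquet", columns=["country", "spend"])
print(subset.to_pydict())
```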
realkinetic 0 implied HN points 15 Jul 20
  1. ETL processes are vital for data analytics, involving extracting, transforming, and loading data for storage in a warehouse.
  2. GCP offers options like Data Fusion and Cloud Dataprep for implementing ETL pipelines, catering to varying technical skill levels and preferences.
  3. Alternative approaches on GCP for ETL include using services like Cloud Dataflow for more code-heavy processes, or leveraging BigQuery for ELT if your team is SQL-focused (a minimal sketch follows this list).
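A minimal sketch of the BigQuery ELT pattern: load raw data first, then transform it with SQL inside the warehouse. It assumes Google Cloud credentials are configured and uses the official google-cloud-bigquery client; the dataset and table names are invented:

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up project/credentials from the environment

# The "T" of ELT happens inside the warehouse, in SQL.
transform_sql = """
CREATE OR REPLACE TABLE analytics.daily_orders AS
SELECT
  DATE(created_at) AS order_date,
  COUNT(*)         AS orders,
  SUM(amount)      AS revenue
FROM raw.orders
GROUP BY order_date
"""

job = client.query(transform_sql)  # runs the transformation in BigQuery
job.result()                       # block until the job completes
print(f"job {job.job_id} finished")
```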
The Orchestra Data Leadership Newsletter 0 implied HN points 19 Oct 23
  1. The evolution of data engineering tools can be likened to the concept of limits in mathematics: batch processes tend towards 'streaming' use cases in the limit, and Lakehouses play a role in this transition.
  2. Databricks, developed by the creators of Apache Spark, excels at loading data from data lakes, handling schemas, and treating data sources as streams (see the sketch after this list), making it a valuable tool for data processing.
  3. While Databricks offers advanced capabilities in data ingestion, transformation, and machine learning operations, specific real-time use cases may still call for custom infrastructure, so evaluating tools like Databricks remains a nuanced exercise.
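"Treating a data source as a stream" is the core idea of Spark Structured Streaming, which Databricks builds on. A minimal sketch in open-source PySpark; the directory paths and schema are invented:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("lake-stream-sketch").getOrCreate()

# Streaming file sources require an explicit schema up front.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

# New JSON files landing in the raw directory are picked up incrementally.
events = spark.readStream.schema(schema).json("/lake/raw/events/")

# Write them onward as Parquet; the checkpoint tracks progress across restarts.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/lake/bronze/events/")
    .option("checkpointLocation", "/lake/_checkpoints/events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```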