davidj.substack

davidj.substack focuses on the exploration and application of data strategies, tools, and practices for organizational impact. It covers the evolution of the Modern Data Stack, the importance of semantic layers, data team dynamics, and tooling efficiencies to enhance data handling, analytics, and management.

Data Strategy and Analysis, Data Team Management, Modern Data Stack Technologies, Semantic Layers in Data Analytics, Data Quality and Governance, Tooling and Automation in Data Management, Productivity and Efficiency in Data Operations, Real-time Data Processing

The hottest Substack posts of davidj.substack

And their main takeaways
35 implied HN points 20 Feb 25
  1. Polars Cloud allows for scaling across multiple machines, making it easier to handle large datasets than using just a single machine. This helps in processing data faster and more efficiently.
  2. Polars is simpler to use than Pandas and often performs better, especially when transforming data for machine learning tasks. Its API will feel familiar to anyone who has worked with dataframe libraries.
  3. SQL scales well on cloud data warehouses, but Pandas and R have long struggled with large-scale transformations. Polars Cloud aims to close that gap with a more scalable alternative.

SDF

59 implied HN points 12 Feb 25
  1. SDF and SQLMesh are alternatives to dbt for data transformation. They are both built with modern tech and aim to provide better ease of use and performance.
  2. SDF has a built-in local database, allowing developers to test queries without costs from a cloud data warehouse. This can speed up development and reduce costs.
  3. Both tools offer column-level lineage to track changes, but SQLMesh provides a better workflow for managing breaking changes. SQLMesh also has unique features like Virtual Data Environments that enhance developer experience.
47 implied HN points 07 Feb 25
  1. Building software is now much easier and cheaper because of AI tools. This means more people can try out their ideas even if they aren't experts.
  2. People who can read and write code can now create custom software for their specific needs. This opens up possibilities for personal projects that were once too complex or costly.
  3. The trend of making software easier to build may lead to a huge increase in the number of new inventions and tools. More ease means more experimentation and creativity happening at a faster pace.
35 implied HN points 29 Jan 25
  1. Jevons Paradox shows that when something becomes cheaper to use, people tend to use more of it, which can actually lead to higher overall consumption. This means that efficiency gains may not reduce usage as expected.
  2. When teams save money through efficiency, they're likely to spend their budgets on new projects instead of cutting costs. They want to use their saved money to create more value.
  3. Using tools that are easier and more efficient can lead to discovering new ways to use them, increasing overall spending on those tools instead of cutting back. This often justifies bigger budgets for future projects.
59 implied HN points 13 Jan 25
  1. The gold layer in data architecture has drawbacks, including the loss of information and inflexibility for users. This means important data could be missing, and making changes is hard.
  2. Universal semantic layers offer a better solution by allowing users to request data in plain language without complicated queries. This makes data use easier and more accessible for everyone.
  3. Switching from a gold layer to a semantic layer can improve efficiency and user experience, as it avoids the rigid structure of the gold layer and adapts to user needs more effectively.
179 implied HN points 02 Dec 24
  1. SQLMesh recently announced that it is backwards compatible with dbt projects. This means teams can gradually switch to SQLMesh without having to do a big migration all at once.
  2. Using SQLMesh can help improve the clarity of data workflows and avoid broken DAGs during development. It offers features that make managing complex data stacks easier.
  3. Migrating to SQLMesh is possible even for those who aren't very tech-savvy. The process can be simple and done in an afternoon, making it accessible for teams to test and implement.
179 implied HN points 25 Nov 24
  1. Medallion architecture is not just about data modeling but represents a high-level structure for organizing data processes. It helps in visualizing data flow in a project.
  2. The architecture has three main layers: Bronze deals with cleaning and preparing data, Silver creates a structured data model, and Gold is about making data easy to access and use.
  3. The terms Bronze, Silver, and Gold may sound appealing to non-technical users but could be more accurately described. Renaming these layers could better reflect their actual roles in data handling.
119 implied HN points 13 Dec 24
  1. SQLMesh offers a range of command-line interface commands for managing and maintaining data projects. For example, the `clean` command clears the project's cache and build artifacts, which can resolve issues that arise during execution.
  2. SQLMesh has features that improve day-to-day development, like automatic data contract handling and optimized incremental models, making it easier to work with large datasets without unnecessary cost.
  3. Competition in the data transformation space is healthy. It pushes tools like dbt and sqlmesh to improve, ultimately benefiting users by providing better features and experiences.
59 implied HN points 16 Dec 24
  1. Building integrations can seem tough, but understanding the metadata available can simplify the process. It's important to leverage existing tools to create new functionalities efficiently.
  2. Trying out new ideas, even if they might fail, is essential for learning and discovering possibilities. Taking small steps can help you manage potential setbacks.
  3. Creating a command to generate projects based on existing data models can streamline workflows. It allows for easier implementation of complex data relationships when set up correctly.
71 implied HN points 05 Dec 24
  1. Using dlt with the Bluesky API makes data extraction easy. It saves time by handling metadata and schema changes automatically.
  2. dlt simplifies dealing with nested data by creating separate tables. This makes it easier to manage complex data structures.
  3. sqlmesh can quickly generate SQL models based on dlt pipelines. This feature streamlines the workflow and reduces manual setup time.
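The nested-data point above can be sketched in plain Python. This is not dlt's actual code, only a stdlib illustration of the normalization it automates: child lists are split into their own tables and linked back to the parent by an id. The field names are hypothetical, and the `posts__likes` child-table name assumes dlt's double-underscore naming convention.

```python
# Stdlib sketch of nested-to-relational normalization (what dlt automates).
def normalize(posts):
    parent_rows, child_rows = [], []
    for post in posts:
        # Scalar fields stay in the parent table.
        parent_rows.append({"id": post["id"], "text": post["text"]})
        # Each nested list becomes rows in a child table with a back-reference.
        for like in post.get("likes", []):
            child_rows.append({"post_id": post["id"], "user": like})
    return {"posts": parent_rows, "posts__likes": child_rows}

sample = [
    {"id": 1, "text": "hello", "likes": ["a", "b"]},
    {"id": 2, "text": "world", "likes": []},
]
tables = normalize(sample)
```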
71 implied HN points 04 Dec 24
  1. dlt is a Python tool that helps organize messy data into clear, structured datasets. It's easy to use and can quickly load data from many sources.
  2. Using AI tools like Windsurf can make coding feel more collaborative. They help you find solutions faster and reduce the burden of coding from scratch.
  3. Storing data in formats like parquet can make processing much quicker. Simplifying your data handling can save you a lot of time and resources.
71 implied HN points 03 Dec 24
  1. There's a new public repository called bluesky-data where people can collaborate and follow along with its development. It's easy to get started by setting it up on your local machine.
  2. Pairing sqlmesh with the Bluesky data combines real-time availability with batch-style processing, so you get both immediate updates and a complete historical view of the data.
  3. It's better to start with dlt and then initialize sqlmesh within that project. This way, you can efficiently manage large datasets without needing to compute everything each time.
47 implied HN points 20 Dec 24
  1. If you're using dbt to run analytics, switching to sqlmesh is a good idea. It offers more features and is easy to learn while still being compatible with dbt.
  2. sqlmesh helps manage data environments and is more comprehensive in handling analytics tasks compared to dbt. It's simpler to transition from dbt to sqlmesh than from older methods like stored procedures.
  3. When using sqlmesh, think about where to run it and how to store its state. You have choices like using a different database or a cloud service, which can save you money and hassle.
59 implied HN points 10 Dec 24
  1. Virtual data environments in SQLMesh let you test changes without affecting the main data. This means you can quickly see how something would work before actually doing it.
  2. Using snapshots, you can create different versions of data models easily. Each version is linked to a unique fingerprint, so they don't mess with each other.
  3. Creating and managing development environments is much easier now. With just a command, you can set up a new environment that looks just like production, making development smoother.
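The snapshot-and-fingerprint idea above can be illustrated with a toy sketch. SQLMesh's real fingerprinting is considerably more involved (it parses the SQL and accounts for upstream models); this only shows the principle that each model version gets a content-derived identity, so environments can point at different versions without colliding.

```python
import hashlib

def fingerprint(model_sql: str) -> str:
    # Content-addressed version id: same definition, same fingerprint.
    return hashlib.sha256(model_sql.encode()).hexdigest()[:12]

v1 = fingerprint("SELECT id, amount FROM raw.orders")
v2 = fingerprint("SELECT id, amount, status FROM raw.orders")

# An environment is then just a mapping from model name to a version;
# a change in dev creates a new snapshot while prod is untouched.
prod = {"orders": v1}
dev = {"orders": v2}
```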
83 implied HN points 21 Nov 24
  1. BI tools often get replaced every 2 to 3 years, but switching them is tough. You have to deal with many dashboards and how people have used them over time.
  2. Many teams stick with tools they know well, like Power BI or Tableau, because of comfort and familiarity. Sometimes it’s easier to choose what they’ve seen work at past jobs.
  3. The best BI tool really isn't a tool at all. It's about how someone uses data to make better choices and understand what's happening, with the tool just being a support for that process.
59 implied HN points 06 Dec 24
  1. There are different types of models in sqlmesh, such as full, view, and embedded models, each having unique functions and uses. It's important to choose the right model type based on how fresh or how often you need the data.
  2. SCD Type 2 models are useful for managing records that change over time, as they track the history of changes. This can make analyzing data trends much easier and faster.
  3. External models in sqlmesh allow you to reference database objects not managed by your project. This can simplify data modeling and documentation, as they automatically gather useful metadata.
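The SCD Type 2 bookkeeping mentioned in point 2 can be sketched in a few lines: when a tracked attribute changes, the current row is closed out and a new row opens, preserving history. Column names here are illustrative, not sqlmesh's actual schema.

```python
from datetime import date

def apply_change(history, key, new_value, as_of):
    """Minimal SCD Type 2 update over a list of versioned rows."""
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            if row["value"] == new_value:
                return history          # no change, nothing to record
            row["valid_to"] = as_of     # close out the current version
    history.append({"key": key, "value": new_value,
                    "valid_from": as_of, "valid_to": None})
    return history

h = [{"key": "cust_1", "value": "bronze",
      "valid_from": date(2024, 1, 1), "valid_to": None}]
apply_change(h, "cust_1", "gold", date(2024, 6, 1))
```

Queries for "what was this value on date X" then become a simple filter on `valid_from`/`valid_to`, which is exactly what makes trend analysis over changing records easier.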
47 implied HN points 12 Dec 24
  1. Unit tests and data tests are different. Unit tests check if a function works right with set inputs, while data tests check if the data meets certain conditions.
  2. Running tests locally can save costs and speed things up. If you test your code on your own machine, you don’t have to pay for the cloud data warehouse until you’re ready.
  3. Creating external models in sqlmesh can be automated, making it easier to document source tables. You just run a command to generate the necessary files instead of doing it manually.
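The unit-test/data-test distinction in point 1 can be made concrete with a stdlib sketch: a unit test pins down a transformation's output for fixed inputs, while a data test asserts a property that every row of real data must satisfy. The function and fields are hypothetical.

```python
def to_margin(revenue, cost):
    """The transformation under test."""
    return (revenue - cost) / revenue

# Unit test: known inputs, known output; no real data involved,
# so it can run locally for free.
assert to_margin(100, 60) == 0.4

# Data test: a condition evaluated over whatever rows actually arrive.
rows = [{"revenue": 100, "cost": 60}, {"revenue": 50, "cost": 20}]
assert all(r["revenue"] > 0 for r in rows)  # e.g. "revenue is positive"
```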
47 implied HN points 11 Dec 24
  1. When making changes to data models, it's important to identify if they are breaking or non-breaking changes. Breaking changes affect downstream models, while non-breaking changes do not.
  2. SQLMesh automatically analyzes changes to understand their impact on other models. This helps developers avoid manual tracking and reduces the chances of errors.
  3. New features in SQLMesh will allow for more precise tracking of changes at the column level. This means less unnecessary work when something minor is modified.
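The breaking/non-breaking classification above can be sketched with a toy rule: dropping a column that downstream models might select is breaking, while merely adding one is not. SQLMesh's real analysis works on parsed SQL and the dependency graph, not bare column sets; this only illustrates the idea.

```python
def classify_change(old_columns, new_columns):
    """Toy change classifier: removed columns make a change breaking."""
    removed = set(old_columns) - set(new_columns)
    return "breaking" if removed else "non-breaking"

added = classify_change(["id", "amount"], ["id", "amount", "status"])
dropped = classify_change(["id", "amount"], ["id"])
```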
47 implied HN points 09 Dec 24
  1. There are three types of incremental models in sqlmesh: Incremental by Partition, Unique Key, and Time Range. Each type has its own unique method for handling how data updates are processed.
  2. Incremental models can efficiently replace old data with new data, and sqlmesh offers better state management than tools like dbt, allowing smoother updates without a full refresh.
  3. Understanding how to set up these models can save time and resources. Properly configuring them allows for collaboration and clarity in data management, which is especially useful in larger teams.
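The time-range flavor mentioned in point 1 can be sketched in plain Python: rows inside the processed window are replaced while rows outside it are kept, so only the window's data is recomputed. This is an illustration of the pattern, not sqlmesh's implementation.

```python
def merge_increment(table, increment, start, end):
    """Replace rows whose partition date falls in [start, end]."""
    kept = [r for r in table if not (start <= r["ds"] <= end)]
    return kept + increment

table = [{"ds": "2024-01-01", "n": 1}, {"ds": "2024-01-02", "n": 2}]
new = [{"ds": "2024-01-02", "n": 99}, {"ds": "2024-01-03", "n": 3}]
table = merge_increment(table, new, "2024-01-02", "2024-01-03")
```

ISO date strings compare correctly as plain strings, which keeps the sketch dependency-free; a real engine would do this comparison on partition metadata in the warehouse.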
59 implied HN points 14 Nov 24
  1. Data tools create metadata, which is important for understanding what's happening in data management. Every tool involved in data processing generates information about itself, making it a catalog.
  2. Not all catalogs are for people. Some are meant for systems to optimize data processing and querying. These system catalogs help improve efficiency behind the scenes.
  3. To make data more accessible, catalogs should be integrated into the tools users already work with. This way, data engineers and analysts can easily find the information they need without getting overwhelmed by unnecessary data.
23 implied HN points 19 Dec 24
  1. A new package called 'sqlmesh-cube' is available for anyone to use. You can easily install it with pip.
  2. The package adds a CLI command that outputs JSON describing how sqlmesh models relate to each other, which is the groundwork for building a semantic layer.
  3. This was the author's first package, and they learned a lot about the publishing process along the way. They are open to feedback and requests for updates.
23 implied HN points 18 Dec 24
  1. The main goal is to create a command that generates metadata to build a semantic layer for SQL models. This is important because it helps in understanding the structure and relationships within the data.
  2. AI can enhance the process by taking the generated metadata and improving it for better usability. Using tools like OpenAI can make the process easier and faster.
  3. There's an ongoing focus on creating practical solutions rather than aiming for perfection. It's okay to make adjustments and improvements along the way as you learn what works best.
59 implied HN points 31 Oct 24
  1. Data Twitter was once a lively community for people interested in data, but it has changed significantly over time. People are looking for new platforms to connect and share ideas.
  2. Bluesky is gaining popularity as a new home for data enthusiasts, offering features that help with discoverability and community building. This makes it easier for users to engage and find relevant content.
  3. Writing regularly has been rewarding and helpful in personal growth. It's a great way to clarify thoughts and boost confidence in communication, so everyone should consider writing for themselves.
35 implied HN points 18 Nov 24
  1. Taking risks is a natural part of business. Employees at all levels face risks, and their roles should help manage those risks effectively.
  2. Data teams need to engage with business risks and help optimize rewards. Building data infrastructure should only be a means to support this goal.
  3. Not everyone is suited for risk-taking roles in the private sector. Some people may excel at politics but fail to deliver real results, which leads to inefficiencies in recruitment.
11 implied HN points 07 Nov 24
  1. Things are not always what they seem; sometimes we misinterpret situations based on limited information.
  2. Even when it feels like everything is falling apart, there is still hope for a better future if we stay focused.
  3. Justice may take time, but it will eventually prevail, and we must continue to work towards the goals we believe in.
71 implied HN points 15 Mar 24
  1. A data product can take various forms and be consumed in different ways, always requiring an interface for consumption.
  2. From raw data like CSV files to refined database tables, streams, JSON files, and ORM abstracted layers, all can be considered data products.
  3. BI tools, AI automation, and semantic layers play crucial roles in creating consumable data products for various industries, making data more refined and accessible.
167 implied HN points 19 Jul 23
  1. The Modern Data Stack (MDS) community has grown significantly over the years with various meetups and events.
  2. Using tools like Snowflake, dbt, and Looker in the Modern Data Stack improves data capabilities and productivity.
  3. Although some criticize the Modern Data Stack and its imperfections, it has greatly enhanced data handling and analytics for many organizations.
71 implied HN points 16 Feb 24
  1. Data teams face challenges when separated from product engineering, leading to loss of metadata and concerns about data quality. Data contracts can help address these issues by defining the nature, completeness, and format of shared data.
  2. Integrating data professionals within product teams can enhance understanding and usage of data, reducing the need for separate contracts. This approach allows for direct-to-consumer, organic data processes.
  3. Centralized data platform teams can establish common standards and infrastructure, enabling embedded data personnel in product teams to work efficiently. This collaborative model streamlines data transformation and enhances data accessibility.
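The idea of a data contract defining the nature, completeness, and format of shared data can be made concrete as executable checks: the producing team declares field names, types, and nullability, and consumers validate records against that declaration. All field names and the check logic here are hypothetical, not any specific contract tool's API.

```python
# Hypothetical contract: field -> (expected type, nullable?)
CONTRACT = {
    "order_id": (int, False),
    "email": (str, True),
}

def violations(record):
    """Return a list of contract violations for one record."""
    problems = []
    for field, (typ, nullable) in CONTRACT.items():
        if field not in record:
            problems.append(f"missing {field}")
        elif record[field] is None:
            if not nullable:
                problems.append(f"null {field}")
        elif not isinstance(record[field], typ):
            problems.append(f"bad type for {field}")
    return problems
```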
95 implied HN points 01 Nov 23
  1. Having a standard interface for semantic layers is crucial to prevent failure and ensure compatibility among different layers.
  2. SQL APIs offered by semantic layers may not be truly SQL, leading to potential confusion and challenges in querying data.
  3. Supporting REST HTTP interfaces for semantic layers enables a broader range of use cases, including data applications for internal and external purposes.
143 implied HN points 24 May 23
  1. Leaders may face difficult decisions when letting go of team members, even if it's the right thing to do.
  2. Communication and managing expectations are crucial in the process of letting go of team members.
  3. It is important to prioritize hiring the right people to avoid the challenging situation of having to let someone go.
107 implied HN points 26 Jul 23
  1. The modern data stack is evolving with new tools and options for data architecture.
  2. Key trends include the focus on data ingestion and telemetry, improved orchestration tools, and advancements in compute engines.
  3. Data consumption is being enhanced through self-serve AI capabilities, BI tools, and free-form analyst tools, all sitting on a semantic layer.
47 implied HN points 23 Feb 24
  1. Real-time data streaming from databases like MySQL to data warehouses such as Snowflake can significantly reduce analytics latency, making data processing faster and more efficient.
  2. Streamkap offers a cost-effective solution for streaming ETL, promising to be both faster and more affordable than traditional methods like Fivetran, providing a valuable option for data professionals.
  3. Implementing Streamkap in data architectures can lead to substantial improvements, such as reducing data update lag to under 5 minutes and delivering real-time analytics value for customers, showcasing the impact of cutting-edge data technology.
95 implied HN points 07 Jun 23
  1. Individual Contributor roles in technology allow technically skilled individuals to advance without moving into management.
  2. Specialized IC roles, like Staff or Principal, are crucial for making better technical decisions and preventing engineering issues.
  3. Having fewer hard-to-hire line managers and more experienced ICs can lead to better support and scaling in technical teams.
107 implied HN points 29 Mar 23
  1. Semantic layers reduce repetitive code by providing a consistent framework for queries.
  2. Semantic layers enhance data security by controlling access and reducing accidental exposure of sensitive data.
  3. A semantic layer defines entities and structures, while a metrics layer is a narrower subset focused mainly on defining metrics.
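Point 1's claim about repetitive code can be illustrated with a toy sketch: the metric is defined once and queries against it are generated, so every team computes "revenue" the same way instead of re-writing (and diverging on) the aggregation. The metric names and the generated SQL shape are purely illustrative, not any real semantic layer's API.

```python
# One shared definition of each metric's aggregation expression.
METRICS = {"revenue": "SUM(amount)"}

def query(metric, dimension, table):
    """Generate a grouped query from the shared metric definition."""
    expr = METRICS[metric]
    return (f"SELECT {dimension}, {expr} AS {metric} "
            f"FROM {table} GROUP BY {dimension}")

sql = query("revenue", "region", "orders")
```

Centralizing the definition is also what enables the security point above: access rules can be attached to the metric or entity once, rather than re-implemented in every hand-written query.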