davidj.substack

davidj.substack focuses on the exploration and application of data strategies, tools, and practices for organizational impact. It covers the evolution of the Modern Data Stack, the importance of semantic layers, data team dynamics, and tooling efficiencies to enhance data handling, analytics, and management.

Data Strategy and Analysis, Data Team Management, Modern Data Stack Technologies, Semantic Layers in Data Analytics, Data Quality and Governance, Tooling and Automation in Data Management, Productivity and Efficiency in Data Operations, Real-time Data Processing

The hottest Substack posts of davidj.substack

And their main takeaways
35 implied HN points 20 Feb 25
  1. Polars Cloud allows for scaling across multiple machines, making it easier to handle large datasets than using just a single machine. This helps in processing data faster and more efficiently.
  2. Polars is simpler to use than Pandas and often performs better, especially when transforming data for machine learning tasks. Its API will feel familiar to anyone who has worked with dataframe libraries.
  3. SQL scales well on cloud data warehouses, but Pandas and R have long struggled with large-scale transformations. Polars Cloud aims to close that gap with a more scalable alternative.

SDF

59 implied HN points 12 Feb 25
  1. SDF and SQLMesh are alternatives to dbt for data transformation. They are both built with modern tech and aim to provide better ease of use and performance.
  2. SDF has a built-in local database, allowing developers to test queries without costs from a cloud data warehouse. This can speed up development and reduce costs.
  3. Both tools offer column-level lineage to track changes, but SQLMesh provides a better workflow for managing breaking changes. SQLMesh also has unique features like Virtual Data Environments that enhance developer experience.
47 implied HN points 07 Feb 25
  1. Building software is now much easier and cheaper because of AI tools. This means more people can try out their ideas even if they aren't experts.
  2. People who can read and write code can now create custom software for their specific needs. This opens up possibilities for personal projects that were once too complex or costly.
  3. The trend of making software easier to build may lead to a huge increase in the number of new inventions and tools. More ease means more experimentation and creativity happening at a faster pace.
35 implied HN points 29 Jan 25
  1. Jevons Paradox shows that when something becomes cheaper to use, people tend to use more of it, which can actually lead to higher overall consumption. This means that efficiency gains may not reduce usage as expected.
  2. When teams save money through efficiency, they're likely to spend their budgets on new projects instead of cutting costs. They want to use their saved money to create more value.
  3. Using tools that are easier and more efficient can lead to discovering new ways to use them, increasing overall spending on those tools instead of cutting back. This often justifies bigger budgets for future projects.
59 implied HN points 13 Jan 25
  1. The gold layer in data architecture has drawbacks, including the loss of information and inflexibility for users. This means important data could be missing, and making changes is hard.
  2. Universal semantic layers offer a better solution by allowing users to request data in plain language without complicated queries. This makes data use easier and more accessible for everyone.
  3. Switching from a gold layer to a semantic layer can improve efficiency and user experience, as it avoids the rigid structure of the gold layer and adapts to user needs more effectively.
179 implied HN points 02 Dec 24
  1. SQLMesh recently announced that it is backwards compatible with dbt projects. This means teams can gradually switch to SQLMesh without having to do a big migration all at once.
  2. Using SQLMesh can help improve the clarity of data workflows and avoid broken DAGs during development. It offers features that make managing complex data stacks easier.
  3. Migrating to SQLMesh is possible even for those who aren't very tech-savvy. The process can be simple and done in an afternoon, making it accessible for teams to test and implement.
179 implied HN points 25 Nov 24
  1. Medallion architecture is not just about data modeling but represents a high-level structure for organizing data processes. It helps in visualizing data flow in a project.
  2. The architecture has three main layers: Bronze deals with cleaning and preparing data, Silver creates a structured data model, and Gold is about making data easy to access and use.
  3. The terms Bronze, Silver, and Gold may sound appealing to non-technical users but could be more accurately described. Renaming these layers could better reflect their actual roles in data handling.
119 implied HN points 13 Dec 24
  1. SQLMesh offers a range of command-line interface commands for managing and maintaining data projects. For example, the `clean` command clears the project's cache and build artifacts, which can resolve issues that arise during execution.
  2. SQLMesh has features that improve day-to-day development, like automatic data contract handling and optimized incremental models, making it easier to work with large datasets without unnecessary cost.
  3. Competition in the data transformation space is healthy. It pushes tools like dbt and sqlmesh to improve, ultimately benefiting users by providing better features and experiences.
59 implied HN points 16 Dec 24
  1. Building integrations can seem tough, but understanding the metadata available can simplify the process. It's important to leverage existing tools to create new functionalities efficiently.
  2. Trying out new ideas, even if they might fail, is essential for learning and discovering possibilities. Taking small steps can help you manage potential setbacks.
  3. Creating a command to generate projects based on existing data models can streamline workflows. It allows for easier implementation of complex data relationships when set up correctly.
71 implied HN points 05 Dec 24
  1. Using dlt with the Bluesky API makes data extraction easy. It saves time by handling metadata and schema changes automatically.
  2. dlt simplifies dealing with nested data by creating separate tables. This makes it easier to manage complex data structures.
  3. sqlmesh can quickly generate SQL models based on dlt pipelines. This feature streamlines the workflow and reduces manual setup time.
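The nested-data point above can be sketched in plain Python. This is not dlt's actual code, only a stdlib illustration of the normalization it automates: child lists are split into their own tables and linked back to the parent by an id. The field names are hypothetical, and the `posts__likes` child-table name assumes dlt's double-underscore naming convention.

```python
# Stdlib sketch of nested-to-relational normalization (what dlt automates).
def normalize(posts):
    parent_rows, child_rows = [], []
    for post in posts:
        # Scalar fields stay in the parent table.
        parent_rows.append({"id": post["id"], "text": post["text"]})
        # Each nested list becomes rows in a child table with a back-reference.
        for like in post.get("likes", []):
            child_rows.append({"post_id": post["id"], "user": like})
    return {"posts": parent_rows, "posts__likes": child_rows}

sample = [
    {"id": 1, "text": "hello", "likes": ["a", "b"]},
    {"id": 2, "text": "world", "likes": []},
]
tables = normalize(sample)
```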
71 implied HN points 04 Dec 24
  1. dlt is a Python tool that helps organize messy data into clear, structured datasets. It's easy to use and can quickly load data from many sources.
  2. Using AI tools like Windsurf can make coding feel more collaborative. They help you find solutions faster and reduce the burden of coding from scratch.
  3. Storing data in formats like parquet can make processing much quicker. Simplifying your data handling can save you a lot of time and resources.
71 implied HN points 03 Dec 24
  1. There's a new public repository called bluesky-data where people can collaborate and follow along with its development. It's easy to get started by setting it up on your local machine.
  2. Pairing sqlmesh with the Bluesky data combines real-time availability with batch-style processing, so you get both immediate updates and a complete historical view of the data.
  3. It's better to start with dlt and then initialize sqlmesh within that project. This way, you can efficiently manage large datasets without needing to compute everything each time.
47 implied HN points 20 Dec 24
  1. If you're using dbt to run analytics, switching to sqlmesh is a good idea. It offers more features and is easy to learn while still being compatible with dbt.
  2. sqlmesh helps manage data environments and is more comprehensive in handling analytics tasks compared to dbt. It's simpler to transition from dbt to sqlmesh than from older methods like stored procedures.
  3. When using sqlmesh, think about where to run it and how to store its state. You have choices like using a different database or a cloud service, which can save you money and hassle.
59 implied HN points 10 Dec 24
  1. Virtual data environments in SQLMesh let you test changes without affecting the main data. This means you can quickly see how something would work before actually doing it.
  2. Using snapshots, you can create different versions of data models easily. Each version is linked to a unique fingerprint, so they don't mess with each other.
  3. Creating and managing development environments is much easier now. With just a command, you can set up a new environment that looks just like production, making development smoother.
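The snapshot-and-fingerprint idea above can be illustrated with a toy sketch. SQLMesh's real fingerprinting is considerably more involved (it parses the SQL and accounts for upstream models); this only shows the principle that each model version gets a content-derived identity, so environments can point at different versions without colliding.

```python
import hashlib

def fingerprint(model_sql: str) -> str:
    # Content-addressed version id: same definition, same fingerprint.
    return hashlib.sha256(model_sql.encode()).hexdigest()[:12]

v1 = fingerprint("SELECT id, amount FROM raw.orders")
v2 = fingerprint("SELECT id, amount, status FROM raw.orders")

# An environment is then just a mapping from model name to a version;
# a change in dev creates a new snapshot while prod is untouched.
prod = {"orders": v1}
dev = {"orders": v2}
```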
83 implied HN points 21 Nov 24
  1. BI tools often get replaced every 2 to 3 years, but switching them is tough. You have to deal with many dashboards and how people have used them over time.
  2. Many teams stick with tools they know well, like Power BI or Tableau, because of comfort and familiarity. Sometimes it’s easier to choose what they’ve seen work at past jobs.
  3. The best BI tool really isn't a tool at all. It's about how someone uses data to make better choices and understand what's happening, with the tool just being a support for that process.
59 implied HN points 06 Dec 24
  1. There are different types of models in sqlmesh, such as full, view, and embedded models, each having unique functions and uses. It's important to choose the right model type based on how fresh or how often you need the data.
  2. SCD Type 2 models are useful for managing records that change over time, as they track the history of changes. This can make analyzing data trends much easier and faster.
  3. External models in sqlmesh allow you to reference database objects not managed by your project. This can simplify data modeling and documentation, as they automatically gather useful metadata.
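The SCD Type 2 bookkeeping mentioned in point 2 can be sketched in a few lines: when a tracked attribute changes, the current row is closed out and a new row opens, preserving history. Column names here are illustrative, not sqlmesh's actual schema.

```python
from datetime import date

def apply_change(history, key, new_value, as_of):
    """Minimal SCD Type 2 update over a list of versioned rows."""
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            if row["value"] == new_value:
                return history          # no change, nothing to record
            row["valid_to"] = as_of     # close out the current version
    history.append({"key": key, "value": new_value,
                    "valid_from": as_of, "valid_to": None})
    return history

h = [{"key": "cust_1", "value": "bronze",
      "valid_from": date(2024, 1, 1), "valid_to": None}]
apply_change(h, "cust_1", "gold", date(2024, 6, 1))
```

Queries for "what was this value on date X" then become a simple filter on `valid_from`/`valid_to`, which is exactly what makes trend analysis over changing records easier.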
47 implied HN points 12 Dec 24
  1. Unit tests and data tests are different. Unit tests check if a function works right with set inputs, while data tests check if the data meets certain conditions.
  2. Running tests locally can save costs and speed things up. If you test your code on your own machine, you don’t have to pay for the cloud data warehouse until you’re ready.
  3. Creating external models in sqlmesh can be automated, making it easier to document source tables. You just run a command to generate the necessary files instead of doing it manually.
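The unit-test/data-test distinction in point 1 can be made concrete with a stdlib sketch: a unit test pins down a transformation's output for fixed inputs, while a data test asserts a property that every row of real data must satisfy. The function and fields are hypothetical.

```python
def to_margin(revenue, cost):
    """The transformation under test."""
    return (revenue - cost) / revenue

# Unit test: known inputs, known output; no real data involved,
# so it can run locally for free.
assert to_margin(100, 60) == 0.4

# Data test: a condition evaluated over whatever rows actually arrive.
rows = [{"revenue": 100, "cost": 60}, {"revenue": 50, "cost": 20}]
assert all(r["revenue"] > 0 for r in rows)  # e.g. "revenue is positive"
```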
47 implied HN points 11 Dec 24
  1. When making changes to data models, it's important to identify if they are breaking or non-breaking changes. Breaking changes affect downstream models, while non-breaking changes do not.
  2. SQLMesh automatically analyzes changes to understand their impact on other models. This helps developers avoid manual tracking and reduces the chances of errors.
  3. New features in SQLMesh will allow for more precise tracking of changes at the column level. This means less unnecessary work when something minor is modified.
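The breaking/non-breaking classification above can be sketched with a toy rule: dropping a column that downstream models might select is breaking, while merely adding one is not. SQLMesh's real analysis works on parsed SQL and the dependency graph, not bare column sets; this only illustrates the idea.

```python
def classify_change(old_columns, new_columns):
    """Toy change classifier: removed columns make a change breaking."""
    removed = set(old_columns) - set(new_columns)
    return "breaking" if removed else "non-breaking"

added = classify_change(["id", "amount"], ["id", "amount", "status"])
dropped = classify_change(["id", "amount"], ["id"])
```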
47 implied HN points 09 Dec 24
  1. There are three types of incremental models in sqlmesh: Incremental by Partition, Unique Key, and Time Range. Each type has its own unique method for handling how data updates are processed.
  2. Incremental models can efficiently replace old data with new data, and sqlmesh offers better state management than tools like dbt, allowing smoother updates without a full refresh.
  3. Understanding how to set up these models can save time and resources. Properly configuring them allows for collaboration and clarity in data management, which is especially useful in larger teams.
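The time-range flavor mentioned in point 1 can be sketched in plain Python: rows inside the processed window are replaced while rows outside it are kept, so only the window's data is recomputed. This is an illustration of the pattern, not sqlmesh's implementation.

```python
def merge_increment(table, increment, start, end):
    """Replace rows whose partition date falls in [start, end]."""
    kept = [r for r in table if not (start <= r["ds"] <= end)]
    return kept + increment

table = [{"ds": "2024-01-01", "n": 1}, {"ds": "2024-01-02", "n": 2}]
new = [{"ds": "2024-01-02", "n": 99}, {"ds": "2024-01-03", "n": 3}]
table = merge_increment(table, new, "2024-01-02", "2024-01-03")
```

ISO date strings compare correctly as plain strings, which keeps the sketch dependency-free; a real engine would do this comparison on partition metadata in the warehouse.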
59 implied HN points 14 Nov 24
  1. Data tools create metadata, which is important for understanding what's happening in data management. Every tool involved in data processing generates information about itself, making it a catalog.
  2. Not all catalogs are for people. Some are meant for systems to optimize data processing and querying. These system catalogs help improve efficiency behind the scenes.
  3. To make data more accessible, catalogs should be integrated into the tools users already work with. This way, data engineers and analysts can easily find the information they need without getting overwhelmed by unnecessary data.
23 implied HN points 19 Dec 24
  1. A new package called 'sqlmesh-cube' is available for anyone to use. You can easily install it with pip.
  2. The package adds a CLI command that outputs JSON describing how sqlmesh models relate to each other, which is the groundwork for building a semantic layer.
  3. This was the author's first package, and they learned a lot about the publishing process along the way. They are open to feedback and requests for updates.
23 implied HN points 18 Dec 24
  1. The main goal is to create a command that generates metadata to build a semantic layer for SQL models. This is important because it helps in understanding the structure and relationships within the data.
  2. AI can enhance the process by taking the generated metadata and improving it for better usability. Using tools like OpenAI can make the process easier and faster.
  3. There's an ongoing focus on creating practical solutions rather than aiming for perfection. It's okay to make adjustments and improvements along the way as you learn what works best.
59 implied HN points 31 Oct 24
  1. Data Twitter was once a lively community for people interested in data, but it has changed significantly over time. People are looking for new platforms to connect and share ideas.
  2. Bluesky is gaining popularity as a new home for data enthusiasts, offering features that help with discoverability and community building. This makes it easier for users to engage and find relevant content.
  3. Writing regularly has been rewarding and helpful in personal growth. It's a great way to clarify thoughts and boost confidence in communication, so everyone should consider writing for themselves.
35 implied HN points 18 Nov 24
  1. Taking risks is a natural part of business. Employees at all levels face risks, and their roles should help manage those risks effectively.
  2. Data teams need to engage with business risks and help optimize rewards. Building data infrastructure should only be a means to support this goal.
  3. Not everyone is suited for risk-taking roles in the private sector. Some people may excel at politics but fail to deliver real results, which leads to inefficiencies in recruitment.
11 implied HN points 07 Nov 24
  1. Things are not always what they seem; sometimes we misinterpret situations based on limited information.
  2. Even when it feels like everything is falling apart, there is still hope for a better future if we stay focused.
  3. Justice may take time, but it will eventually prevail, and we must continue to work towards the goals we believe in.
71 implied HN points 15 Mar 24
  1. A data product can take various forms and be consumed in different ways, always requiring an interface for consumption.
  2. From raw data like CSV files to refined database tables, streams, JSON files, and ORM abstracted layers, all can be considered data products.
  3. BI tools, AI automation, and semantic layers play crucial roles in creating consumable data products for various industries, making data more refined and accessible.
167 implied HN points 19 Jul 23
  1. The Modern Data Stack (MDS) community has grown significantly over the years with various meetups and events.
  2. Using tools like Snowflake, dbt, and Looker in the Modern Data Stack improves data capabilities and productivity.
  3. Although some criticize the Modern Data Stack and its imperfections, it has greatly enhanced data handling and analytics for many organizations.
71 implied HN points 16 Feb 24
  1. Data teams face challenges when separated from product engineering, leading to loss of metadata and concerns about data quality. Data contracts can help address these issues by defining the nature, completeness, and format of shared data.
  2. Integrating data professionals within product teams can enhance understanding and usage of data, reducing the need for separate contracts. This approach allows for direct-to-consumer, organic data processes.
  3. Centralized data platform teams can establish common standards and infrastructure, enabling embedded data personnel in product teams to work efficiently. This collaborative model streamlines data transformation and enhances data accessibility.
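The idea of a data contract defining the nature, completeness, and format of shared data can be made concrete as executable checks: the producing team declares field names, types, and nullability, and consumers validate records against that declaration. All field names and the check logic here are hypothetical, not any specific contract tool's API.

```python
# Hypothetical contract: field -> (expected type, nullable?)
CONTRACT = {
    "order_id": (int, False),
    "email": (str, True),
}

def violations(record):
    """Return a list of contract violations for one record."""
    problems = []
    for field, (typ, nullable) in CONTRACT.items():
        if field not in record:
            problems.append(f"missing {field}")
        elif record[field] is None:
            if not nullable:
                problems.append(f"null {field}")
        elif not isinstance(record[field], typ):
            problems.append(f"bad type for {field}")
    return problems
```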
95 implied HN points 01 Nov 23
  1. Having a standard interface for semantic layers is crucial to prevent failure and ensure compatibility among different layers.
  2. SQL APIs offered by semantic layers may not be truly SQL, leading to potential confusion and challenges in querying data.
  3. Supporting REST HTTP interfaces for semantic layers enables a broader range of use cases, including data applications for internal and external purposes.
143 implied HN points 24 May 23
  1. Leaders may face difficult decisions when letting go of team members, even if it's the right thing to do.
  2. Communication and managing expectations are crucial in the process of letting go of team members.
  3. It is important to prioritize hiring the right people to avoid the challenging situation of having to let someone go.
107 implied HN points 26 Jul 23
  1. The modern data stack is evolving with new tools and options for data architecture.
  2. Key trends include the focus on data ingestion and telemetry, improved orchestration tools, and advancements in compute engines.
  3. Data consumption is being enhanced through self-serve AI capabilities, BI tools, and free-form analyst tools, all sitting on a semantic layer.
47 implied HN points 23 Feb 24
  1. Real-time data streaming from databases like MySQL to data warehouses such as Snowflake can significantly reduce analytics latency, making data processing faster and more efficient.
  2. Streamkap offers a cost-effective solution for streaming ETL, promising to be both faster and more affordable than traditional methods like Fivetran, providing a valuable option for data professionals.
  3. Implementing Streamkap in data architectures can lead to substantial improvements, such as reducing data update lag to under 5 minutes and delivering real-time analytics value for customers, showcasing the impact of cutting-edge data technology.
95 implied HN points 07 Jun 23
  1. Individual Contributor roles in technology allow technically skilled individuals to advance without moving into management.
  2. Specialized IC roles, like Staff or Principal, are crucial for making better technical decisions and preventing engineering issues.
  3. Having fewer hard-to-hire line managers and more experienced ICs can lead to better support and scaling in technical teams.
107 implied HN points 29 Mar 23
  1. Semantic layers reduce repetitive code by providing a consistent framework for queries.
  2. Semantic layers enhance data security by controlling access and reducing accidental exposure of sensitive data.
  3. A semantic layer defines entities and structures, while a metrics layer is a narrower subset focused mainly on defining metrics.
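Point 1's claim about repetitive code can be illustrated with a toy sketch: the metric is defined once and queries against it are generated, so every team computes "revenue" the same way instead of re-writing (and diverging on) the aggregation. The metric names and the generated SQL shape are purely illustrative, not any real semantic layer's API.

```python
# One shared definition of each metric's aggregation expression.
METRICS = {"revenue": "SUM(amount)"}

def query(metric, dimension, table):
    """Generate a grouped query from the shared metric definition."""
    expr = METRICS[metric]
    return (f"SELECT {dimension}, {expr} AS {metric} "
            f"FROM {table} GROUP BY {dimension}")

sql = query("revenue", "region", "orders")
```

Centralizing the definition is also what enables the security point above: access rules can be attached to the metric or entity once, rather than re-implemented in every hand-written query.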