The hottest Data Transformation Substack posts right now

And their main takeaways
Category: Top Technology Topics
The Orchestra Data Leadership Newsletter · 39 implied HN points · 18 Apr 24
  1. Advantages of running dbt-core on GitHub Actions include easy workflow definition in Git, immediate access to the latest code, and no need to provision instances when using GitHub-hosted runners.
  2. Disadvantages include being constrained by GitHub-hosted runners' capacity, a 'fire and forget' execution model, and extra overhead when connecting to external services.
  3. GitHub Actions workflows can be triggered from external sources such as orchestrators via the repository_dispatch or workflow_dispatch events, which makes it possible to fold GitHub's CI/CD capabilities into larger automation strategies (see the sketch below).
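To make the triggering mechanism concrete, here is a minimal Python sketch of an external orchestrator firing the workflow_dispatch event through the GitHub REST API; the repository name, workflow file name, and input names are hypothetical placeholders, not values from the post.

```python
# Minimal sketch: triggering a dbt-core GitHub Actions workflow from an
# external orchestrator via the workflow_dispatch REST endpoint.
# "my-org/analytics", "dbt_run.yml", and the input names are placeholders.
import os
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]   # token with permission to run workflows
REPO = "my-org/analytics"                   # hypothetical owner/repo
WORKFLOW_FILE = "dbt_run.yml"               # hypothetical workflow file in .github/workflows/


def trigger_dbt_workflow(ref: str = "main", target: str = "prod") -> None:
    """Fire the workflow_dispatch event; GitHub responds 204 with no body."""
    url = f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches"
    resp = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {GITHUB_TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        # inputs must be declared under workflow_dispatch in the workflow file
        json={"ref": ref, "inputs": {"dbt_target": target}},
        timeout=30,
    )
    resp.raise_for_status()


trigger_dbt_workflow(ref="main", target="prod")
```

The dispatch endpoint returns 204 without a run identifier, which is exactly the 'fire and forget' behaviour noted above: an orchestrator that wants to track the run has to poll the workflow-runs API separately.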
davidj.substack · 71 implied HN points · 16 Feb 24
  1. Data teams face challenges when separated from product engineering, leading to loss of metadata and concerns about data quality. Data contracts can help address these issues by defining the nature, completeness, and format of shared data (a minimal contract sketch follows this list).
  2. Integrating data professionals within product teams can enhance understanding and usage of data, reducing the need for separate contracts. This approach allows for direct-to-consumer, organic data processes.
  3. Centralized data platform teams can establish common standards and infrastructure, enabling embedded data personnel in product teams to work efficiently. This collaborative model streamlines data transformation and enhances data accessibility.
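As an illustration of the first point, here is a minimal sketch of a data contract enforced in code, assuming pydantic is available; the OrderEvent fields and the validation flow are hypothetical examples, not taken from the post.

```python
# Minimal sketch of a data contract enforced in code (assumes pydantic).
# The OrderEvent fields are illustrative, not from the article.
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ValidationError


class OrderEvent(BaseModel):
    """Contract for order events a product team shares with the data team."""
    order_id: str                        # nature: stable business key, never null
    amount_cents: int                    # format: integer minor units, not floats
    currency: str                        # format: ISO 4217 code, e.g. "USD"
    placed_at: datetime                  # completeness: event timestamp is required
    coupon_code: Optional[str] = None    # explicitly optional, so absence is not a quality issue


def validate_payload(payload: dict) -> Optional[OrderEvent]:
    """Reject records that break the contract instead of silently loading them."""
    try:
        return OrderEvent(**payload)
    except ValidationError as exc:
        print(f"Contract violation: {exc}")
        return None


validate_payload({"order_id": "o-1", "amount_cents": 1299,
                  "currency": "USD", "placed_at": "2024-02-16T10:00:00Z"})
```

The same schema can be published by a central platform team as a shared standard, which is the collaborative model the third takeaway describes.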
The Orchestra Data Leadership Newsletter · 39 implied HN points · 09 Jan 24
  1. The article discusses building a data release pipeline to analyze HubSpot data using Coalesce, a no-code ELT tool on Snowflake.
  2. One key issue was HubSpot's data model, which made it difficult to consolidate form-fill data and messages into a meaningful view.
  3. Setting up Coalesce involves defining storage mappings, granting access to Coalesce users, and carefully handling environments so that development work does not overwrite production data.
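For the access-granting step, a hedged sketch of what the Snowflake side might look like, using the Snowflake Python connector; the warehouse, database, schema, and role names are placeholders rather than the article's actual setup, and the point is that the dev role only ever sees dev storage.

```python
# Minimal sketch, not the article's setup: granting a hypothetical
# COALESCE_DEV role access to a dev database only, so Coalesce environments
# cannot overwrite production objects. Uses the Snowflake Python connector.
import os
import snowflake.connector

GRANTS = [
    "GRANT USAGE ON WAREHOUSE TRANSFORM_WH TO ROLE COALESCE_DEV",
    "GRANT USAGE ON DATABASE ANALYTICS_DEV TO ROLE COALESCE_DEV",
    "GRANT USAGE, CREATE TABLE, CREATE VIEW ON SCHEMA ANALYTICS_DEV.HUBSPOT TO ROLE COALESCE_DEV",
    # Deliberately no grants on the production database: the dev environment's
    # storage mapping points only at ANALYTICS_DEV.
]

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    role="SECURITYADMIN",  # a role allowed to manage grants
)
try:
    cur = conn.cursor()
    for stmt in GRANTS:
        cur.execute(stmt)
finally:
    conn.close()
```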
The Orchestra Data Leadership Newsletter · 19 implied HN points · 26 Nov 23
  1. Data can be structured in a hierarchy similar to Maslow's Hierarchy of Needs, where each level is necessary for the enjoyment of the level above it. This concept applies to data engineering pipelines.
  2. Data pipelines are crucial for deriving business value, even if they are complex and not directly visible. Architectural considerations and infrastructure choices play a significant role in making data a priority in a business.
  3. When considering data infrastructure, such as data ingestion tools, cloud warehouses, BI tools, and others, it's important to plan the entire stack and not just jump to specific infrastructure. Consider aspects like version control, security, integration, and orchestration.
Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots · 19 implied HN points · 06 Apr 23
  1. Visual Programming tools are being used to connect prompts in applications, making it easier to create conversational interfaces.
  2. Chaining prompts involves transforming and organizing the data returned by one prompt before feeding it into the next, improving the final output and decision-making in AI applications (see the sketch after this list).
  3. Good design of these tools includes making it easy to build, edit, and debug chains while also allowing users to interact flexibly with the AI.
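A minimal, tool-agnostic Python sketch of the chaining idea: one prompt's response is parsed and normalised before it shapes the next prompt. The call_llm function is a stand-in for whichever model API a given visual tool wraps, not a real library call.

```python
# Minimal sketch of prompt chaining: the first prompt's raw response is
# transformed (parsed, trimmed) before being injected into the second prompt.
import json


def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError


def chain(user_question: str) -> str:
    # Step 1: extract structured facts from the user's message.
    extraction_prompt = (
        "Extract the product name and the customer's intent from this message "
        f"as JSON with keys 'product' and 'intent':\n{user_question}"
    )
    raw = call_llm(extraction_prompt)

    # Transform: parse and normalise the intermediate output before reuse.
    facts = json.loads(raw)
    product = facts.get("product", "unknown product").strip()
    intent = facts.get("intent", "general question").strip().lower()

    # Step 2: the transformed data shapes the next prompt in the chain.
    answer_prompt = (
        f"You are a support assistant. The customer asked about '{product}' "
        f"with intent '{intent}'. Draft a short, helpful reply."
    )
    return call_llm(answer_prompt)
```

The debugging point in the last takeaway follows naturally: because each step's transformed output is an explicit value, a good tool can surface and edit it between nodes in the chain.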
Simplicity is SOTA · 2 HN points · 27 Mar 23
  1. The concept of 'embedding' in machine learning has evolved and become widely used, replacing terms like vectors and representations.
  2. Embeddings can be applied to various types of data, come from different layers in a neural network, and are not always about reducing dimensions.
  3. Defining 'embedding' has become challenging due to its widespread use, but the essence is about learned transformations that make data more useful.
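A short PyTorch sketch of the second point, offered as an assumption-laden illustration rather than anything from the post: one "embedding" comes from a learned lookup table, another from an intermediate layer of an ordinary network, and the latter increases dimensionality rather than reducing it.

```python
# Minimal sketch, assuming PyTorch: two things people now call "embeddings".
# Neither model is trained here; the point is only where the vectors come from.
import torch
import torch.nn as nn

# 1. A classic learned lookup table: token id -> dense vector.
token_embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=64)
token_ids = torch.tensor([12, 407, 9001])
lookup_vectors = token_embedding(token_ids)          # shape (3, 64)

# 2. An intermediate layer of any network can also serve as an embedding,
#    and it may *increase* dimensionality rather than reduce it.
encoder = nn.Sequential(
    nn.Linear(8, 128),   # 8 raw features -> 128-dim hidden representation
    nn.ReLU(),
    nn.Linear(128, 2),   # final task head (e.g. binary classification)
)
raw_features = torch.randn(5, 8)
hidden_embedding = encoder[:2](raw_features)         # activations after the first two layers
print(lookup_vectors.shape, hidden_embedding.shape)  # (3, 64) and (5, 128)
```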