The Orchestra Data Leadership Newsletter

The Orchestra Data Leadership Newsletter focuses on insights and methodologies in data management and leadership, emphasizing data orchestration, governance, and the integration of emerging technologies like AI and machine learning. It covers practical guides, industry trends, and tool evaluations aimed at enhancing the effectiveness of data teams.

Topics: Data Governance, Data Orchestration, Machine Learning and AI, Data Engineering Practices, Data Product Management, Data Quality, Web Scraping, Data Architecture, Cloud Infrastructure, Data Tools Evaluation, Data Pipelines, Data Strategy, Generative AI in Data Engineering

The hottest Substack posts of The Orchestra Data Leadership Newsletter

And their main takeaways
79 implied HN points 14 May 24
  1. Artificial Intelligence is revolutionizing web scraping, accelerating development and driving wider adoption of scraping use cases across data teams.
  2. The complexity of parsing HTML and the usual web-scraping challenges, such as changing schemas, long run times, and legality, can be mitigated with AI-enabled tools.
  3. AI-enabled web scraping tools like Nimble and Diffbot provide reliable solutions for efficiently extracting data from the internet and handling challenges like managing proxies and optimizing scraping speed.
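As a rough illustration of the pattern such tools automate, here is a minimal sketch that pulls a page and asks an LLM to return structured fields; the model name and the target fields are placeholder assumptions, and products like Nimble or Diffbot additionally handle proxies, rendering, and retries for you.

```python
# Minimal sketch: LLM-assisted extraction from raw HTML.
# Assumes the requests, beautifulsoup4 and openai packages and an OPENAI_API_KEY
# environment variable; the model name and target fields are illustrative placeholders.
import json
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()

def scrape_product(url: str) -> dict:
    html = requests.get(url, timeout=30).text
    # Strip markup so the prompt stays small and robust to layout changes.
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)[:8000]
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Extract the requested fields and reply with JSON only."},
            {"role": "user", "content": f"Return name, price and currency from this page:\n{text}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```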
79 implied HN points 23 Apr 24
  1. Alerting and governance are crucial to the success of Data and AI initiatives, as underlined by the high failure rate of AI projects and the number of data science projects that never reach production.
  2. Building trust between Data Teams and Business Stakeholders is essential, and alerting plays a key role in this by ensuring effective communication and collaboration during data pipeline failures.
  3. Effective alerting systems should be proactive, asset-based, and granular, allowing for quick detection and communication of issues to build trust and reliability in Data and AI products.
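A minimal sketch of what "proactive, asset-based" alerting can look like in practice: each data asset carries its own freshness expectation and owner, and a breach posts straight to the stakeholders' channel. The Slack webhook URL, asset names, and thresholds below are placeholder assumptions, not a specific product's API.

```python
# Minimal sketch: asset-based freshness alerts pushed to a Slack channel.
# The webhook URL, asset names, and thresholds are illustrative placeholders.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

@dataclass
class Asset:
    name: str
    last_updated: datetime
    max_staleness: timedelta
    owner_channel: str  # who needs to know, per asset

def check_and_alert(assets: list[Asset]) -> None:
    now = datetime.now(timezone.utc)
    for asset in assets:
        if now - asset.last_updated > asset.max_staleness:
            requests.post(SLACK_WEBHOOK, json={
                "text": f":warning: {asset.name} is stale "
                        f"(last updated {asset.last_updated:%Y-%m-%d %H:%M} UTC). "
                        f"cc {asset.owner_channel}",
            })
```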
59 implied HN points 29 Apr 24
  1. Ensure rock-solid infrastructure for your Snowflake implementation to prevent pipeline failures and maintain data quality.
  2. Set clear expectations and prioritize projects to manage scope and quality, fostering trust and collaboration.
  3. Start thinking of data as a product during the Snowflake implementation to minimize costs, stabilize usage, and accelerate trust in the data team.
79 implied HN points 28 Mar 24
  1. A detailed guide walks through running dbt Core in production on AWS ECS, focusing on cost-effective and reliable execution.
  2. Running dbt in production is not especially compute-intensive: dbt mainly acts as an orchestrator, pushing the heavy lifting to the warehouse, which makes it cheaper to host than Python code that consumes compute directly.
  3. By setting up dbt Core on ECS in AWS and using Orchestra, you can achieve a scalable, cost-effective solution for self-hosting dbt Core with full visibility and control.
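A minimal sketch of the "kick off dbt on ECS" step using boto3 and a Fargate task; the cluster, task definition, subnet, security group, and container names are placeholders, and the full guide also covers building the dbt image and wiring Orchestra in as the trigger.

```python
# Minimal sketch: trigger a containerised `dbt build` as an ECS Fargate task.
# Cluster, task definition, subnet, security group, and container names are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

response = ecs.run_task(
    cluster="data-platform",                # placeholder cluster
    taskDefinition="dbt-core:3",            # placeholder task definition revision
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
    overrides={
        "containerOverrides": [
            {"name": "dbt", "command": ["dbt", "build", "--target", "prod"]}
        ]
    },
)
print(response["tasks"][0]["taskArn"])
```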
79 implied HN points 21 Mar 24
  1. Organizations risk losing control of their data through a lack of focus on data quality and a tendency to overlook data as a value driver.
  2. Large Language Models (LLMs) can improve data quality control and help automate tasks effectively when given sufficient context.
  3. Before implementing LLMs, organizations should prioritize data cleaning, auditing, and defining valuable datasets.
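As a rough sketch of the "LLMs with context" idea, the snippet below hands a column profile plus a few sample rows to a model and asks for suspected quality issues; the model name and prompt are assumptions, and the takeaway above still applies: clean, audit, and scope your datasets first.

```python
# Minimal sketch: ask an LLM to flag likely data-quality issues given context.
# Assumes the openai and pandas packages; the model name is a placeholder.
import pandas as pd
from openai import OpenAI

client = OpenAI()

def suggest_quality_issues(df: pd.DataFrame, description: str) -> str:
    context = (
        f"Dataset description: {description}\n"
        f"Column summary:\n{df.describe(include='all').to_string()}\n"
        f"Sample rows:\n{df.head(5).to_string()}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "You review tabular data for quality problems."},
            {"role": "user", "content": f"List likely data quality issues in this dataset:\n{context}"},
        ],
    )
    return response.choices[0].message.content
```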
79 implied HN points 18 Mar 24
  1. CEOs are moving away from hiring full data teams and are opting for small consultancies to set up their data stack, reducing risk and cost.
  2. One-person data teams in startups face overwhelming responsibilities, leading to chaos and potentially costly decisions.
  3. New technologies like Orchestra help single-person data teams maintain visibility and orchestration without expensive tools, accelerating the data value businesses receive.
39 implied HN points 21 May 24
  1. Web scraping with AI can enhance intelligence gathering by efficiently collecting and processing data from various public sources on the internet.
  2. Leveraging Large Language Models (LLMs) can improve the accuracy and robustness of web scraping systems when dealing with changes in HTML code structure.
  3. Using tools like Nimble for web scraping allows for more efficient and accurate data collection by training models on different types of websites for specific use cases.
99 implied HN points 07 Feb 24
  1. Effective data governance requires incorporating preventive measures within data orchestration layers.
  2. Current data governance tools predominantly offer post-action analytics rather than proactive preventive measures.
  3. By integrating role-based access control and monitoring in the orchestration layer, organizations can shift to a more proactive data governance approach.
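A minimal sketch of pushing role-based access control into the orchestration layer: the check runs before a task is scheduled rather than being audited after the fact. The role names and task registry below are illustrative assumptions.

```python
# Minimal sketch: enforce RBAC before a pipeline task runs, instead of
# reporting violations after the fact. Roles and tasks are placeholders.
from functools import wraps

ROLE_PERMISSIONS = {
    "analytics_engineer": {"run_dbt_models", "refresh_dashboards"},
    "data_platform_admin": {"run_dbt_models", "refresh_dashboards", "drop_staging_tables"},
}

class PermissionDenied(Exception):
    pass

def requires_permission(permission: str):
    def decorator(task_fn):
        @wraps(task_fn)
        def wrapper(*args, role: str, **kwargs):
            if permission not in ROLE_PERMISSIONS.get(role, set()):
                raise PermissionDenied(f"role '{role}' may not '{permission}'")
            return task_fn(*args, **kwargs)
        return wrapper
    return decorator

@requires_permission("drop_staging_tables")
def drop_staging_tables():
    print("dropping staging tables...")

drop_staging_tables(role="data_platform_admin")   # allowed
# drop_staging_tables(role="analytics_engineer")  # raises PermissionDenied
```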
79 implied HN points 25 Feb 24
  1. ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) have been key data engineering paradigms, but with the rise of the cloud, the need for in-transit data transformation has decreased.
  2. Fivetran, a widely known data company, is potentially shifting back to ETL methods by offering pre-built transformation features, effectively simplifying the data modeling process for users.
  3. There are signs of a resurgence of ETL practices in the data industry, with companies like Fivetran positioned to lead the way by providing ETL-like services within their platforms.
39 implied HN points 04 May 24
  1. Data Teams still prefer classic open source tools over workflow orchestration functionality on Data and AI platforms.
  2. The Data Orchestration category might be fading as orchestration becomes embedded in other platforms and pricing becomes a concern.
  3. A robust system of control and management for data and AI pipelines is vital, encompassing aspects like alerting, lineage, metadata, infrastructure, and multi-tenancy support.
79 implied HN points 17 Feb 24
  1. The choice between microservices and monolithic architectures in data impacts the tools and solutions you choose.
  2. Microservices allow for distributed infrastructure, specialization, and easier scaling in data architecture.
  3. Assumptions about high interoperability, governance, and acceptable data egress and storage costs are key considerations when opting for a microservices approach.
59 implied HN points 20 Mar 24
  1. Apache Iceberg enables a Bring Your Own Storage (BYOS) pattern, which is gaining popularity for efficient and reliable data management in distributed environments.
  2. Key features of Apache Iceberg include Atomic Transactions, Schema Evolution, Partitioning and Sorting, Time Travel, Incremental Data Updates, Metadata Management, and Compatibility with various data processing frameworks.
  3. Platforms like Snowflake are shifting towards supporting Iceberg due to its benefits in handling data efficiently and enabling a Bring Your Own Storage pattern.
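A minimal sketch of two of those features, schema evolution and time travel, through the pyiceberg client; the catalog and table names are illustrative assumptions, and the same operations are also exposed via Spark and Snowflake's Iceberg support.

```python
# Minimal sketch: schema evolution and time travel on an Iceberg table via pyiceberg.
# Catalog and table names are placeholders; catalog config lives in .pyiceberg.yaml.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog("default")                 # placeholder catalog
table = catalog.load_table("analytics.events")    # placeholder table

# Schema evolution: add a column without rewriting existing data files.
with table.update_schema() as update:
    update.add_column("referrer", StringType())

# Time travel: read the table as of an earlier snapshot.
snapshots = table.history()
if len(snapshots) > 1:
    earlier = snapshots[-2].snapshot_id
    old_rows = table.scan(snapshot_id=earlier).to_arrow()
    print(old_rows.num_rows)
```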
39 implied HN points 18 Apr 24
  1. Advantages of running dbt-core on GitHub Actions include easy workflow definition in Git, immediate access to latest code, and no need to provision instances for GitHub hosted runners.
  2. Disadvantages of running dbt-core on GitHub Actions include being limited by GitHub's workers, 'fire and forget' implementation, and overhead when connecting to external services.
  3. GitHub Actions workflows can be triggered from external sources like orchestrators using the repository_dispatch or workflow_dispatch events, providing flexibility in integrating GitHub's CI/CD capabilities into larger automation strategies.
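A minimal sketch of that external-trigger pattern: an orchestrator calling the GitHub REST API's workflow_dispatch endpoint for a dbt workflow. The repository, workflow file name, inputs, and token variable are placeholders.

```python
# Minimal sketch: trigger a dbt GitHub Actions workflow from an orchestrator
# via the workflow_dispatch REST endpoint. Repo, workflow file, inputs, and token are placeholders.
import os
import requests

OWNER, REPO = "my-org", "dbt-models"          # placeholder repository
WORKFLOW_FILE = "dbt_run.yml"                 # placeholder workflow file name

response = requests.post(
    f"https://api.github.com/repos/{OWNER}/{REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={"ref": "main", "inputs": {"target": "prod"}},  # inputs must exist in the workflow
    timeout=30,
)
response.raise_for_status()  # the endpoint returns 204 No Content on success
```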
59 implied HN points 28 Feb 24
  1. Orchestra serves as a comprehensive Data Control Panel, bridging orchestration and observability. It offers a Control Panel for Data Teams that stands out from other tools focused solely on orchestration or observability.
  2. Orchestra integrates Git-control with a user-friendly interface and advanced scheduler functionalities, setting itself apart from open-source tools. It provides more granularity in monitoring and failure insights.
  3. Orchestra focuses on providing a unified platform for data orchestration, observability, and operations, standing out by offering full observability, end-to-end asset-based lineage, powerful UI, hosted infrastructure, fixed pricing, and out-of-the-box integrations.
79 implied HN points 26 Nov 23
  1. Data catalogs are not just for enterprises but also benefit startups by driving business value.
  2. Data catalogs help organizations manage and present their data assets in a user-friendly way for better adoption and value extraction.
  3. Using data catalogs can simplify data access, encourage collaboration between technical and business users, and potentially enhance BI functionalities within organizations.
59 implied HN points 02 Jan 24
  1. Vendor lock-in is an assessment of present gain versus future risk in the world of data, software, and cloud services.
  2. Key considerations include migration risk, migration cost, and pricing cost when assessing vendor lock-in.
  3. Factors like data portability, integration, service and support, and community strength play a significant role in evaluating vendor lock-in risks when choosing a SaaS provider.
39 implied HN points 28 Jan 24
  1. Data orchestration is often confused with workflow orchestration, but it involves more than just triggering and monitoring tasks; it includes reliably and efficiently moving data into production.
  2. Reliably and efficiently releasing data into production is complex and involves elements like data movement, transformation, environment management, role-based access control, and data observability.
  3. Implementing end-to-end and holistic data orchestration offers transformative benefits such as intelligent metadata gathering, data lineage, environment management, data product enablement, and cross-functional collaboration for scalable data operations.
39 implied HN points 12 Jan 24
  1. Building data stacks for businesses involves using core software like Snowflake and Databricks, focusing on delivering business value efficiently.
  2. The recommended tools include DIY cloud solutions for streaming, Snowflake for transformations, and BigQuery or Snowflake for storage/warehouse needs.
  3. A comprehensive tool like Orchestra can handle end-to-end data pipeline management without requiring a large data team, keeping the solution cost-effective.
39 implied HN points 09 Jan 24
  1. The article discusses building a data release pipeline to analyze Hubspot data using Coalesce, a no-code ELT tool on Snowflake.
  2. One key issue encountered was the challenges with Hubspot's data model when trying to consolidate form fill data and messages into a meaningful view.
  3. Setting up Coalesce involves defining storage mappings, granting access to Coalesce users, and carefully handling environments to prevent data overwriting when working between development and production.
19 implied HN points 05 Apr 24
  1. Generative AI can help Data Engineers summarize vast quantities of structured and unstructured data, expanding the breadth and depth of data available.
  2. Feature engineering using Generative AI involves ingesting unstructured data like call notes, making API calls, and transforming the data for analysis in existing pipelines.
  3. Utilizing Generative AI for webscraping can help teams extract information efficiently from the internet, enabling monitoring of new data sources and optimizing business processes.
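A minimal sketch of the feature-engineering step described above: unstructured call notes go out to a model, structured features come back and join the existing tabular pipeline. The model name, feature set, and prompt are assumptions for illustration.

```python
# Minimal sketch: derive structured features from unstructured call notes with an LLM,
# then merge them into an existing tabular pipeline. Model and fields are placeholders.
import json
import pandas as pd
from openai import OpenAI

client = OpenAI()

def featurise_call_note(note: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Reply with JSON only."},
            {"role": "user", "content": (
                "From this call note return sentiment (positive/neutral/negative) "
                f"and churn_risk (low/medium/high):\n{note}"
            )},
        ],
    )
    return json.loads(response.choices[0].message.content)

calls = pd.DataFrame({
    "account_id": [1, 2],
    "note": ["Very happy with the rollout", "Asked about cancellation terms"],
})
features = calls["note"].apply(featurise_call_note).apply(pd.Series)
calls = pd.concat([calls, features], axis=1)  # features now flow into the existing pipeline
```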
39 implied HN points 30 Dec 23
  1. Data teams are increasingly turning to low-code solutions to streamline data release pipelines, utilizing tools like Airflow but questioning the need for extensive code writing and infrastructure maintenance.
  2. The complex cloud environment has led to the development of specialized data tools, making the orchestration of data pipelines challenging and highlighting the importance of governance, data quality, and scalability.
  3. Low-code-friendly tools like dbt Core and Hightouch are already integrated into many data tools, simplifying orchestration and suggesting that the future of data architecture may combine workflow orchestrators with efficient data quality checks.
39 implied HN points 19 Dec 23
  1. Column-level lineage tools were popular in 2021 but might be replaced by AI for debugging data pipelines more efficiently.
  2. AI models like GPT can quickly pinpoint reasons for test failures and offer actionable insights beyond what traditional lineage tools provide.
  3. Services integrating AI with metadata can give better visibility and accurate debugging solutions for data and analytics engineers compared to column-level lineage tools.
19 implied HN points 12 Mar 24
  1. Understanding the pricing of data orchestration tools is crucial for managing costs efficiently in data pipelines.
  2. Consider the trade-offs between self-hosted open-source options like Airflow, Prefect, Dagster, Mage, and managed services like MWAA, Cloud Composer, Astronomer, Prefect Cloud, and Dagster Cloud.
  3. Orchestra offers fixed pricing based on the number of pipelines and tasks, providing certainty in costs, potential savings, and efficiency gains for data teams.
19 implied HN points 07 Mar 24
  1. Orchestra is launching a free tier for its platform for building and monitoring data and AI products, offering a lightweight way to improve business value and AI integration.
  2. Addressing the challenges faced by data teams in balancing business value and software engineering best practices through tools like Nessie, dbt, and emerging 'as-code' BI platforms.
  3. Providing an end-to-end platform with features like declarative pipelines, data quality monitoring, granular alert control, and asset-based data lineage to empower data teams in accelerating their initiatives.
19 implied HN points 23 Jan 24
  1. The data market is not consolidating; it's expanding with many players offering differentiated products and little consolidation happening.
  2. There is a growing complexity in data operations, leading to the necessity of more specialized tools rather than all-in-one platforms.
  3. The future of the data market may see a trend towards out-of-the-box connectivity to address the increasing complexity and interoperability challenges faced by data teams.
19 implied HN points 26 Nov 23
  1. Data can be structured in a hierarchy similar to Maslow's Hierarchy of Needs, where each level is necessary for the enjoyment of the level above it. This concept applies to data engineering pipelines.
  2. Data pipelines are crucial for deriving business value, even if they are complex and not directly visible. Architectural considerations and infrastructure choices play a significant role in making data a priority in a business.
  3. When considering data infrastructure, such as data ingestion tools, cloud warehouses, BI tools, and others, it's important to plan the entire stack and not just jump to specific infrastructure. Consider aspects like version control, security, integration, and orchestration.
19 implied HN points 16 Nov 23
  1. SQL is a powerful data manipulation tool that has different dialects and evolved over time to fit various database software needs.
  2. New SQL tools like dbt, SQLMesh, and Semantic Data Fabric aim to improve data testing, quality, and governance in data engineering processes.
  3. The value in data engineering lies more in processes, culture, and diligence, rather than solely relying on fancy tools to prevent mistakes.
19 implied HN points 13 Nov 23
  1. Zero ELT aims to streamline data processing by eliminating traditional extraction, loading, and transformation tools.
  2. Zero ELT tools are evolving to specialize by use case rather than by function, creating a trade-off between stack complexity and having the best tool for the job.
  3. Zero ELT tools, while promising in simplifying processes, may create data silos, lack interoperability with other tools, and bring about stack complexity issues.
19 implied HN points 05 Nov 23
  1. Consider data contracts if your internal data changes often to ensure collaboration between software engineering and data engineering teams.
  2. If you have important metrics that depend on software engineering actions, like defining 'Active Users,' data contracts can help maintain data quality.
  3. In cases where software engineering and data engineering roles overlap, implementing data contracts can streamline data ingestion processes and improve data quality.
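A minimal sketch of a data contract enforced at ingestion time with pydantic; the event shape and the "Active Users" field are illustrative, and in practice the schema would be versioned and agreed with the producing service team.

```python
# Minimal sketch: a data contract validated at ingestion time with pydantic.
# The event fields are illustrative; real contracts are agreed with the producing team.
from datetime import datetime
from pydantic import BaseModel, ValidationError

class UserActivityEvent(BaseModel):
    user_id: int
    event_type: str            # e.g. "login", "page_view"
    occurred_at: datetime
    is_active_user: bool       # the field downstream "Active Users" metrics depend on

def ingest(raw_events: list[dict]) -> list[UserActivityEvent]:
    valid, rejected = [], []
    for raw in raw_events:
        try:
            valid.append(UserActivityEvent(**raw))
        except ValidationError as exc:
            rejected.append((raw, str(exc)))  # surface contract breaks to the producers
    if rejected:
        print(f"{len(rejected)} events violated the contract")
    return valid
```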
19 implied HN points 27 Oct 23
  1. Data Mesh is a decentralized approach to enterprise data management, focusing on distributed datasets and data ownership within domains.
  2. dbt Mesh is a set of features that allows multiple teams to work on dbt projects with less friction, enabling separate repositories and orchestration capabilities.
  3. Having separate dbt jobs run across projects on a schedule is limited, requiring external workflow orchestration tools for more flexibility.
19 implied HN points 26 Oct 23
  1. The Snowflake Clone command allows for cheap and quick testing of data during Continuous Integration flows, showing significant cost and time improvement compared to traditional create table commands.
  2. Continuous Deployment can be facilitated through Snowflake clones, as they are relatively inexpensive, removing cost barriers for Data Teams in implementing effective CD processes.
  3. The Clone command, available since at least 2016, is invaluable for Data Teams as it enables CI/CD pipelines, essential for deploying data into production reliably and efficiently via a data release pipeline.
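A minimal sketch of that CI step with the Snowflake Python connector: clone production zero-copy, run checks against the clone, then drop it. The database names, schema, and connection details are placeholders.

```python
# Minimal sketch: zero-copy clone a production database for a CI run, then drop it.
# Connection parameters, database names, and the check query are placeholders.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    role="CI_ROLE",
)
cur = conn.cursor()
ci_db = f"ANALYTICS_CI_{os.environ.get('GITHUB_RUN_ID', 'LOCAL')}"

try:
    # Metadata-only operation: no data is copied, so this is fast and cheap.
    cur.execute(f"CREATE DATABASE {ci_db} CLONE ANALYTICS_PROD")
    cur.execute(f"SELECT COUNT(*) FROM {ci_db}.MARTS.FCT_ORDERS")  # placeholder check
    print("row count in clone:", cur.fetchone()[0])
finally:
    cur.execute(f"DROP DATABASE IF EXISTS {ci_db}")
    conn.close()
```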
19 implied HN points 22 Oct 23
  1. Understanding basic CI / CD for Python code in a Data Engineering context is crucial for Data Engineering Leaders.
  2. For unit tests, use pytest to ensure functions work correctly, and for integration tests, test connections to third-party APIs.
  3. Implementing CI / CD involves writing code, testing and linting locally, and then deploying to a merge environment to ensure code compatibility.
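A minimal sketch of the unit/integration split described above, runnable with pytest; the transformation function and the API health check are illustrative placeholders.

```python
# Minimal sketch: a unit test for a pure transformation and an integration test
# for a third-party API connection, both runnable with `pytest`.
# The function under test and the API endpoint are illustrative placeholders.
import requests

def deduplicate_ids(ids: list[int]) -> list[int]:
    """Toy transformation under test: preserve order, drop duplicates."""
    seen, result = set(), []
    for i in ids:
        if i not in seen:
            seen.add(i)
            result.append(i)
    return result

def test_deduplicate_ids_unit():
    assert deduplicate_ids([3, 1, 3, 2, 1]) == [3, 1, 2]

def test_third_party_api_integration():
    # Integration test: verify the connection to an external service is healthy.
    response = requests.get("https://api.github.com/zen", timeout=10)  # placeholder endpoint
    assert response.status_code == 200
```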
1 HN point 29 May 24
  1. Understanding the total cost of ownership is crucial when choosing between open-source and managed data architectures.
  2. Leveraging open-source software can offer cost benefits, but it also comes with risks like lack of support and high maintenance requirements.
  3. Using managed data architecture tools like Rivery and Orchestra can minimize total cost of ownership, provide scalability, and offer simplicity in maintaining data operations.
0 implied HN points 15 Apr 24
  1. Sridhar Ramaswamy takes over as Snowflake's CEO, bringing a fresh perspective after Frank Slootman's departure.
  2. Snowflake is consolidating the 'Data Plane' within their platform, offering features like anomaly detection and data quality testing.
  3. Snowflake aims to democratize AI, providing easy access to AI services using data within the Snowflake platform.
0 implied HN points 13 Oct 23
  1. Not all open source software is equal; some may have hidden dependencies and limitations.
  2. Open source software is like a public good, free for all to use, and can benefit society by encouraging contributions for the greater good.
  3. Open-core projects, although open-source to an extent, operate with a profit motive by offering certain features as paid, leading to potential vendor lock-in and disappointment for users.
0 implied HN points 08 Oct 23
  1. Understanding the architectural structure of data lakes is crucial for data leaders to make informed decisions on data storage.
  2. File formats play a significant role in data storage efficiency, querying capabilities, and overall costs in a data lake architecture.
  3. Choosing between data lake providers or data warehouses can be complex due to the influence of underlying technologies, like object stores and file formats.
0 implied HN points 15 Dec 23
  1. Unstructured data, like text documents and deeply nested JSON, is a crucial component in data processing for large cloud vendors like Snowflake and Databricks. The location where unstructured data is processed within the data pipeline greatly impacts the compute costs and revenue for these companies.
  2. Processing unstructured data involves a series of stages, from data movement to storage in object storage, then to structured data warehouses. Each stage of this 'funnel' affects computational requirements and costs, with the most logical point for processing unstructured data being at the object storage level.
  3. The final step in the data funnel, data activation, involves the least computational demands as it deals with cleaned and aggregated data ready for analytical applications. Thinking strategically about the processing location of unstructured data can help optimize costs and efficiency in data workflows.
0 implied HN points 05 Dec 23
  1. The ETLP paradigm integrates Airbyte with dbt and Orchestra for quick end-to-end data pipelines without coding.
  2. Using a fully managed deployment approach with tools like Airbyte, dbt, and Orchestra can save time and effort compared to self-managed solutions.
  3. For a data product with 10GB data, costs for Airbyte, dbt, and Orchestra would be around $2400 monthly, potentially more cost-effective than hosting or developer time.
0 implied HN points 17 Nov 23
  1. The role of Data Product Manager is gaining importance in the data industry, with a focus on delivering value and advocating for data to drive business outcomes.
  2. Tools like Fivetran, dbt, Snowflake, and platforms like Orchestra are simplifying data team setups and enabling Product Managers with less technical skills to handle data initiatives effectively.
  3. Federated teams, marketplace functionalities by Databricks and Snowflake, and the evolving concept of data quality and productization are shaping the field of data management towards a more product-led approach.