The hottest ETL Substack posts right now

ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) are both core data engineering paradigms, but the rise of cloud data warehouses has reduced the need to transform data in transit.
Fivetran, a widely known data company, may be shifting back toward ETL by offering pre-built transformation features that simplify data modeling for users.
This may signal a resurgence of ETL practices in the data industry, with companies like Fivetran providing ETL-like services within their platforms.
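
To make the ETL/ELT distinction concrete, here is a minimal sketch that uses SQLite as a stand-in for a warehouse. The table names, sample rows, and cleaning rules are assumptions made for illustration, not anything taken from the post or from Fivetran's product.

```python
import sqlite3

# Hypothetical raw rows, as an extractor might hand them over.
RAW_ORDERS = [
    ("A-1", " alice@example.com ", "19.50"),
    ("A-2", "BOB@example.com", "5.00"),
]


def clean(row):
    """Normalize one raw row in application code (the 'T' of ETL)."""
    order_id, email, amount = row
    return (order_id, email.strip().lower(), float(amount))


def run_etl(conn):
    """ETL: transform in the pipeline, then load only the cleaned rows."""
    conn.execute("CREATE TABLE orders_etl (order_id TEXT, email TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders_etl VALUES (?, ?, ?)",
        [clean(r) for r in RAW_ORDERS],
    )


def run_elt(conn):
    """ELT: load raw rows untouched, then transform with SQL in the warehouse."""
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, email TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT order_id,
               lower(trim(email)) AS email,
               CAST(amount AS REAL) AS amount
        FROM raw_orders
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
```

Both paths end with the same cleaned table. The difference is where the transformation runs and whether the raw rows are kept in the warehouse for later re-modeling.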

The article walks through building a data release pipeline to analyze HubSpot data using Coalesce, a no-code ELT tool that runs on Snowflake.
One key challenge was HubSpot's data model, which made it hard to consolidate form-fill data and messages into a meaningful view.
Setting up Coalesce involves defining storage mappings, granting access to Coalesce users, and handling environments carefully so that work in development does not overwrite data in production.

Data teams are increasingly turning to low-code solutions to streamline data release pipelines; many still use tools like Airflow but question the need to write extensive code and maintain infrastructure.
The complexity of the cloud environment has led to a proliferation of specialized data tools, which makes orchestrating data pipelines harder and raises the importance of governance, data quality, and scalability.
Tools like dbt Core and Hightouch are already integrated into many data tools, which simplifies orchestration and suggests that the future of data architecture might combine workflow orchestrators with efficient data quality checks.

Real-time data streaming from databases like MySQL to data warehouses such as Snowflake can significantly reduce analytics latency, making data processing faster and more efficient.
Streamkap positions itself as a cost-effective option for streaming ETL, promising to be both faster and more affordable than established tools like Fivetran.
Implementing Streamkap in data architectures can lead to substantial improvements, such as reducing data update lag to under 5 minutes and delivering real-time analytics value for customers, showcasing the impact of cutting-edge data technology.
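
The general idea behind streaming changes from a database to a warehouse can be sketched as applying change events to a replica and measuring how stale it is. This is only an illustration under stated assumptions: the event shape (`op`, `id`, `row`, `committed_at`) and the in-memory replica are hypothetical and do not describe how Streamkap or any specific tool works.

```python
from datetime import datetime, timedelta, timezone


def apply_change(replica, event):
    """Apply one change event to the replica, keyed by primary key."""
    if event["op"] == "delete":
        replica.pop(event["id"], None)
    else:
        # Inserts and updates are both treated as upserts.
        replica[event["id"]] = event["row"]


def lag_seconds(event, applied_at):
    """Seconds between the source commit and the moment the change landed."""
    return (applied_at - event["committed_at"]).total_seconds()


t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# Hypothetical change events, in source commit order.
events = [
    {"op": "insert", "id": 1, "row": {"status": "new"}, "committed_at": t0},
    {"op": "update", "id": 1, "row": {"status": "paid"},
     "committed_at": t0 + timedelta(seconds=30)},
    {"op": "insert", "id": 2, "row": {"status": "new"},
     "committed_at": t0 + timedelta(seconds=45)},
    {"op": "delete", "id": 2, "row": None,
     "committed_at": t0 + timedelta(seconds=60)},
]

replica = {}
# Assume the whole batch lands in the warehouse at this instant.
applied_at = t0 + timedelta(seconds=90)
for event in events:
    apply_change(replica, event)

worst_lag = max(lag_seconds(e, applied_at) for e in events)
under_five_minutes = worst_lag < 5 * 60
```

The lag of the oldest unapplied change is the number that a "data update lag under 5 minutes" target refers to in this sketch.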

The tech industry is experiencing pushback against the complexity of recent years.
The gap between the cost of data solutions and their value is increasing, leading to a re-examination of needs.
Revisiting how data problems were solved before the current abundance of vendors can provide valuable insights for building more robust systems.

Stateful stream processors and streaming databases take different approaches to data ingestion and state persistence.
Stream processors require knowing and embedding state manipulation logic in advance, while streaming databases offer ad-hoc manipulation by consumers.
Stream processors are ideal for automated, machine-driven decision-making, while streaming databases cater to human decision-makers needing fast, ad-hoc data access.
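
The contrast can be sketched with two toy classes. This is a simplified illustration, not a model of any real engine: the class names, the spending threshold, and the event shape are all assumptions.

```python
from collections import defaultdict


class StreamProcessor:
    """State logic is fixed up front: flag users whose running total passes a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.totals = defaultdict(float)  # internal state, not open to ad-hoc queries
        self.alerts = []  # stands in for an output topic read by another system

    def process(self, event):
        user, amount = event
        self.totals[user] += amount
        if self.totals[user] > self.threshold and user not in self.alerts:
            self.alerts.append(user)


class StreamingDatabase:
    """Ingested events stay queryable; the consumer chooses the question later."""

    def __init__(self):
        self.rows = []

    def ingest(self, event):
        self.rows.append(event)

    def query(self, predicate):
        return [row for row in self.rows if predicate(row)]


# Hypothetical (user, amount) events.
events = [("ann", 40.0), ("bob", 10.0), ("ann", 70.0), ("bob", 5.0)]

processor = StreamProcessor(threshold=100.0)
database = StreamingDatabase()
for event in events:
    processor.process(event)
    database.ingest(event)

# Machine-driven outcome that was decided in advance.
alerts = processor.alerts
# Ad-hoc question that nobody planned for when the stream was set up.
bob_total = sum(amount for _, amount in database.query(lambda r: r[0] == "bob"))
```

Changing the question in the first class means redeploying the processor with new logic; in the second, it means writing a different query.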

AWS Glue is a managed service for building ETL jobs on AWS, eliminating the need to manage server infrastructure and making it easy to implement analytics pipelines.
Automating the deployment process of Glue jobs with a CI/CD pipeline, using tools like GitHub Actions, can streamline the workflow and ensure continuous deployment of ETL processes.
Within a GitHub Actions workflow, you can configure the AWS CLI, convert Jupyter notebooks to Python scripts, upload them to S3, and update the corresponding Glue jobs, making deployment repeatable and scalable.
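
The steps above could be scripted roughly as follows. This is a hedged sketch, not the post's actual workflow: it extracts code cells with the standard library instead of `jupyter nbconvert --to script` (which is the usual tool and also handles notebook magics), and it only builds the AWS CLI commands rather than running them. The bucket, key, job name, and role ARN are hypothetical placeholders, and the fields placed in `--job-update` are an assumption about a typical Spark job definition.

```python
import json


def notebook_to_script(notebook_json):
    """Concatenate the code cells of an .ipynb document into one Python script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format allows source to be a string or a list of strings.
        text = source if isinstance(source, str) else "".join(source)
        if text.strip():
            chunks.append(text.rstrip())
    return "\n\n".join(chunks) + "\n"


def deployment_commands(script_path, bucket, key, job_name, role_arn):
    """Build (but do not run) the CLI calls for uploading and updating the job."""
    script_uri = f"s3://{bucket}/{key}"
    # Assumed job definition; check which fields your existing job needs,
    # since an update may reset settings that are left out.
    job_update = {
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_uri,
            "PythonVersion": "3",
        },
    }
    return [
        ["aws", "s3", "cp", script_path, script_uri],
        ["aws", "glue", "update-job", "--job-name", job_name,
         "--job-update", json.dumps(job_update)],
    ]


# Tiny in-memory notebook used only for demonstration.
sample_notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["import sys\n", "print(sys.argv)"]},
        {"cell_type": "code", "source": "x = 1\n"},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

script = notebook_to_script(sample_notebook)
commands = deployment_commands(
    "job.py", "my-bucket", "scripts/job.py", "my-glue-job",
    "arn:aws:iam::123456789012:role/HypotheticalGlueRole",
)
```

In a workflow, each command list could be passed to `subprocess.run(..., check=True)` after the credentials step, so that a failed upload stops the job update.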