Ju Data Engineering Newsletter

The Ju Data Engineering Newsletter explores advancements in data engineering technologies, practices, and tools. It tracks the evolution of data storage, processing, and querying through tools such as Pandas, DuckDB, and Apache Iceberg, focusing on performance, cost efficiency, and best practices for building and managing modern data stacks.

Topics: Data Engineering Best Practices, Data Processing Technologies, Cost Efficiency in Data Operations, Data Storage Solutions, Cloud Data Warehouses, Machine Learning Applications, Data Quality Management, Data Pipeline Orchestration, Serverless Architectures, SQL and Data Transformation

The hottest Substack posts of Ju Data Engineering Newsletter

And their main takeaways
396 implied HN points 28 Oct 24
  1. Improving the user interface is crucial for more teams to use Iceberg, especially those that use Python for their data work.
  2. PyIceberg, a Python implementation of the Iceberg spec, is evolving quickly and already supports a range of catalog and file system types.
  3. While PyIceberg makes it easy to read and write data, it still has limitations compared to using Iceberg with Spark, such as handling deletes and managing metadata (a read/write sketch follows this list).
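For a feel of the PyIceberg workflow these takeaways describe, here is a minimal read/write sketch; the catalog URI and table name are placeholders, and the exact properties (type, credentials) depend on your catalog setup:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Hypothetical REST catalog and table; swap in your own catalog settings.
catalog = load_catalog("default", uri="http://localhost:8181")
table = catalog.load_table("analytics.events")

# Read: scan with a filter, materialize as Arrow (or .to_pandas()).
batch = table.scan(row_filter="event_date >= '2024-01-01'").to_arrow()

# Write: append an Arrow table whose schema matches the Iceberg table.
table.append(pa.Table.from_pylist([{"event_id": 1, "event_date": "2024-01-02"}]))
```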
515 implied HN points 17 Oct 24
  1. The use of Iceberg allows for separate storage and compute, making it easier to connect single-node engines to the data pipeline without needing extra steps.
  2. There are different approaches to integrating single-node engines, including running all processes in one worker or handling each transformation with separate workers.
  3. Partitioning data improves efficiency by allowing independent processing of smaller chunks, which eases memory limits and speeds up data handling (sketched below).
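As one illustration of the partition-at-a-time pattern (an assumption of how it might look, not the post's exact setup), DuckDB's iceberg extension can scan a single partition per iteration; the S3 path and partition column are placeholders, and S3 credential configuration is omitted:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# Process one day-partition at a time so each chunk fits in memory.
for day in ["2024-01-01", "2024-01-02"]:
    con.sql(f"""
        SELECT '{day}' AS day, count(*) AS n
        FROM iceberg_scan('s3://bucket/warehouse/events')
        WHERE event_date = '{day}'
    """).show()
```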
297 implied HN points 16 Nov 23
  1. Innovations like Pandas v2, cuDF, Polars, and DuckDB are addressing the limitations data engineers face when moving from proof of concept to production.
  2. Engines like Pandas v2, Polars, and DuckDB leverage Apache Arrow for improved performance and interoperability (see the sketch below).
  3. Performance benchmarks on TPC-H data show varying speed and efficiency among Pandas v2, Polars, and cuDF depending on dataset size.
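The Arrow interoperability point is easy to see in code; a small sketch passing one Arrow table between pandas v2, Polars, and DuckDB:

```python
import duckdb
import pandas as pd
import polars as pl
import pyarrow as pa

arrow = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

pdf = arrow.to_pandas(types_mapper=pd.ArrowDtype)  # pandas v2, Arrow-backed dtypes
pldf = pl.from_arrow(arrow)                        # Polars, zero-copy where possible
rel = duckdb.sql("SELECT x, y FROM arrow")         # DuckDB scans the Arrow table in place

print(pdf.dtypes, pldf.schema, rel.fetchall(), sep="\n")
```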
238 implied HN points 29 Nov 23
  1. Moving from BI tools to data apps involves considering different distribution requirements for internal and external use.
  2. Cloud data warehouses may not be the best fit for feeding external applications, given their per-query costs and high latency.
  3. Off-the-shelf data backend platforms offer solutions for low-latency querying, tenant-isolated access control, and APIs for various clients.
158 implied HN points 31 Jan 24
  1. Decentralization of the data stack led to fragmented development and high setup and maintenance costs, encouraging vendors to centralize solutions horizontally.
  2. Orchestrators offer central control over the data stack by coordinating tasks, running computations, and integrating with tools like dbt for end-to-end lineage.
  3. Cloud warehouses and ELT tools are also expanding horizontally, with Snowflake offering orchestration features and services like Airbyte integrating with dbt for transformations.
257 implied HN points 01 Nov 23
  1. Teams are interested in combining different engines to reduce cost and gain flexibility.
  2. DuckDB can offer significant cost savings compared to Snowflake for certain queries.
  3. Data teams are beginning to have an easier time transpiling SQL or DataFrame code across different engines (one approach is sketched below).
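As one illustration of cross-engine transpilation (the post does not prescribe a tool; SQLGlot is a common choice):

```python
import sqlglot

# Rewrite a Snowflake-dialect query for DuckDB.
snowflake_sql = "SELECT DATEADD(day, 7, order_date) AS due FROM orders"
print(sqlglot.transpile(snowflake_sql, read="snowflake", write="duckdb")[0])
# Exact output varies with the SQLGlot version.
```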
138 implied HN points 13 Dec 23
  1. Developer experience was improved by allowing data engineers and application developers to work in their preferred tooling.
  2. Cost reduction was targeted through metrics pre-aggregation and live querying optimizations.
  3. Maintaining data integrity, especially schema drift, was highlighted as a challenge to address in future projects.
198 implied HN points 22 Sep 23
  1. dbt needs to be orchestrated with tools like Dagster to materialize models in a DAG structure.
  2. Task-oriented schedulers become limiting with a large number of dbt models, causing errors.
  3. Dagster offers a declarative approach to data pipelines, enabling freshness-based materialization and easy backfilling of data (see the sketch below).
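A hedged sketch of the declarative Dagster-plus-dbt pattern; the dagster-dbt APIs have shifted across versions, so treat the imports, manifest path, and project_dir as indicative rather than exact:

```python
from pathlib import Path

from dagster import AssetExecutionContext, Definitions
from dagster_dbt import DbtCliResource, dbt_assets

@dbt_assets(manifest=Path("target/manifest.json"))  # every dbt model becomes an asset
def my_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # Dagster materializes the models declaratively and streams events back.
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[my_dbt_assets],
    resources={"dbt": DbtCliResource(project_dir=".")},
)
```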
99 implied HN points 10 Jan 24
  1. Focus on improving efficiency of existing platforms over adding new tools to the stack.
  2. Minimize time spent on low-value tasks to reduce data team costs.
  3. Enhance developer experience with solutions for local code testing, easy environment switching, and simplifying data lookup.
178 implied HN points 12 Sep 23
  1. Using popular data warehouses like Snowflake, BigQuery, and Databricks can lead to high credit costs.
  2. Consider if scalability is necessary for all queries to optimize costs when using data warehouses.
  3. Explore alternative solutions like Apache Iceberg for storage and DuckDB for light compute to reduce expenses in data stacks.
198 implied HN points 24 Jul 23
  1. Data contracts help manage data quality by specifying data formats exchanged between providers and consumers
  2. Data contracts should focus on identifying unusual data patterns rather than setting strict input rules
  3. Using dbt, a data quality monitoring system can be built to continuously check data against the contract and prevent bad data from progressing further in the pipeline
79 implied HN points 24 Jan 24
  1. When setting up a data platform, consider hiring a data engineer instead of a data scientist.
  2. Choosing between off-the-shelf SaaS and self-hosted open-source tools can impact platform scalability and costs.
  3. Consider the pros and cons of usage-based pricing versus one-off purchases when setting up a data stack.
119 implied HN points 16 Oct 23
  1. Building a multi-engine data stack based on Apache Iceberg is gaining traction in the industry.
  2. Initiatives like Malloy are proposing new query languages that compile to SQL for unified data access.
  3. Ibis offers a unified DataFrame API on top of various compute engines for easier data manipulation (see the sketch below).
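A minimal Ibis sketch of the unified DataFrame API; this assumes the default DuckDB backend is installed, and the same expression can target Postgres, BigQuery, and other backends:

```python
import ibis

# An in-memory table; in practice this would come from a backend connection.
t = ibis.memtable({"status": ["new", "new", "done"], "amount": [10, 20, 5]})

expr = t.group_by("status").aggregate(total=t.amount.sum())
print(expr.execute())  # executes on the default DuckDB backend
```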
99 implied HN points 23 Oct 23
  1. Iceberg and other open table formats like Delta and Hudi bring innovation to modern data platforms
  2. Open table formats provide ACID consistency, schema evolution, and time travel capabilities
  3. Reading and writing data in Iceberg involves metadata files, manifest files, catalogs, and partition specs
99 implied HN points 08 Aug 23
  1. Developers should thoroughly and rapidly test new features before deploying them.
  2. Using multiple development environments isolates code testing from the production environment.
  3. Modern data platforms allow for smarter development environments where new developments can be directly tested in production.
39 implied HN points 07 Feb 24
  1. BMW can produce customized cars at a large scale, which is rare in the industry.
  2. Car manufacturing uses push-pull strategies, automation, and quality checks to produce high-quality customized products.
  3. Data platform design can learn from car manufacturing in balancing speed, accuracy, and scaling by using push-pull systems.
39 implied HN points 20 Dec 23
  1. Shorter posts are more effective, aim for max 1000 words
  2. Use images to engage readers before they read the text
  3. Practical content is more valuable than theoretical for readers
39 implied HN points 06 Dec 23
  1. S3 is gaining popularity as a storage backend for modern applications, separating storage from compute in a serverless architecture.
  2. Applications like Snowflake, Neon, and Warpstream are leveraging S3 for cloud data warehousing, serverless databases, and streaming services.
  3. Advantages of building on top of S3 include durability, near-unlimited concurrent reads, usage-based pricing, time travel for data queries, and improved security by keeping data in the user's account.
39 implied HN points 02 Oct 23
  1. Optimizing compute costs in data warehouses by using multiple compute engines can lead to cost savings.
  2. Combining commercial data warehouses like Snowflake, Redshift, and BigQuery with lightweight open-source engines like DuckDB can create an efficient data stack.
  3. By running cost-saving experiments and comparing compute times, one can estimate potential cost reductions in data warehouse operations.
59 implied HN points 10 May 23
  1. Consider a hybrid approach of serverless and monolithic architectures for best results
  2. Choose between Lambda and containers based on load profile, job profile, and organizational context
  3. Transitioning from Lambda to containers is possible as the service evolves and needs change
39 implied HN points 15 Aug 23
  1. When building a new data platform, focus on gaining trust by delivering value with limited resources.
  2. Prioritize creating a catalog of necessary data sources to focus on in the initial stages of building your data stack.
  3. Ensure you have a scheduler to synchronize components of your data stack and invest time in observability, alerting, and data testing.
39 implied HN points 04 Jul 23
  1. SQL is a faster and simpler way to write data transformations than tools like Pandas or PySpark (compare the two in the sketch after this list).
  2. Declarative data pipelines can be built in SQL, with cloud data warehouses like Snowflake embracing this paradigm.
  3. SQL is expected to play a significant role in the future of data platforms and ecosystems, with abstractions built around its core API.
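To make the comparison concrete, here is the same aggregation written once as SQL (run via DuckDB) and once in Pandas; a sketch of the ergonomics, not a benchmark:

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"region": ["eu", "eu", "us"], "sales": [10, 20, 30]})

# SQL version: DuckDB can query the local DataFrame by name.
sql_result = duckdb.sql(
    "SELECT region, sum(sales) AS total FROM df GROUP BY region"
).df()

# Pandas version of the same transformation.
pandas_result = (
    df.groupby("region", as_index=False)["sales"]
    .sum()
    .rename(columns={"sales": "total"})
)
```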
39 implied HN points 04 Apr 23
  1. The AI product Interiobot.com generates interior design concepts based on room photos.
  2. The development of the prototype took around 5 weeks with some assistance required for front-end work.
  3. The tech stack for Interiobot.com included React, Tailwind CSS, AWS Amplify, Step Functions, Serverless Framework, and Replicate for GPU serverless inference.
39 implied HN points 14 Jan 23
  1. Idempotency ensures the same result no matter how many times an operation is applied.
  2. Lambda functions that process SQS messages should be idempotent to avoid duplicate processing.
  3. Use the SQS deduplication ID, manage Lambda timeouts, and track event status in DynamoDB to achieve idempotency (see the sketch below).
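One way to wire this up (an assumption; the post's exact implementation may differ) is the idempotency utility in AWS Lambda Powertools, backed by a DynamoDB table; the table name and process() logic are placeholders:

```python
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    idempotent,
)

# Placeholder table; must exist with Powertools' expected key schema.
persistence = DynamoDBPersistenceLayer(table_name="idempotency-store")

def process(event):
    ...  # hypothetical business logic with side effects

@idempotent(persistence_store=persistence)
def handler(event, context):
    # A repeated delivery of the same event returns the cached result
    # instead of re-running process().
    process(event)
    return {"status": "done"}
```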
19 implied HN points 06 Jun 23
  1. The Data Engineering AI Agent needs to learn new frameworks to stay up-to-date.
  2. Provide a corpus of documents to the agent in three steps: loading, selecting, and passing the relevant document.
  3. Implement the logic using Langchain framework components for efficient document retrieval and indexing (sketched below).
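A sketch of the three-step flow using Langchain imports from the mid-2023 era of the post; module paths moved in later releases, and the file name and query are placeholders:

```python
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

docs = TextLoader("framework_docs.md").load()                 # 1. load the corpus
chunks = RecursiveCharacterTextSplitter(chunk_size=1000).split_documents(docs)
index = FAISS.from_documents(chunks, OpenAIEmbeddings())      # 2. index for selection
relevant = index.as_retriever().get_relevant_documents(       # 3. pass relevant docs on
    "How do I define a model?"
)
```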
19 implied HN points 03 May 23
  1. Startups can use AI to build data pipelines at a low cost and offer them as a service to corporations
  2. Large language models (LLMs) can enhance data cataloging and decentralized data management in organizations
  3. Challenges in transitioning language models from demonstrations to production include prompt management, cost estimation, and finding the right balance between prompting and fine-tuning
19 implied HN points 26 Apr 23
  1. Data engineering tasks can be time-consuming, especially data retrieval from APIs with various requirements.
  2. Autonomous agents like Auto-GPT have potential to automate tasks but may need detailed prompts and have limitations.
  3. Guiding LLMs with specific prompts is crucial for accurate results, and utilizing best-practice prompts can improve agent performance.
19 implied HN points 18 Apr 23
  1. Autonomous agents like BabyAGI and Auto-GPT involve multiple GPT sessions interacting to achieve tasks.
  2. Langchain framework simplifies interactions between language models like GPT and data sources like APIs.
  3. Applications of autonomous agents include coding agents, market research agents, and more in various fields.
19 implied HN points 11 Apr 23
  1. Build your project on AWS for speed and scalability without technical debt.
  2. Utilize AWS Amplify to easily set up your backend, frontend, and data model.
  3. Leverage Step Functions to handle workflows efficiently by combining operations.
19 implied HN points 24 Feb 23
  1. Version control in dbt allows for collaborative work without confusion or errors.
  2. dbt's Jinja templating feature enhances SQL code flexibility and reusability.
  3. dbt's built-in testing framework helps ensure data pipeline integrity and accuracy.
19 implied HN points 18 Feb 23
  1. Snowflake separates data compute and storage, allowing cost-efficient serverless computing
  2. Snowflake offers a marketplace for shared data and future app store integration, transforming data sharing
  3. Consider potential drawbacks like cost control and infrastructure tool limitations when using Snowflake
19 implied HN points 27 Jan 23
  1. Consider managed services like AWS Step Functions to reduce code complexity and maintenance.
  2. Open-source tools like Airbyte can simplify data integration by providing a wide range of pre-built connectors.
  3. Using open-source products can help decentralize maintenance efforts and mitigate risks in data engineering projects.
19 implied HN points 20 Jan 23
  1. Learn about idempotency in data engineering, for example with the Python library aws-lambda-powertools.
  2. Stay informed on data stack usage trends and challenges from a survey in SeattleDataGuy's newsletter.
  3. Discover tools like Plural that automate the deployment of complete data stacks on Kubernetes in your cloud.
19 implied HN points 06 Jan 23
  1. Coupling an SQS queue with a Lambda function on AWS can help in handling a large number of events and controlling event processing
  2. Processing events from an SQS queue in batch mode can be inefficient when the Lambda function fails, because the whole batch is redelivered and already-processed events run again
  3. Implementing a failure handling strategy for Lambda functions is crucial for maintaining the reliability and efficiency of a data processing pipeline on AWS (one mitigation is sketched below)
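One common mitigation, shown here as an assumption rather than the post's exact strategy, is SQS partial batch responses, so that only the failed messages are redelivered:

```python
def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            # Report only this message as failed; the rest of the
            # batch counts as successfully processed.
            failures.append({"itemIdentifier": record["messageId"]})
    # Requires ReportBatchItemFailures enabled on the event source mapping.
    return {"batchItemFailures": failures}

def process(body):
    ...  # hypothetical processing logic
```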
2 HN points 22 Nov 23
  1. Software development is evolving towards a declarative paradigm in data engineering.
  2. Cloud processing tools like AWS EventBridge are simplifying data pipeline creation.
  3. Data warehouses, like Snowflake, are introducing constructs that streamline data pipeline development.
1 HN point 23 Aug 23
  1. dbt Labs changed their pricing model to limit the number of models built per month instead of basing it on the number of users.
  2. The change in pricing strategy was likely influenced by dbt Labs' recent funding round at a $4.2 billion valuation and the pressure to increase revenue.
  3. Community members are exploring alternative platforms like Airflow, Prefect, and Dagster due to the perceived excessive cost of dbt Labs' new pricing model.
0 implied HN points 26 May 23
  1. SST simplifies and streamlines Lambda development process, making it a pleasure to work with compared to other frameworks.
  2. SST allows for Live Lambda Development, enabling local testing of Lambda functions within the AWS environment.
  3. SST combines the simplicity of its constructs with the power and flexibility of AWS CDK, enhancing both high and low levels of abstraction in development.
0 implied HN points 13 Mar 23
  1. Infrastructure as Code (IaC) allows managing cloud assets with code instead of manually through console.
  2. IaC frameworks differ based on factors like open source, versatility, and abstraction levels.
  3. Abstraction in IaC frameworks like Terraform and Serverless can significantly impact the amount of code needed for infrastructure management.
0 implied HN points 18 Mar 23
  1. Stanford released an open-source ChatGPT alternative called Alpaca 7B to democratize large language models
  2. European companies can fine-tune open-source models like Alpaca for specific use-cases and privacy
  3. Alpaca is currently non-commercial but a commercial version may emerge in the future
0 implied HN points 06 Feb 23
  1. Event-Driven Architecture decouples components for scalability and fault tolerance.
  2. Amazon EventBridge simplifies event handling by centralizing event routing and processing (see the sketch after this list).
  3. EventBridge Pipes enable direct integration between services without the need for intermediary Lambda functions.
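A minimal boto3 sketch of the decoupling idea: the producer publishes to a bus and never learns who consumes the event; the bus name and payload are placeholders:

```python
import json

import boto3

events = boto3.client("events")
events.put_events(
    Entries=[
        {
            "EventBusName": "orders-bus",   # hypothetical bus
            "Source": "shop.orders",
            "DetailType": "OrderPlaced",
            "Detail": json.dumps({"order_id": "o-123"}),
        }
    ]
)
# Rules on the bus (or an EventBridge Pipe) fan the event out to any
# number of consumers without the producer knowing about them.
```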