Data Engineering Central

Data Engineering Central is a platform dedicated to discussing the intricacies and best practices of data engineering, covering languages like Python and SQL alongside topics such as data pipelines, Machine Learning Operations (MLOps), and managing workplace stress. It delves into specific technologies, patterns, and methodologies pivotal to the field, providing insights for both novice and experienced data engineers.

Python Development, SQL Indexing, Command Line Tools, Data Analytics, Data Pipelines, Machine Learning Operations (MLOps), Large Language Models (LLMs), Data Engineering Tools, Workplace Well-being, Code Complexity, Data Modeling, Unit Testing, Data Quality, Data Structures and Algorithms, Cloud Services, Feature Stores, Data Architecture, Career Development

The hottest Substack posts of Data Engineering Central

And their main takeaways
609 implied HN points 19 Jan 24
  1. Python is a versatile language great for rapid iteration, prototyping, and one-off scripting.
  2. Python can be challenging for developers due to pitfalls like the lack of strict typing and its sometimes surprising scoping rules.
  3. Best practices in Python development include clean, maintainable code, thorough testing, and a strong peer-review culture for code quality (a small example of typed, tested code is sketched below).
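As a minimal sketch of the guardrails those takeaways point at, here is a hypothetical `parse_row` helper with explicit type hints plus a small test that can run under pytest or on its own; the function and field names are made up for illustration.

```python
from __future__ import annotations


def parse_row(raw: dict[str, str]) -> dict[str, float | str]:
    """Hypothetical helper: convert raw string fields into typed values."""
    return {
        "customer_id": raw["customer_id"],
        "amount": float(raw["amount"]),
    }


def test_parse_row_converts_amount() -> None:
    row = parse_row({"customer_id": "c-42", "amount": "19.99"})
    assert row["amount"] == 19.99


if __name__ == "__main__":
    test_parse_row_converts_amount()
    print("ok")
```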
589 implied HN points 17 Jan 24
  1. Indexes are crucial for improving performance in SQL operations and data access.
  2. Clustered and non-clustered indexes are the two main types to understand in SQL indexing.
  3. Understanding use cases and query access patterns is key to designing effective indexes for data warehouses.
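As a rough illustration of that last point, the sketch below uses SQLite from Python (table and column names are made up). In SQLite the rowid behaves like a clustered key, while the secondary index plays the role of a non-clustered index matched to the query's access pattern; `EXPLAIN QUERY PLAN` shows the full scan turning into an index search.

```python
import sqlite3

# In-memory SQLite database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(100_000)],
)

query = "SELECT amount FROM orders WHERE customer_id = 42"

# Without a secondary index, the plan is a full table scan.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# A non-clustered (secondary) index matched to the access pattern enables a seek.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```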
511 implied HN points 08 Jan 24
  1. Learning the command line is still important in the age of cloud computing because it enables faster development and automation.
  2. Command line tools and commands are similar across operating systems, so focusing on general concepts matters more than system-specific knowledge.
  3. Using the command line allows you to work with popular tools like Docker, Kubernetes, and AWS efficiently, making it crucial for engineers in high-performance teams.
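One small, hedged sketch of the automation angle: once you know a CLI tool, it can be scripted from Python with `subprocess`. The `aws s3 ls` command is real, but the bucket name is a placeholder, and the same pattern applies to `docker` or `kubectl`.

```python
import subprocess

# Placeholder bucket; swap in any CLI tool (docker, kubectl, aws) and arguments.
try:
    result = subprocess.run(
        ["aws", "s3", "ls", "s3://my-example-bucket/"],
        capture_output=True,
        text=True,
        check=False,
    )
except FileNotFoundError:
    print("aws CLI is not installed on this machine")
else:
    if result.returncode != 0:
        print("command failed:", result.stderr.strip())
    else:
        print(result.stdout)
```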
432 implied HN points 15 Jan 24
  1. The concept of Write-Audit-Publish (WAP) is being discussed for data pipelines.
  2. The post explores whether the WAP pattern is worth implementing and considers alternative approaches.
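For readers new to the pattern, here is a minimal, storage-agnostic sketch of the Write-Audit-Publish idea using SQLite; the table names and checks are made up, and real implementations usually lean on branch- or partition-level swaps rather than table renames.

```python
import sqlite3


def write_audit_publish(conn: sqlite3.Connection, rows: list[tuple[int, float]]) -> None:
    """Write a batch to staging, audit it, and only then swap it into place."""
    # Write: land the new batch in a staging table, never directly in prod.
    conn.execute("DROP TABLE IF EXISTS orders_staging")
    conn.execute("CREATE TABLE orders_staging (order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders_staging VALUES (?, ?)", rows)

    # Audit: run checks against the staged data before anyone can query it.
    nulls = conn.execute(
        "SELECT COUNT(*) FROM orders_staging WHERE amount IS NULL"
    ).fetchone()[0]
    total = conn.execute("SELECT COUNT(*) FROM orders_staging").fetchone()[0]
    if total == 0 or nulls > 0:
        raise ValueError("audit failed; staging data was not published")

    # Publish: atomically replace the production table with the audited batch.
    conn.execute("DROP TABLE IF EXISTS orders")
    conn.execute("ALTER TABLE orders_staging RENAME TO orders")
    conn.commit()


conn = sqlite3.connect(":memory:")
write_audit_publish(conn, [(1, 9.99), (2, 20.0)])
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone())
```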
117 implied HN points 01 Feb 24
  1. Data architecture is an important topic for data engineers to understand.
  2. Choosing tools like Airflow, Snowflake, and Databricks is not the only approach to data architecture.
  3. Approaching data architecture without a strategic plan can lead to challenges within an organization or team.
393 implied HN points 15 May 23
  1. Working on Machine Learning as a Data Engineer is not as hard as it seems; it falls somewhere in the middle in terms of difficulty.
  2. Machine Learning work for Data Engineers focuses on MLOps like feature stores, model prediction, automation, and metadata storage.
  3. The key aspects of MLOps include automating tasks, using tools like Apache Airflow, and managing metadata for a stable ML environment.
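Since the post centers MLOps on automation with tools like Apache Airflow, here is a minimal sketch of what that might look like, assuming a recent Airflow 2.x and its TaskFlow API; the DAG name, task bodies, and paths are all illustrative placeholders.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def ml_batch_pipeline():
    @task
    def build_features() -> str:
        # Placeholder: compute features and return where they were written.
        return "s3://example-bucket/features/latest"

    @task
    def score_and_log(features_path: str) -> dict:
        # Placeholder: load a model, score the batch, and record run metadata.
        return {"features_path": features_path, "rows_scored": 0}

    score_and_log(build_features())


ml_batch_pipeline()
```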
255 implied HN points 10 Jul 23
  1. Data Modeling involves distinct approaches for relational databases and Lake Houses.
  2. Key concepts like logical normalization, business use case analysis, and physical data localization are crucial for effective data modeling.
  3. Understanding the 'grain' of the data, the level of detail a single record represents, is essential for a successful data model (see the sketch below).
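A small sketch of the grain idea, with made-up table and column names: the fact table below is declared at one row per order line, and coarser grains (one row per order) are derived by aggregating up rather than modeled separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Grain: one row per order line, the lowest level of detail this model captures.
conn.execute("""
    CREATE TABLE fact_order_line (
        order_id   INTEGER,
        line_no    INTEGER,
        product_id INTEGER,
        quantity   INTEGER,
        amount     REAL
    )
""")
conn.executemany(
    "INSERT INTO fact_order_line VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 10, 2, 19.98), (1, 2, 11, 1, 5.00), (2, 1, 10, 1, 9.99)],
)

# Coarser grains (one row per order) are always recoverable by aggregating up.
for row in conn.execute(
    "SELECT order_id, SUM(amount) AS order_total FROM fact_order_line GROUP BY order_id"
):
    print(row)
```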
275 implied HN points 05 Jun 23
  1. Stress, anxiety, and hardship are common in the workplace, including in Data Teams.
  2. Focus on personal well-being to reduce stress at work: manage finances, exercise, get fresh air, control news intake, invest in personal development, and eat better.
  3. Address work-related stress by facing your workload head-on, improving communication, and pursuing professional growth and development.
294 implied HN points 10 Apr 23
  1. Airflow has been a dominant tool for data orchestration, but new tools like Prefect and Mage are challenging its reign.
  2. Prefect focuses on using Python for defining tasks and workflows, but may not offer enough differentiation from Airflow.
  3. Mage stands out for its focus on engineering best practices and providing a smoother developer experience, making it a compelling choice over Airflow for scaling up data pipelines.
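To ground the Prefect point, here is a minimal sketch assuming Prefect 2.x's `@task` and `@flow` decorators; the task bodies and flow name are placeholders, and Airflow and Mage express the same pipeline through their own abstractions.

```python
from prefect import flow, task


@task
def extract() -> list[int]:
    # Placeholder extract step.
    return [1, 2, 3]


@task
def load(rows: list[int]) -> None:
    print(f"loaded {len(rows)} rows")


@flow
def daily_pipeline() -> None:
    # Tasks are plain Python callables composed inside the flow.
    load(extract())


if __name__ == "__main__":
    daily_pipeline()
```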
216 implied HN points 13 Feb 23
  1. Data Engineers often struggle with implementing unit tests due to factors like a focus on moving fast and a historical lack of emphasis on testing.
  2. Unit testable code in data engineering involves keeping functions small, minimizing side effects, and ensuring reusability.
  3. Implementing unit tests can elevate a data team's performance and lead to better software quality and bug control.
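A minimal sketch of what "unit testable" tends to mean in practice: a small, pure transform (the `deduplicate_orders` helper below is hypothetical) with no side effects, paired with a pytest-style test.

```python
def deduplicate_orders(orders: list[dict]) -> list[dict]:
    """Keep the first occurrence of each order_id; small, pure, and easy to test."""
    seen = set()
    result = []
    for order in orders:
        if order["order_id"] not in seen:
            seen.add(order["order_id"])
            result.append(order)
    return result


def test_deduplicate_orders_keeps_first_occurrence():
    orders = [
        {"order_id": 1, "amount": 10.0},
        {"order_id": 1, "amount": 99.0},
        {"order_id": 2, "amount": 5.0},
    ]
    assert deduplicate_orders(orders) == [
        {"order_id": 1, "amount": 10.0},
        {"order_id": 2, "amount": 5.0},
    ]
```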
137 implied HN points 24 Jul 23
  1. Data Engineers may have a love-hate relationship with AWS Lambdas due to their versatility but occasional limitations.
  2. AWS Lambdas are under-utilized in Data Engineering but offer benefits like low cost, ease of use, and encouraging better practices.
  3. AWS Lambdas are handy for processing small datasets, running data quality checks, and executing quick logic while reducing architecture complexity and cost.
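As a hedged sketch of that "quick logic" use case, here is a small Lambda-style handler that runs a row-count and null check on a CSV landed in S3. The boto3 call is standard, but the event shape assumes an S3 put notification trigger and the column name is made up.

```python
import csv
import io

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Run a quick row-count and null check on a small CSV landed in S3."""
    # Assumes the function is triggered by an S3 put notification.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(body)))

    missing_ids = sum(1 for r in rows if not r.get("order_id"))
    if not rows or missing_ids:
        raise ValueError(f"quality check failed: {len(rows)} rows, {missing_ids} missing ids")

    return {"rows": len(rows), "status": "ok"}
```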
137 implied HN points 12 Jun 23
  1. Feature Stores are essential in machine learning for managing and serving features.
  2. Feature Stores provide consistency, reusability, efficiency, discoverability, and monitoring benefits.
  3. Popular Feature Store options include the Databricks Feature Store, Feast (open-source), Postgres, DynamoDB, and S3.
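Rather than sketch any one product's API, here is a deliberately tiny, hand-rolled illustration of the core idea those options share: batch jobs write precomputed features, and both training and serving read them back by entity key through the same path, which is where the consistency and reusability benefits come from. Everything below is hypothetical.

```python
from __future__ import annotations


class TinyFeatureStore:
    """Toy key-value feature store: one mapping per feature, keyed by entity id."""

    def __init__(self) -> None:
        self._features: dict[str, dict[str, float]] = {}

    def write(self, feature_name: str, values: dict[str, float]) -> None:
        # A batch job writes precomputed feature values keyed by entity id.
        self._features.setdefault(feature_name, {}).update(values)

    def read(self, feature_names: list[str], entity_id: str) -> dict[str, float]:
        # Training and online serving read through this same path, which is
        # what gives the train/serve consistency the post describes.
        return {
            name: self._features.get(name, {}).get(entity_id, 0.0)
            for name in feature_names
        }


store = TinyFeatureStore()
store.write("orders_last_30d", {"customer_42": 7.0})
print(store.read(["orders_last_30d", "avg_order_value"], "customer_42"))
```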
157 implied HN points 24 Apr 23
  1. Brittleness in data pipelines can lead to various issues like data quality problems, difficult debugging, and slow performance.
  2. To overcome brittle pipelines, focus on addressing data quality issues through monitoring, sanity checks, and using tools like Great Expectations.
  3. Development issues such as lack of tests, poor documentation, and bad code practices contribute to brittle pipelines; practices like unit testing and using Docker can help improve pipeline reliability.
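The monitoring and sanity-check advice does not require a framework to get started; below is a hand-rolled sketch of the idea (column names and thresholds are made up), with a tool like Great Expectations being the natural next step once the checks multiply.

```python
def sanity_check(rows: list[dict], min_rows: int = 1) -> list[str]:
    """Return human-readable failures; an empty list means the batch passes."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"expected at least {min_rows} rows, got {len(rows)}")
    null_amounts = sum(1 for r in rows if r.get("amount") is None)
    if null_amounts:
        failures.append(f"{null_amounts} rows have a null amount")
    negatives = sum(1 for r in rows if (r.get("amount") or 0) < 0)
    if negatives:
        failures.append(f"{negatives} rows have a negative amount")
    return failures


# Run before publishing a batch; fail the pipeline loudly instead of silently.
for problem in sanity_check([{"amount": 10.0}, {"amount": None}]):
    print("quality check failed:", problem)
```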
157 implied HN points 13 Mar 23
  1. Understanding Data Structures and Algorithms is important for becoming a better engineer, even if you may not use them daily.
  2. Linked Lists are a linear data structure where elements are not stored contiguously in memory but are linked using pointers.
  3. Creating a simple Linked List in Rust involves defining nodes with values and pointers to other nodes, creating a LinkedList to hold these nodes, and then linking them to form a chain.
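The original post walks through this in Rust; purely as a rough Python analogue of the same structure (nodes holding a value plus a pointer to the next node, chained from a head), here is a short sketch.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    next: Optional[Node] = None  # pointer to the next node, or None at the end


class LinkedList:
    def __init__(self) -> None:
        self.head: Optional[Node] = None

    def push_front(self, value: int) -> None:
        # The new node points at the old head, then becomes the head.
        self.head = Node(value, self.head)

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.value
            node = node.next


chain = LinkedList()
for v in (3, 2, 1):
    chain.push_front(v)
print(list(chain))  # [1, 2, 3]
```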
137 implied HN points 20 Mar 23
  1. Future proof yourself against AI to stay relevant in the changing landscape of software engineering.
  2. There are three types of people when it comes to AI and programming: those who don't use AI and dismiss it, those who use it to enhance their work, and those who rely on it completely and may become less effective engineers.
  3. The impact of AI on software engineering is inevitable and will lead to changes in the field over time.