The hottest Data Engineering Substack posts right now

And their main takeaways

What is open-source workflow orchestration and when should you move on?

The Orchestra Data Leadership Newsletter • 0 implied HN points • 23 Oct 23

🕹 Technology Data Engineering

Open-source workflow orchestration tools like Apache Airflow have been around for a long time and offer flexibility in developing, scheduling, and monitoring batch-oriented workflows.
Specialized tools are emerging for data operations to improve quality, moving away from the Swiss Army Knife approach of general-purpose orchestration tools.
When considering upgrading from open-source orchestration tools, evaluate if the tool effectively handles monitoring, metadata gathering, and other complex data operation needs; specialized tools may be more suitable in such cases.

Are Lakehouses a joke or is Databricks the endgame??

The Orchestra Data Leadership Newsletter • 0 implied HN points • 19 Oct 23

🕹 Technology Data Engineering

The evolution of data engineering tools can be likened to the concept of a limit in mathematics: processes tend toward 'streaming' use cases, and Lakehouses play a role in this transition.
Databricks, developed by the creators of Apache Spark, excels in loading data from Data Lakes, handling schemas, and treating data sources as streams, making it a valuable tool for data processing.
While Databricks offers advanced capabilities in data ingestion, transformation, and machine learning operations, there may still be a need for custom infrastructure for specific real-time use cases, leading to a nuanced evaluation of tools like Databricks in the data engineering landscape.

GitTrends - May 25 2024

GitTrends • 0 implied HN points • 26 May 24

🕹 Technology Data Engineering

Top trending GitHub repositories cover a wide range of topics from AI, programming languages, UI libraries, search engines, to automation tools and more.
Some repositories, like llama3-from-scratch and geektime-books, showed significant growth in popularity week over week, indicating strong community interest.
The growth rates of various repositories highlight the diverse interests within the GitHub community, spanning large language models, AI applications, development tools, productivity apps, and even anti-bloatware tools.

Gradient Flow #32: Data Cascades, Demand for Data Engineers, Exploiting ML models

Gradient Flow • 0 implied HN points • 08 Apr 21

🕹 Technology Data Engineering

Data quality is essential for great AI products and services, which underscores the need for validation and testing tools like Great Expectations.
There is a rising demand for data engineers, illustrated by the funding announcements of Streamlit, Flatfile, and Snorkel.
Exploitable machine learning pickle files are a security concern; the issue discusses an open-source tool for reverse engineering and testing these files.
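To make the pickle concern concrete, here is a minimal sketch that scans a pickle for opcodes able to import or call objects, using only Python's standard `pickletools` module and never loading the file. It is not the tool discussed in the issue; the `Malicious` class, the `RISKY_OPCODES` set, and the `suspicious_opcodes` helper are hypothetical names chosen for illustration.

```python
import pickle
import pickletools

# Opcodes that can import or call arbitrary objects during unpickling.
RISKY_OPCODES = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"}

def suspicious_opcodes(data: bytes) -> list[str]:
    """Return the names of risky opcodes found in a pickle, without loading it."""
    return [op.name for op, _arg, _pos in pickletools.genops(data) if op.name in RISKY_OPCODES]

class Malicious:
    def __reduce__(self):
        # Harmless stand-in: a real attack would call something far worse than print.
        return (print, ("code ran during unpickling",))

safe = pickle.dumps({"weights": [0.1, 0.2]})   # plain data only
unsafe = pickle.dumps(Malicious())             # embeds a callable to run on load

print(suspicious_opcodes(safe))
print(suspicious_opcodes(unsafe))
```

A scan like this only flags candidates for review; plain data structures produce no hits, while any pickle that reconstructs custom objects will, so the result is a starting point and not a verdict.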

16th Minute Newsletter 1.0.0

realkinetic • 0 implied HN points • 24 Jun 24

🕹 Technology Data Engineering

16th Minute newsletter covers a range of tech topics from compound AI systems to data structures.
AI development is shifting towards compound AI systems where operations and systems thinkers play vital roles.
Multi-tenancy in Kubernetes is an important area to explore for those working on enterprise software.

How to Automate Dataflow Flex-Template Deployments with GitLab CI/CD

realkinetic • 0 implied HN points • 23 Feb 24

🕹 Technology Data Engineering

Approaching data engineering like a software product and applying software engineering SDLC principles can help automate Google Cloud Dataflow deployments with GitLab CI/CD pipelines.
A Dataflow flex-template consists of a Dockerfile and a template specification JSON file, offering advantages like separating implementation from deployment and enabling different teams to work on the pipeline.
Using GitLab's CI/CD for deploying Dataflow flex-templates is beneficial due to its intuitive UI, CI Linting feature, out-of-the-box security, and environment integration tools.

Data Engineering: A Software Engineer’s Approach

realkinetic • 0 implied HN points • 25 Jan 24

🕹 Technology Data Engineering

The tech industry varies in its expectations of data engineers, leading to challenges in team performance and hiring.
Companies today need to be data-driven, utilizing modern data stack tools, which necessitates a blend of data engineering and software engineering skills.
Data engineering benefits from adopting software engineering principles like treating systems as products, clear communication, and implementing CI/CD pipelines.

ORMs do not ORM

Stateless Machine • 0 implied HN points • 10 Jul 24

🕹 Technology Data Engineering

There’s a debate about whether using an ORM is beneficial or not. Some people think it’s unnecessary and prefer to write SQL directly.
ORMs and raw SQL both try to solve similar problems but don’t actually provide a true 'mapping' between objects and database queries.
Query builders can be a good compromise, allowing easier SQL query creation while helping with the mapping between database and code.
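The query-builder compromise can be sketched in a few lines. The example below is an assumption-laden illustration, not the approach from the post: `Select` and `User` are hypothetical classes, and the builder only handles equality filters. It composes parameterized SQL and leaves the row-to-object mapping as an explicit, visible step.

```python
import sqlite3
from dataclasses import dataclass, field

@dataclass
class Select:
    """Hypothetical, minimal query builder: composes a parameterized SELECT."""
    table: str
    columns: tuple[str, ...] = ("*",)
    conditions: list[tuple[str, object]] = field(default_factory=list)

    def where(self, column: str, value: object) -> "Select":
        self.conditions.append((column, value))
        return self

    def build(self) -> tuple[str, list[object]]:
        # Table and column names are interpolated, so they must come from trusted code.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(f"{col} = ?" for col, _ in self.conditions)
        return sql, [value for _, value in self.conditions]

@dataclass
class User:
    id: int
    name: str

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, active INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [(1, "ada", 1), (2, "bob", 0)])

sql, params = Select("users", ("id", "name")).where("active", 1).build()
users = [User(*row) for row in conn.execute(sql, params)]  # explicit mapping step
```

The SQL stays inspectable as a plain string, and values travel as bound parameters instead of being spliced into the query text.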

AGI vs human-level AI, Analytics IPOs & Russia's AI-powered artillery 🔫🎖

Sector 6 | The Newsletter of AIM • 0 implied HN points • 30 Jan 22

🕹 Technology Data Engineering

AGI, or Artificial General Intelligence, is different from human-level AI. AGI aims to understand and learn any task just like a human, while human-level AI is designed for specific tasks.
Data engineering is becoming increasingly important for organizations to improve their data workflows. Efficient data handling can help businesses make better decisions.
Russia is using AI in its military applications, such as artillery. This shows how AI technology is being integrated into various sectors, including defense.

Data Engineer Salaries, QSim & Databricks 🗿📡🔮

Sector 6 | The Newsletter of AIM • 0 implied HN points • 05 Sep 21

🕹 Technology Data Engineering

Data engineer salaries are important to know if you're looking to enter this field. They can vary widely based on experience and location.
QSim is a quantum computing simulator toolkit that lets researchers write and test quantum programs without access to quantum hardware.
Databricks is a popular platform for data engineering that makes collaboration easier. It helps teams work together on large datasets.

Intel Outside, Become a Data Engineer & free access to courses

Sector 6 | The Newsletter of AIM • 0 implied HN points • 07 Feb 21

🕹 Technology Data Engineering

The Machine Learning Developers Summit is happening soon and will attract over 1000 participants.
The event includes tech talks, workshops, paper presentations, and a hackathon.
Sponsors play a big role in supporting initiatives that benefit the community.

Introducing The Beep

The Beep • 0 implied HN points • 01 Jan 24

🕹 Technology Data Engineering

The Beep is a newsletter about data technology and artificial intelligence. It aims to provide quality insights rather than just news and jargon.
The authors plan to cover a variety of topics, including large language models and image generation, with a mix of concepts, tutorials, and best practices.
Subscribers can choose between free and paid options, with paid subscribers getting full access to all content and tutorials with coding support.

DLD #1 | Data Landscape Digest 🗞️

Practical Data Engineering Substack • 0 implied HN points • 25 Aug 24

🕹 Technology Data Engineering

Data engineering is evolving rapidly, and staying updated on new tools and technologies is important for success in the field.
Mastering the fundamentals, like SQL and Python, is crucial as they form the foundation for using advanced tools effectively.
Open source solutions, like Apache Hudi and XTable, are gaining popularity and can provide great benefits for managing data efficiently.

Managing Dependencies Between Data Pipelines

Practical Data Engineering Substack • 0 implied HN points • 26 Aug 23

🕹 Technology Data Engineering

Managing dependencies between data pipelines is crucial for ensuring that upstream tasks are completed before downstream tasks start. This avoids issues with incomplete or faulty data.
There are different techniques to manage these dependencies, ranging from simple time-based scheduling to more complex orchestrations that adjust based on the successful completion of previous tasks.
Choosing the right method for managing pipeline dependencies depends on the complexity of the data workflows and the need for independence between different teams and tasks.
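One way to picture completion-based dependency management is a topological ordering of pipelines. The sketch below uses Python's standard `graphlib` module; the pipeline names and the `run_in_dependency_order` helper are hypothetical, and a real orchestrator would add retries, scheduling, and parallelism.

```python
from graphlib import TopologicalSorter

# Hypothetical pipelines: each key maps to the upstream pipelines it depends on.
dependencies = {
    "load_orders": set(),
    "load_customers": set(),
    "build_sales_mart": {"load_orders", "load_customers"},
    "refresh_dashboard": {"build_sales_mart"},
}

def run_in_dependency_order(deps, run):
    """Run each pipeline only after all of its upstream pipelines have finished."""
    completed = []
    for name in TopologicalSorter(deps).static_order():
        run(name)  # an exception here halts all remaining pipelines
        completed.append(name)
    return completed

order = run_in_dependency_order(dependencies, run=lambda name: None)
print(order)
```

Compared with time-based scheduling, nothing downstream starts on a guess about when upstream work usually ends.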

Internal Storage Design of Modern Key-value Database Engines [Part 4]

Practical Data Engineering Substack • 0 implied HN points • 19 Aug 23

🕹 Technology Data Engineering

LSM-Trees are designed to improve the performance of key-value databases, especially for write operations, but they can struggle with reading data quickly.
Innovations like separating keys from values in storage models, like WiscKey, help reduce I/O overhead and improve speed, particularly when using SSDs.
Using multi-channel SSDs can further boost performance for LSM-Trees, allowing for faster data processing and better overall efficiency.
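The write/read trade-off of LSM-Trees shows up even in a toy model. The following sketch is a simplified illustration under stated assumptions: `TinyLSM` is a hypothetical class that keeps everything in memory, never compacts, and does not implement WiscKey's key-value separation. Writes touch only the memtable, while a read may probe several sorted runs.

```python
import bisect

class TinyLSM:
    """Toy LSM-tree: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable: dict[str, str] = {}
        self.runs: list[list[tuple[str, str]]] = []  # newest run is last
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        # Writes only touch the memtable, which is why they are cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.runs.append(sorted(self.memtable.items()))  # "flush" to a sorted run
            self.memtable = {}

    def get(self, key: str):
        """Return (value, number_of_runs_probed)."""
        if key in self.memtable:
            return self.memtable[key], 0
        # Reads may have to probe every run, newest first.
        for probes, run in enumerate(reversed(self.runs), start=1):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1], probes
        return None, len(self.runs)

db = TinyLSM()
for k, v in [("a", "1"), ("b", "2"), ("a", "3"), ("c", "4"), ("d", "5")]:
    db.put(k, v)
```

Here `db.get("b")` has to look in two runs before finding its value, which is the read amplification that compaction and the storage optimizations in the post aim to reduce.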

[in case you missed it] Data Science Weekly - Issue 470

Data Science Weekly Newsletter • 0 implied HN points • 27 Nov 22

🕹 Technology Data Engineering

Recommender systems often focus on increasing user engagement, but this can lead to unintended negative effects like addiction. A new understanding of user preferences could help create better recommendations.
GitLab's Data Team Handbook shares valuable information on how data is used in various business functions. It's organized into helpful sections that explain dashboards, team operations, and current projects.
Deep learning is being used to test video games like Candy Crush for more human-like gameplay. This approach is explored by researchers from gaming companies, highlighting the potential for better game design.

[in case you missed it] Data Science Weekly - Issue 452

Data Science Weekly Newsletter • 0 implied HN points • 24 Jul 22

🕹 Technology Data Engineering

Data scientists are still in demand and well-paid, with job growth expected to continue into the future.
Large Language Models (LLMs) are playing a big role in innovation and are becoming a part of everyday life.
There's a growing need for domain experts in deep learning, allowing more people without advanced degrees to contribute to the field.

[In case you missed it] Data Science Weekly - Issue 450

Data Science Weekly Newsletter • 0 implied HN points • 10 Jul 22

🕹 Technology Data Engineering

AI forecasting contests are being used to predict future progress in AI, showing how forecasts can be evaluated based on actual results.
The demand for analytics engineers is growing, shifting from a less desirable role to one of great interest in the job market.
A new multilingual translation model called NLLB-200 translates across 200 languages, including many low-resource ones, making high-quality translation more accessible.

[in case you missed it] Data Science Weekly - Issue 448

Data Science Weekly Newsletter • 0 implied HN points • 26 Jun 22

🕹 Technology Data Engineering

Machine learning can help the IRS by better analyzing the large amount of tax data they collect, making tax enforcement more effective.
New models like Denoising Diffusion Probabilistic Models are showing great promise in generating high-quality images and audio from simpler inputs.
There is a focus on improving machine learning practices, such as being careful with training data and understanding how to boost model performance through proper methods.

[in case you missed it] Data Science Weekly - Issue 436

Data Science Weekly Newsletter • 0 implied HN points • 03 Apr 22

🕹 Technology Data Engineering

Aggregating data too much can hide important details. It's better to keep the complexity to find new insights.
Waymo is testing fully autonomous cars in San Francisco. This shows how self-driving technology is becoming part of everyday life.
Graph Neural Networks can handle missing information in data efficiently. They help make better use of connected data even when some details are missing.

[in case you missed it] Data Science Weekly - Issue 416

Data Science Weekly Newsletter • 0 implied HN points • 14 Nov 21

🕹 Technology Data Engineering

ML platforms are crucial for turning models into valuable tools, and each tech company has its own approach and tools to integrate machine learning effectively.
While Kubernetes has advantages for managing data engineering, it's not always necessary and can be frustrating for engineers just wanting to help the business use data better.
New large language models are emerging, making GPT-3 less unique; people are working on creating similar models that could soon be available.

[in case you missed it] Data Science Weekly - Issue 361

Data Science Weekly Newsletter • 0 implied HN points • 25 Oct 20

🕹 Technology Data Engineering

Data infrastructure is becoming more complex, focusing on how data is analyzed rather than just the software. It's important to understand the latest technologies and best practices in this area.
Many companies are using AI but only a small number see a real return on their investment. It's crucial to examine why some businesses succeed with AI while others struggle.
Machine learning models need to be effectively put into production to solve real problems. Deployment is just as important as building the model itself.

[in case you missed it] Data Science Weekly - Issue 333

Data Science Weekly Newsletter • 0 implied HN points • 12 Apr 20

🕹 Technology Data Engineering

Data science often doesn't meet expectations in the workplace due to misunderstandings about its role and challenges like lack of leadership and unclear impact.
Monitoring machine learning models in production is complex but important, and there are practical ways to start effectively tracking their performance.
Building effective data science platforms requires understanding the needs of data scientists to enhance collaboration and address the limits of local development.

GroupBy #24: Enabling near real-time data analytics on the data lake at Grab, Aligning Velox and Apache Arrow at Meta.

VuTrinh. • 0 implied HN points • 27 Feb 24

🕹 Technology Data Engineering

Grab is working on letting users analyze data quickly with their new approach to data lakes. This helps businesses get insights much faster.
Meta is aligning Velox and Apache Arrow to improve data management. This should make it easier to handle and analyze large amounts of data.
PayPal is using Spark 3 and NVIDIA's GPUs to cut their cloud costs by up to 70%. This helps them process a lot of data without spending too much money.

GroupBy #22: Data Engineering Landscape in 2024, how I scaled my $1m/year revenue startup's data model

VuTrinh. • 0 implied HN points • 13 Feb 24

🕹 Technology Data Engineering

The data engineering field is evolving, and it's important to understand the upcoming trends that will impact how we work with data.
Creating a simple and efficient data model is key for startups, but as they grow, it's crucial to adapt and scale the data model to meet new demands.
Learning SQL remains essential, as it is still a fundamental tool in data manipulation, making it important for anyone in the data field to master.

GroupBy #21: How to design resilient and large scale data systems, What Data Modeling is NOT

VuTrinh. • 0 implied HN points • 06 Feb 24

🕹 Technology Data Engineering

Designing data systems requires resilience and scalability, which means they should handle growth and failures efficiently.
Data modeling is more than just making diagrams; it's about understanding the entire system and how data flows within it.
Using tools like DuckDB in the browser can open up new possibilities for data processing, making it more accessible and flexible.

GroupBy #19: How Apple built iCloud to store billions of databases, Palette-Uber feature store, Definition of Data Modeling

VuTrinh. • 0 implied HN points • 23 Jan 24

🕹 Technology Data Engineering

Apple uses distributed databases like Cassandra and FoundationDB to manage iCloud's huge storage system. This helps them keep track of billions of databases effectively.
Uber created a feature store called Palette that helps in managing data for machine learning projects. It collects and organizes useful features for easy access by developers.
Data modeling is a key concept that defines how data is organized and related in a system. Different experts might have varying definitions, showing the complexity of the topic.

GroupBy #15: How Meta built the infrastructure for Threads, Notion's data scale journey

VuTrinh. • 0 implied HN points • 26 Dec 23

🕹 Technology Data Engineering

Meta created a strong infrastructure for Threads to handle massive user growth right after its launch. This enabled over 100 million sign-ups in just five days.
Notion's data infrastructure had to evolve to keep up with its rapid growth and new product uses. This involved significant changes to manage their increasing data scale.
The 'Grokking Concurrency' book is a helpful resource for learning about concurrent programming. It makes complex topics easier to understand with clear examples.

GroupBy #11: Python at Meta, Netflix Incremental Processing with Apache Iceberg, 2023 AI year in brief

VuTrinh. • 0 implied HN points • 28 Nov 23

🕹 Technology Data Engineering

Meta is working on improving how developers use Python, making it smoother with better tools like a new linter.
Netflix has built a system for processing data incrementally using Apache Iceberg, which helps manage and update data efficiently.
There are free courses available from Microsoft and Google Cloud that teach the basics of Generative AI, helping anyone to get started in this exciting field.

GroupBy #10: Netflix's Psyberg, Parquet format, SQL is not Designed for Analytics

VuTrinh. • 0 implied HN points • 21 Nov 23

🕹 Technology Data Engineering

Netflix's Psyberg is an incremental data processing framework built to handle membership data. It makes pipelines more efficient by processing only new or changed data.
The Parquet format is great for storing data because it organizes information in a smart way. It can improve how quickly and easily data is accessed and processed.
SQL isn't the best tool for doing analytics because it was designed a long time ago. There are newer tools that fit analytics needs much better.

GroupBy #9: FDAP stack, Iceberg and Hudi ACID Guarantees, Data Driven Management

VuTrinh. • 0 implied HN points • 14 Nov 23

🕹 Technology Data Engineering

The FDAP stack (Flight, DataFusion, Arrow, and Parquet) is a foundation for building reliable data systems. Its open-source Apache components help manage data more efficiently.
Learning about data quality is crucial. It ensures that the information used for decision-making is accurate and trustworthy.
Data-driven management is all about making decisions based on solid data insights. It helps businesses understand what works and what doesn't.

GroupBy #8: Demystifying the Parquet File, the future of the data engineer, intro to data modeling.

VuTrinh. • 0 implied HN points • 06 Nov 23

🕹 Technology Data Engineering

The Parquet file format is becoming popular for data storage because it is efficient and works well with big data tools. Understanding how to use it can help data engineers be more effective.
Data engineering is evolving, and new trends like data mesh are changing how data platforms are built. Keeping up with these changes is important for anyone in the field.
Starting a small data engineering project can be a great way to learn new skills. Even a quick project can teach you important techniques, like web scraping and using cloud storage.

GroupBy #4: Polars and Pandas, 1.8 trillion events, data quality

VuTrinh. • 0 implied HN points • 10 Oct 23

🕹 Technology Data Engineering

Polars and Pandas are tools for data processing, but they have different performance levels. Understanding when to use each can help manage large datasets better.
Data quality is crucial for successful data engineering. Companies like Google and Uber have strategies in place to ensure their data is accurate and reliable.
Learning SQL execution order can really help in data tasks. It outlines the steps SQL takes to process a query, which is key for optimizing database interactions.
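The logical SQL execution order can be seen in a small query run through Python's built-in `sqlite3` module. The `orders` table and its rows are made up for illustration, and the numbered comments describe the logical order of clauses, not the physical plan an engine may choose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ada", 50), ("ada", 5), ("bob", 30), ("bob", 40), ("cy", 5)],
)

query = """
SELECT customer, SUM(amount) AS total   -- 5. compute output columns
FROM orders                             -- 1. pick the source rows
WHERE amount >= 10                      -- 2. filter individual rows
GROUP BY customer                       -- 3. form groups
HAVING SUM(amount) > 52                 -- 4. filter whole groups
ORDER BY total DESC                     -- 6. sort the result
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Because `WHERE` runs before grouping, ada's 5-unit order is dropped first and her total is 50, so the `HAVING` filter removes her; had the row filter run after aggregation, her total of 55 would have passed.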

GroupBy #3

VuTrinh. • 0 implied HN points • 22 Sep 23

🕹 Technology Data Engineering

Docker commands can be simplified with a cheat sheet, making it easier for developers to use container technologies effectively.
Apache Spark was created at UC Berkeley to improve cluster computing, focusing on faster interactive computations than previous systems like Hadoop.
There are key differences between HDFS and S3, especially in how they handle data, and many people confuse them even though they serve different purposes.

GroupBy #2

VuTrinh. • 0 implied HN points • 15 Sep 23

🕹 Technology Data Engineering

The Lakehouse concept combines the best features of data lakes and data warehouses. It's a new way to manage and analyze data effectively.
Good data quality is essential for making AI work. If the data is bad, the results will also be poor.
AI tools might help data teams work more efficiently, but they won't reduce the demand for data professionals. In fact, they might increase it.

Mastering Apache Spark Performance: A Data Engineer's Guide to Optimization

DataSketch’s Substack • 0 implied HN points • 14 Oct 24

🕹 Technology Data Engineering

Properly configuring resources in Spark is really important. Make sure you adjust settings like memory and cores to fit your cluster's total resources.
Good data partitioning helps Spark job performance a lot. For example, repartitioning your data based on a relevant column can lead to faster processing times.
Using broadcast joins can save time and reduce workload. When joining smaller tables, broadcasting can make the process much quicker.
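The idea behind a broadcast join can be shown without a cluster. The sketch below is plain Python, not Spark's API: the data and the `broadcast_join` helper are hypothetical. The small table is copied to every partition as a lookup, so the rows of the large table never move. In PySpark, the corresponding hint is `pyspark.sql.functions.broadcast`.

```python
# Large fact table, already split into partitions (as it would be on a cluster).
order_partitions = [
    [(1, "US", 20.0), (2, "DE", 35.0)],
    [(3, "US", 12.5), (4, "FR", 8.0)],
]
# Small dimension table: cheap to copy to every worker.
countries = [("US", "United States"), ("DE", "Germany")]

def broadcast_join(partitions, small_table):
    """Inner-join each partition against a full copy of the small table."""
    lookup = dict(small_table)  # the "broadcast" copy
    joined = []
    for part in partitions:
        # Each partition joins locally; no rows of the large table are shuffled.
        joined.append(
            [(oid, lookup[code], amt) for oid, code, amt in part if code in lookup]
        )
    return joined

result = broadcast_join(order_partitions, countries)
print(result)
```

The approach only pays off while the small table fits comfortably in each worker's memory.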

Mastering Window Functions: Your Gateway to Advanced SQL Analytics

DataSketch’s Substack • 0 implied HN points • 07 Oct 24

🕹 Technology Data Engineering

Window functions let you do calculations across rows related to your current row without losing any details. This helps you get both summarized and detailed data at the same time.
Using window functions can make complex data tasks easier, like ranking items or finding running totals. They are very helpful in fields like healthcare to analyze patient data and improve efficiency.
It's important to test how window functions perform on a smaller dataset before using them widely. Combining multiple window functions and partitioning your data smartly can also boost performance.
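A short sketch shows how window functions keep row-level detail while adding aggregates. It uses Python's built-in `sqlite3` module and assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added; the `visits` table and its values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient TEXT, day INTEGER, cost INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [("p1", 1, 100), ("p1", 2, 50), ("p2", 1, 70), ("p2", 3, 30)],
)

rows = conn.execute(
    """
    SELECT patient, day, cost,
           -- running total per patient, ordered by day
           SUM(cost) OVER (PARTITION BY patient ORDER BY day) AS running_cost,
           -- rank of each visit by cost across all patients
           RANK() OVER (ORDER BY cost DESC) AS cost_rank
    FROM visits
    ORDER BY patient, day
    """
).fetchall()
for row in rows:
    print(row)
```

Every input row is still present in the output, unlike a `GROUP BY`, which would collapse each patient to a single row.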

Harnessing Data Architecture: Practical Models and SQL Solutions

DataSketch’s Substack • 0 implied HN points • 26 Mar 24

🕹 Technology Data Engineering

Creating effective data models is crucial for businesses to organize and use their data efficiently.
Different industries like eCommerce, healthcare, and retail have unique data needs that can be addressed with tailored database solutions.
Understanding SQL and how to create tables and relationships helps in developing strong data architecture.

Data Modeling for Data Engineering

DataSketch’s Substack • 0 implied HN points • 18 Mar 24

🕹 Technology Data Engineering

Data modeling is like creating a map for organizing and finding data easily. It helps keep everything tidy and accessible.
There are three types of data models: conceptual, logical, and physical, each serving different levels of detail in planning data structure.
A practical example is organizing a library, where the models help define books, authors, and loans, ensuring everything links and works smoothly.
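The library example can be taken down to the physical level with a small schema. This is one possible sketch, not the model from the post: the table and column names are assumptions, and SQLite via Python's `sqlite3` module stands in for whatever database a real project would use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE books (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES authors(author_id)
    );
    CREATE TABLE loans (
        loan_id   INTEGER PRIMARY KEY,
        book_id   INTEGER NOT NULL REFERENCES books(book_id),
        borrower  TEXT NOT NULL,
        loaned_on TEXT NOT NULL
    );
    """
)
conn.execute("INSERT INTO authors VALUES (1, 'Ursula K. Le Guin')")
conn.execute("INSERT INTO books VALUES (10, 'A Wizard of Earthsea', 1)")
conn.execute("INSERT INTO loans VALUES (100, 10, 'sam', '2024-03-18')")

# A loan for a book that does not exist is rejected by the foreign key.
try:
    conn.execute("INSERT INTO loans VALUES (101, 999, 'kim', '2024-03-19')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

report = conn.execute(
    "SELECT a.name, b.title, l.borrower "
    "FROM loans l JOIN books b USING (book_id) JOIN authors a USING (author_id)"
).fetchall()
```

The entities and relationships come from the conceptual and logical models; the column types and constraints are the physical model's contribution.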

Dataflow 101: Exploring Essential Modes for Efficient Applications

DataSketch’s Substack • 0 implied HN points • 13 Feb 24

🕹 Technology Data Engineering

Databases are key for storing and managing data, supporting both everyday transactions and complex analysis. Using them effectively helps data engineers connect different platforms and applications.
Different data transfer methods, like REST and RPC, help systems communicate efficiently, just like a well-organized library or a quick phone call. Choosing the right method depends on the speed and precision needed for the task.
Message-passing systems allow for flexible and real-time data processing, making them great for applications like IoT or e-commerce. They help ensure communications between services happen smoothly and reliably.
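Message passing can be illustrated in miniature with a producer and a consumer that share nothing but a queue. The sketch below uses Python's standard `queue` and `threading` modules as a stand-in for a real message broker; the `sensor-1` device and its readings are hypothetical.

```python
import queue
import threading

events = queue.Queue()
processed = []

def consumer():
    while True:
        msg = events.get()      # blocks until a message arrives
        if msg is None:         # sentinel: the producer is done
            break
        processed.append(f"{msg['device']}:{msg['reading']}")

worker = threading.Thread(target=consumer)
worker.start()

# The producer never calls the consumer directly; it only publishes messages.
for reading in (21.5, 22.0):
    events.put({"device": "sensor-1", "reading": reading})
events.put(None)
worker.join()

print(processed)
```

Because the two sides agree only on the message shape, either one can be replaced or scaled without changing the other, which is the decoupling that makes this mode attractive for IoT and e-commerce workloads.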