The hottest Data Engineering Substack posts right now

And their main takeaways

Should You Build a Custom Data Orchestration System? Here’s What to Consider

SeattleDataGuy’s Newsletter • 317 implied HN points • 23 Oct 24

🕹 Technology Data Engineering

Building your own data orchestration system can lead to many challenges, like handling dependencies and scheduling tasks correctly. It's important to consider whether it's really necessary or whether existing tools would work better.
A custom orchestrator needs to manage various functions like logging, alerting, and integrating with other tools. Without proper features, it can become complex and hard to maintain.
Before you decide to create your own solution, consider what makes it different and better than what's already available. Make sure to also think about how you’ll get people to use your new system.
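
To make the dependency-handling challenge concrete, here is a minimal sketch of the ordering problem a home-grown orchestrator has to solve. The task names are hypothetical and not taken from the post; the ordering relies on Python's standard-library `graphlib` (3.9+). It leaves out scheduling, retries, logging, and alerting, which is where most of the maintenance cost the post warns about would come from.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "publish_report": {"join_orders_customers"},
}


def execution_order(dependencies):
    """Return task names in an order that respects every dependency."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


order = execution_order(PIPELINE)
print(order)
```

Both extract tasks always come before the join, and the report always runs last; the relative order of the two extract tasks is not guaranteed.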

Data Science Weekly - Issue 486

Data Science Weekly Newsletter • 359 implied HN points • 17 Mar 23

🕹 Technology Data Engineering

AI and data science are evolving rapidly, making it challenging for many to keep up. It's common for professionals to feel overwhelmed as they try to understand new advancements.
There's a growing discussion about whether we should slow down AI development. Some people believe we need to pause and figure out the implications of current technologies before moving forward.
Many professionals are exploring career shifts between data science and data engineering. It's important to consider personal interests and skills when deciding which path to take.

Data Science Weekly - Issue 565

Data Science Weekly Newsletter • 1 HN point • 19 Sep 24

🕹 Technology Data Engineering

Reading The Data Science Weekly is a great way to stay updated on AI and machine learning topics. It shares links, news, and resources that can help anyone interested in these fields.
There are many useful techniques in data science, like the Hampel Filter for outlier detection, which can help improve data quality. Exploring these methods can really enhance your understanding and skills.
Effective communication is crucial in data science. How you explain your findings can significantly impact your career, so it's important to work on your communication skills.
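
As an illustration of the Hampel filter mentioned above, the following is a minimal pure-Python sketch, not the implementation from the linked resource. It flags a point as an outlier when it deviates from the median of its sliding window by more than `n_sigmas` scaled median absolute deviations (MAD). The window size, threshold, and sample readings are assumptions chosen for the example.

```python
from statistics import median


def hampel_filter(values, window=3, n_sigmas=3.0):
    """Replace outliers with the local median; return (cleaned, outlier_indices)."""
    k = 1.4826  # scales the MAD to approximate a standard deviation for Gaussian data
    cleaned = list(values)
    outliers = []
    for i in range(len(values)):
        # Window of up to `window` neighbours on each side, truncated at the edges.
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        neighborhood = values[lo:hi]
        local_median = median(neighborhood)
        mad = median(abs(x - local_median) for x in neighborhood)
        if abs(values[i] - local_median) > n_sigmas * k * mad:
            cleaned[i] = local_median
            outliers.append(i)
    return cleaned, outliers


# Made-up sensor readings with one obvious spike at index 4.
readings = [10.0, 10.2, 9.9, 10.1, 55.0, 10.0, 9.8, 10.3, 10.1]
cleaned, outliers = hampel_filter(readings)
print(outliers, cleaned[4])
```

Because the filter uses medians rather than means, the spike itself barely shifts the local estimate it is compared against. Note that when a window's MAD is zero, any nonzero deviation in that window is flagged.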

Why Apache Iceberg is heralding a new era of change in Data Engineering

The Orchestra Data Leadership Newsletter • 59 implied HN points • 20 Mar 24

🕹 Technology Data Engineering

Apache Iceberg introduces the Bring Your Own Storage (BYOS) concept, which is gaining popularity for efficient and reliable data management in distributed environments.
Key features of Apache Iceberg include Atomic Transactions, Schema Evolution, Partitioning and Sorting, Time Travel, Incremental Data Updates, Metadata Management, and Compatibility with various data processing frameworks.
Platforms like Snowflake are shifting towards supporting Iceberg due to its benefits in handling data efficiently and enabling a Bring Your Own Storage pattern.

I spent another 6 hours understanding the design principles of Snowflake. Here's what I found

VuTrinh. • 79 implied HN points • 10 Feb 24

🕹 Technology Data Engineering

Snowflake separates storage and compute, allowing for flexible scaling and improved performance. This means that data storage can grow separately from computing power, making it easier to manage resources.
Data can be stored in a cloud-based format that supports both structured and semi-structured data. This flexibility allows users to easily handle various data types without needing to define a strict schema.
Snowflake implements unique optimization techniques, like data skipping and a push-based query execution model, which enhance performance and efficiency when processing large amounts of data.
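
The idea behind data skipping can be sketched with per-partition min/max metadata. This is an illustration of the general technique only, not Snowflake's actual storage format or metadata layout; the `Partition` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    rows: list
    min_value: int  # smallest value stored in this partition
    max_value: int  # largest value stored in this partition


def build_partitions(values, size):
    """Split values into fixed-size partitions and record min/max for each."""
    partitions = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        partitions.append(Partition(chunk, min(chunk), max(chunk)))
    return partitions


def scan_equal(partitions, target):
    """Return (matching rows, number of partitions actually read)."""
    matches, scanned = [], 0
    for partition in partitions:
        if target < partition.min_value or target > partition.max_value:
            continue  # skipped using metadata alone, rows are never touched
        scanned += 1
        matches.extend(v for v in partition.rows if v == target)
    return matches, scanned


partitions = build_partitions(list(range(100)), size=10)
matches, scanned = scan_equal(partitions, 42)
print(matches, scanned, len(partitions))
```

With sorted input, only one of the ten partitions is read for this lookup. The benefit shrinks when values are scattered so that every partition's min/max range covers the target.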

Data Science Weekly - Issue 503

Data Science Weekly Newsletter • 219 implied HN points • 14 Jul 23

🕹 Technology Data Engineering

Machine learning is making its way into finance, and researchers are identifying practical uses for it. This can help finance professionals learn new tools and statisticians find interesting financial problems to solve.
AI platforms, like social media, are becoming crucial in our lives but can be confusing and unreliable. People are figuring out how to use these platforms effectively despite their unpredictability.
Large language models are changing how data scientists work. These models can automate many tasks, allowing data scientists to focus on managing and assessing the AI's outputs.

The "Brittleness" Problem in Data Pipelines.

Data Engineering Central • 157 implied HN points • 24 Apr 23

🕹 Technology Data Engineering

Brittleness in data pipelines can lead to various issues like data quality problems, difficult debugging, and slow performance.
To overcome brittle pipelines, focus on addressing data quality issues through monitoring, sanity checks, and using tools like Great Expectations.
Development issues such as lack of tests, poor documentation, and bad code practices contribute to brittle pipelines; implementing best practices like unit testing and Docker can help improve pipeline reliability.
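
A sanity check of the kind described above can be very small. The sketch below is plain Python and deliberately does not use the Great Expectations API; the column names and the sample batch are hypothetical.

```python
def check_rows(rows, required, non_negative):
    """Return a list of human-readable problems found in a batch of row dicts."""
    problems = []
    if not rows:
        problems.append("batch is empty")
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: missing {col}")
        for col in non_negative:
            value = row.get(col)
            if value is not None and value < 0:
                problems.append(f"row {i}: negative {col}")
    return problems


# Made-up batch with one missing key and one negative amount.
batch = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": None, "amount": 5.00},
    {"order_id": 3, "amount": -2.50},
]
problems = check_rows(batch, required=["order_id", "amount"], non_negative=["amount"])
print(problems)
```

Running a check like this before loading a batch, and failing loudly when the list is non-empty, surfaces bad data at the point where it enters the pipeline rather than in a downstream report.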

DSA For The Rest Of Us - Part 1

Data Engineering Central • 157 implied HN points • 13 Mar 23

🕹 Technology Data Engineering

Understanding Data Structures and Algorithms is important for becoming a better engineer, even if you may not use them daily.
Linked Lists are a linear data structure where elements are not stored contiguously in memory but are linked using pointers.
Creating a simple Linked List in Rust involves defining nodes with values and pointers to other nodes, creating a LinkedList to hold these nodes, and then linking them to form a chain.
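
The post builds its linked list in Rust. The sketch below shows the same node-and-pointer structure in Python instead, so it is an analogue rather than the post's code, and the class and method names are assumptions.

```python
class Node:
    """One element of the chain: a value plus a reference to the next node."""

    def __init__(self, value):
        self.value = value
        self.next = None  # None marks the tail of the list


class LinkedList:
    """Holds the head node; every other node is reached by following links."""

    def __init__(self):
        self.head = None

    def push_front(self, value):
        node = Node(value)
        node.next = self.head  # new node points at the old head
        self.head = node

    def __iter__(self):
        current = self.head
        while current is not None:
            yield current.value
            current = current.next


chain = LinkedList()
for value in (3, 2, 1):
    chain.push_front(value)
print(list(chain))
```

Inserting at the front takes constant time because only the head reference changes, while reaching the n-th element requires walking n links. In Rust the `next` field would typically need an explicit owning pointer type, which this Python version hides.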

Data Science Weekly - Issue 505

Data Science Weekly Newsletter • 199 implied HN points • 28 Jul 23

🕹 Technology Data Engineering

Large language models use complex methods like word vectors and transformers to understand language, but this can be explained simply without heavy math. They need a lot of data to perform well.
Using AI tools like ChatGPT for real-world programming tasks can streamline the coding process, as it allows for a more focused workflow without switching between different resources.
Building effective data storage systems, like Amazon S3, involves overcoming interesting challenges and nuances, demonstrating the amazing technology behind big data management.

Data Science Weekly - Issue 489

Data Science Weekly Newsletter • 299 implied HN points • 06 Apr 23

🕹 Technology Data Engineering

Understanding linear programming can help solve complex problems using Python. It's useful in various fields and can optimize outcomes.
MLOps is closely related to data engineering, showing that managing data for machine learning involves more engineering than initially thought.
The new pandas 2.0 version has exciting features like the Apache Arrow backend, which will enhance its performance and capabilities.

Data Engineering Vs Machine Learning Pipelines

SeattleDataGuy’s Newsletter • 1048 implied HN points • 11 Apr 23

🕹 Technology Data Engineering

Data engineering and machine learning pipelines are essential components for every company, and they are often confused even though they have different objectives.
Data engineering pipelines involve data collection, cleaning, integration, and storage, while machine learning pipelines focus on data cleaning, feature engineering, model training, evaluation, registry, deployment, and monitoring.
Both data and ML pipelines require careful consideration of computational needs to handle sudden changes, and understanding the differences between them is important for effective data processing and decision-making.

Data Science Weekly - Issue 485

Data Science Weekly Newsletter • 319 implied HN points • 09 Mar 23

🕹 Technology Data Engineering

The newsletter shares interesting links about data science, machine learning, and AI each week. It’s a good way to keep up with new trends and knowledge in the field.
There's a discussion on what databases should do but often don’t. Understanding these gaps can help you improve your data projects by knowing what to build yourself.
AI's impact on jobs and industries is being researched, especially how language models like ChatGPT could change certain occupations. It's important to understand how AI can affect your career choices.

Data Science Weekly - Issue 500

Data Science Weekly Newsletter • 219 implied HN points • 23 Jun 23

🕹 Technology Data Engineering

AI technology is advancing quickly and can even cover public meetings, but we need to think carefully about its readiness for everyday use.
Engineers can improve their people skills and interactions by applying the same problem-solving mindset they use in their technical work.
Generative AI is becoming important in data science for creating synthetic data, which helps in privacy and enhances analysis without losing useful information.

Data Science Weekly - Issue 499

Data Science Weekly Newsletter • 219 implied HN points • 16 Jun 23

🕹 Technology Data Engineering

Using large language models can help kids learn to ask curious questions by automating the teaching process.
New techniques for 3D space reconstruction can make indoor views on platforms like Google Maps look more realistic and interactive.
There's a growing need to understand the value of personal data in online shopping, especially as new regulations come into play.

The hidden costs of pre-computing data | Chalk's Elliot Marx

Dev Interrupted • 14 implied HN points • 09 Dec 25

🕹 Technology Data Engineering

Pre-computing and storing large volumes of derived data wastes money and adds latency because most of it is never used. Shifting to real-time, incremental pipelines means you only compute what users actually need.
Owning the full stack (hardware, training, and cloud) creates a competitive moat and can change leaderboard dynamics quickly. Design your systems to be model-agnostic and flexible so you don’t get locked into one provider.
Typical engineering metrics like velocity or lines of code are often misleading; measure what exposes real friction, bottlenecks, and business outcomes. Use metrics to make the system legible and actionable, not just to produce executive reports.

Future proof yourself against AI.

Data Engineering Central • 137 implied HN points • 20 Mar 23

🕹 Technology Data Engineering

Future proof yourself against AI to stay relevant in the changing landscape of software engineering.
There are three types of people when it comes to AI and programming: those who don't use AI and dismiss it, those who use it to enhance their work, and those who rely on it completely and may become less effective engineers.
The impact of AI on software engineering is inevitable and will lead to changes in the field over time.

CELEBRATING 3K+ SUBSCRIBERS!

Data Engineering Central • 137 implied HN points • 19 May 23

🕹 Technology Data Engineering

The newsletter celebrates reaching 3K+ subscribers.
It offers a 50% discount on subscriptions.
It encourages readers to take a break and enjoy life outside.

My Love, Hate Relationship.

Data Engineering Central • 137 implied HN points • 24 Jul 23

🕹 Technology Data Engineering

Data Engineers may have a love-hate relationship with AWS Lambdas: they are versatile but have occasional limitations.
AWS Lambdas are under-utilized in Data Engineering but offer benefits like cheap solutions, ease of use, and driving better practices.
AWS Lambdas are handy for processing small datasets, running data quality checks, and executing quick logic while reducing architecture complexity and cost.

SQL Joins + Where Clauses.

Data Engineering Central • 137 implied HN points • 03 Jul 23

🕹 Technology Data Engineering

SQL Joins and Where Clauses are important in data processing.
Used together, they are a powerful tool.
Be cautious when combining SQL Joins and Where Clauses to avoid pitfalls.
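
The summary does not say which pitfalls the post covers. One commonly cited one, shown here as an assumption about what is meant, is that a `WHERE` filter on the right-hand table of a `LEFT JOIN` silently drops the unmatched rows the join was meant to keep. The sketch uses Python's built-in `sqlite3` module with hypothetical tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 'shipped');
""")

# Filter in WHERE: Grace's row has o.status = NULL, so the predicate removes it.
filter_in_where = conn.execute("""
    SELECT c.name, o.id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.status = 'shipped'
    ORDER BY c.id
""").fetchall()

# Filter in ON: the condition only limits which orders match, so Grace is kept.
filter_in_on = conn.execute("""
    SELECT c.name, o.id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id AND o.status = 'shipped'
    ORDER BY c.id
""").fetchall()

print(filter_in_where)
print(filter_in_on)
```

The first query behaves like an inner join and returns only Ada; the second returns both customers, with `None` for Grace's missing order.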

MLOps 101 - Feature Stores

Data Engineering Central • 137 implied HN points • 12 Jun 23

🕹 Technology Data Engineering

Feature Stores are essential in machine learning for managing and serving features.
Feature Stores provide consistency, reusability, efficiency, discoverability, and monitoring benefits.
Popular Feature Store options include Databricks Feature Stores, Feast (open-source), Postgres, DynamoDB, and S3.

Normalization Vs Denormalization - Taking A Step Back

SeattleDataGuy’s Newsletter • 612 implied HN points • 21 Nov 23

🕹 Technology Data Engineering

Normalization structures data to reduce duplication and ensure integrity.
Goals of normalization include eliminating redundancy, minimizing data mutation issues, and protecting data integrity.
Denormalization introduces redundancy strategically to improve read performance, useful for reporting, analytics, and read-heavy applications.

GroupBy #30: Uber- How LedgerStore Supports Trillions of Indexes, Composable Data Systems: Lessons from Apache Calcite Success

VuTrinh. • 39 implied HN points • 09 Apr 24

🕹 Technology Data Engineering

LedgerStore at Uber can handle trillions of indexes, making it a powerful tool for managing large-scale data efficiently.
Apache Calcite helps build flexible data systems with strong query optimization features, which are vital for many data applications.
Spotify's data platform plays a critical role in their operations, guiding how to build effective data systems in organizations.

People Without Dirty Hands are Wrong

Sung’s Substack • 79 implied HN points • 02 Jan 24

🕹 Technology Data Engineering

Having dirty hands from diving into actual projects is important for growth, rather than just focusing on certifications or theory.
Solving real problems in public and getting your hands dirty in open source can have a significant impact on your career, surpassing the importance of certifications.
Engaging in hands-on experience and collaborating on projects that matter can lead to valuable personal growth and career advancement.

Becoming A Better Data Engineer - Tips On Translating Business Requirements

SeattleDataGuy’s Newsletter • 671 implied HN points • 24 Aug 23

🕹 Technology Data Engineering

Understand the business requirements before building technical solutions.
Ask lots of questions to clarify what the business really needs.
Create visuals or prototypes to better communicate and iterate on business requirements.

Data Science Weekly - Issue 501

Data Science Weekly Newsletter • 179 implied HN points • 30 Jun 23

🕹 Technology Data Engineering

Data scientists are sharing tips on how to make their scientific data more accessible and useful. This helps others to understand and use the data better.
There are many discussions happening about the benefits and drawbacks of large language models (LLMs) like ChatGPT. Some people believe they are amazing, while others think they aren't very helpful.
Naming things in programming can be tough, but there are resources and books that can help. Learning the right naming conventions can improve coding practices.

The Future of Search and How You Can Shape It

Gradient Flow • 199 implied HN points • 23 Feb 23

🕹 Technology Data Engineering

The blend of artificial intelligence and chatbot interfaces, as seen in ChatGPT, is transforming search applications, with startups emphasizing large language models for better search experiences.
Expectations around user interactions with company websites are changing with the rise of chatbot-equipped search engines, requiring integration of AI and foundation models for improved responses incorporating text, images, videos, and audio.
Data and AI teams are crucial in developing, testing, and maintaining next-generation search applications, with companies likely seeking more control over their data and the potential creation of custom models for enhanced privacy and innovation.

The best kept secret to ML success.

Data Engineering Central • 117 implied HN points • 17 Apr 23

🕹 Technology Data Engineering

The best-kept secret to ML success is Databricks + Delta Lake.
There's an overload of Machine Learning content, and not all of it is valuable.
Consider a 7-day free trial to learn more about the best-kept secret to ML success.

Why Low-Code/No-Code Tools Accelerate Risk

SeattleDataGuy’s Newsletter • 412 implied HN points • 07 Mar 24

🕹 Technology Data Engineering

Low-code solutions can speed up workflow development, but they also speed up the creation of mistakes.
Adopting developer processes for low-code projects can enhance their effectiveness.
Low-code workflows impact the overall system and require considerations for maintenance, monitoring, and future implications.

Flow State Data Engineering: I want it to be normal

Sung’s Substack • 139 implied HN points • 14 Mar 23

🕹 Technology Data Engineering

Data engineering involves many tedious tasks and manual checks, hindering the ability to reach a state of flow.
Software engineers have smoother workflows and better tools compared to data engineers, allowing them to focus on their work and enjoy the process.
There is potential to improve the data engineering workflow by implementing real-time monitoring, interactive previews, and streamlined processes to enhance the experience.

GroupBy #26: How GitHub uses merge queue to ship hundreds of changes every day, Data governance in the age of generative AI, "Good Enough" Data Models

VuTrinh. • 39 implied HN points • 12 Mar 24

🕹 Technology Data Engineering

GitHub uses a merge queue system that helps them quickly ship many code changes each day. This makes their deployment process faster and more efficient.
Data governance is becoming really important, especially with the rise of generative AI. Companies need to ensure the data used by these systems is accurate and secure.
The idea of 'Good Enough' data models suggests that it's okay to have models that meet basic needs instead of striving for perfection. This approach can save time and resources.

You don't know this for sure: How BigQuery stores semi-structured data?

VuTrinh. • 59 implied HN points • 13 Jan 24

🕹 Technology Data Engineering

BigQuery uses a method based on definition and repetition levels to store nested and repeated data efficiently. This allows for reading specific parts of data without needing to access other related data.
In columnar storage, data is organized by columns which can improve performance, especially for analytical queries, because only the needed columns are loaded.
Using this method might increase file sizes due to redundancy, but it helps reduce the input/output operations needed when accessing nested fields.
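
To show what definition and repetition levels record, here is a minimal sketch for the simplest case: a single repeated field per record. It illustrates the encoding idea only and is not BigQuery's actual storage code; the function names and sample records are hypothetical. The repetition level is 0 when a value starts a new record and 1 when it continues the current list; the definition level is 1 when a value is present and 0 when the list is empty.

```python
def shred(records):
    """Flatten lists of tags into (value, repetition_level, definition_level) triples."""
    column = []
    for tags in records:
        if not tags:
            # Empty list: store a placeholder so the record boundary is not lost.
            column.append((None, 0, 0))
            continue
        for position, tag in enumerate(tags):
            repetition = 0 if position == 0 else 1
            column.append((tag, repetition, 1))
    return column


def assemble(column):
    """Rebuild the original records from the flat column alone."""
    records = []
    for value, repetition, definition in column:
        if repetition == 0:
            records.append([])  # repetition level 0 starts a new record
        if definition == 1:
            records[-1].append(value)
    return records


records = [["sql", "etl"], [], ["spark"]]
column = shred(records)
print(column)
print(assemble(column))
```

The extra level values are the redundancy the last takeaway refers to: they cost space, but they let a reader reconstruct this one column without touching any other field of the record. Deeper nesting uses larger level values following the same principle.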

Why you need a Data Catalog to build Data Products

The Orchestra Data Leadership Newsletter • 79 implied HN points • 26 Nov 23

🕹 Technology Data Engineering

Data catalogs are not just for enterprises but also benefit startups by driving business value.
Data catalogs help organizations manage and present their data assets in a user-friendly way for better adoption and value extraction.
Using data catalogs can simplify data access, encourage collaboration between technical and business users, and potentially enhance BI functionalities within organizations.

Research -> Reality. AI Engineers.

potentialmind • 19 implied HN points • 18 May 24

🕹 Technology Data Engineering

The demand for AI Engineers is skyrocketing due to advancements in AI, making it one of the most sought-after engineering jobs of the decade.
To excel in AI Engineering, practical knowledge and hands-on experience are prioritized over traditional academic qualifications like PhDs or specific courses, such as ones on PyTorch.
Modern applied AI is changing the landscape, making it easier for software engineers and product managers to leverage large language models and AI frameworks without extensive data collection.

Becoming a Data Engineering Force Multiplier

SeattleDataGuy’s Newsletter • 671 implied HN points • 23 Apr 23

🕹 Technology Data Engineering

Data engineering is crucial in today's data-driven landscape, with a growing demand for skilled professionals.
Developing technical skills like architecture, data modeling, coding, testing, and CI/CD is essential for becoming a successful data engineer.
Non-technical skills such as teaching, long-term project planning, and communication are equally important for data engineers to excel and become force multipliers.

Using The Cloud As A Data Engineer

SeattleDataGuy’s Newsletter • 365 implied HN points • 09 Feb 24

🕹 Technology Data Engineering

Cloud service providers like AWS offer various services for data engineers and scientists.
Lambdas and serverless functions in AWS can automate tasks without complex data pipelines.
Amazon Athena provides serverless querying capabilities over data stored in Amazon S3.

GroupBy #33: Data Gateway - A Platform for Growing and Protecting the Data Tier at Netflix, The Cloud Storage Triad: Latency, Cost, Durability

VuTrinh. • 19 implied HN points • 30 Apr 24

🕹 Technology Data Engineering

Netflix has created a platform called Data Gateway that helps their developers manage data more easily. It simplifies complex database processes so that app developers can focus on coding.
The cloud storage triad talks about balancing latency, cost, and durability when storing data. Choosing the right storage solution can save money while ensuring data is always available.
Managing data ingestion effectively is crucial for companies like RevenueCat. They faced challenges moving their data and found ways to optimize the process for better performance.

I'm Launching My Newsletter: The Tech Buffet!

The Tech Buffet • 79 implied HN points • 01 Sep 23

🕹 Technology Data Engineering

The Tech Buffet is a new newsletter focused on Machine Learning, Data Engineering, and Python Programming. It's designed to help people learn and improve their technical skills.
You can expect weekly updates with practical advice, tutorials, and insights on making machine learning systems more efficient and effective.
The creator wants feedback on what topics readers are interested in, so it's a community-driven project that aims to meet the needs of its audience.

Using AI for Data Modeling in dbt

Inside Data by Mikkel Dengsøe • 41 implied HN points • 04 Jul 25

🕹 Technology Data Engineering

You can use AI to improve data modeling by cleaning raw data and structuring it effectively with tools like dbt. This makes your data easier to work with and analyze.
Creating a good project structure from the start helps manage your data models better and prevents unnecessary refactoring later on. It's smart to plan how your project might grow.
Using AI can save a lot of time in documenting and describing your data models. It helps automatically add useful descriptions, making it quicker to understand your data and its components.

Tracking/Measurement/Collection/Creation - what was the question again?

timo's substack • 78 implied HN points • 26 Mar 23

🕹 Technology Data Engineering

Finding a niche involves identifying what you enjoy and what is consistently needed in your projects.
Tracking data is easily understood, but may have a negative reputation due to its association with web tracking practices.
Measurement is a broader term than tracking, and data collection is often overlooked in the data engineering process.

GroupBy #32: Canva - Scaling to Count Billions, Ensuring Precision and Integrity: A Deep Dive into Uber’s Accounting Data Testing Strategies

VuTrinh. • 19 implied HN points • 23 Apr 24

🕹 Technology Data Engineering

Canva's usage of creator content has skyrocketed, doubling every 18 months. Managing the architecture to track this data is a significant challenge.
Uber has developed strong testing and monitoring processes for its financial accounting data. This ensures accuracy and supports reliable external financial reports.
With the rise of data lakehouses, utilizing tools like Apache Hudi and Paimon can enhance data storage and performance. These tools help build efficient and scalable data solutions.