The hottest Data Engineering Substack posts right now

And their main takeaways
SeattleDataGuy’s Newsletter 694 implied HN points 14 Feb 24
  1. To grow from mid to senior level, it's important to continuously learn and improve, share new knowledge, work on code improvements, and become an expert in a certain domain.
  2. Making the team better is crucial - focus on mentoring, sharing knowledge, and creating a positive team environment. Think beyond individual tasks to impact the overall team outcomes.
  3. Seniority includes building not just technical solutions, but solutions that customers love. Challenge requirements, understand the business and product, and take initiative in problem-solving.
SeattleDataGuy’s Newsletter 1165 implied HN points 02 Jan 24
  1. Breaking into data engineering may be easier through lateral moves, like from data analyst to data engineer.
  2. The 100-day plan discussed is not meant to master data engineering but to help commit to learning and identify areas for improvement.
  3. The plan includes reviewing basics, diving deeper, building a mini project, surveying tools, best practices, and committing to a final project.
SeattleDataGuy’s Newsletter 612 implied HN points 21 Nov 23
  1. Normalization structures data to reduce duplication and ensure integrity.
  2. Goals of normalization include eliminating redundancy, minimizing data mutation issues, and protecting data integrity.
  3. Denormalization introduces redundancy strategically to improve read performance, useful for reporting, analytics, and read-heavy applications.
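The trade-off above can be shown with a minimal sketch using Python's built-in sqlite3 module. The table and column names are illustrative (not from the original post): a normalized schema stores each customer name once, and a denormalized copy duplicates it onto every order row so read-heavy queries avoid the join.

```python
import sqlite3

# Normalized schema: customers and orders are separate tables, so each
# fact is stored exactly once and updates cannot drift out of sync.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 12.5), (3, 2, 99.0);

    -- Denormalization: precompute the join once, accepting duplicated
    -- customer names in exchange for cheaper reads in analytics queries.
    CREATE TABLE orders_denorm AS
    SELECT o.id, c.name AS customer_name, o.amount
    FROM orders o JOIN customers c ON o.customer_id = c.id;
""")
rows = conn.execute(
    "SELECT customer_name, SUM(amount) FROM orders_denorm "
    "GROUP BY customer_name ORDER BY customer_name"
).fetchall()
print(rows)  # [('Ada', 42.5), ('Grace', 99.0)]
```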
nick’s datastack 1 HN point 24 Apr 24
  1. Generative AI can itself generate data, which significantly changes how workflows and pipelines are designed.
  2. Using LLMs for prompt-based feature engineering can save time and effort compared to traditional methods like manual data searching and merging.
  3. While LLMs in data pipelines may feel magical, it's important to be cautious of potential inaccuracies due to the probabilistic nature of AI outputs.
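A rough sketch of what prompt-based feature engineering looks like in a pipeline. The `fake_llm` function below is a hypothetical stand-in for a real LLM API call (not from the post); the point is the shape of the pipeline, including validating the probabilistic output before trusting it downstream.

```python
# Hypothetical stand-in for a hosted LLM call; a real pipeline would
# send the prompt to a model API here.
def fake_llm(prompt: str) -> str:
    text = prompt.split("TEXT:")[-1].lower()
    return "positive" if "love" in text or "great" in text else "negative"

def sentiment_feature(review: str) -> str:
    """Derive a sentiment feature from raw text via a prompt instead of
    hand-written parsing rules."""
    prompt = f"Classify the sentiment as positive or negative.\nTEXT: {review}"
    label = fake_llm(prompt)
    # Guard against the probabilistic nature of LLM output: validate
    # the answer before letting it into the pipeline.
    if label not in {"positive", "negative"}:
        raise ValueError(f"unexpected model output: {label!r}")
    return label

features = [sentiment_feature(r) for r in ["I love this tool", "It broke twice"]]
print(features)  # ['positive', 'negative']
```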
High ROI Data Science 294 implied HN points 10 Jan 24
  1. Understanding the long-chain in marketing is crucial for connecting business outcomes with data and metrics.
  2. Data engineering and knowledge management are essential for transforming data into valuable assets that can be monetized by the business.
  3. Long-chain marketing involves seeing marketing efforts as part of a longer sequence of actions that lead to business outcomes, rather than standalone events.
Data Analysis Journal 373 implied HN points 25 Oct 23
  1. Learning data skills is more accessible, and the resources are better, than in past years.
  2. For transitioning into data engineering, focus on SQL, programming, data warehouse, and data pipelines.
  3. Analysts should focus on understanding the business problem, building maintainable systems, and following a data analytics process.
SeattleDataGuy’s Newsletter 1048 implied HN points 11 Apr 23
  1. Data engineering and machine learning pipelines are essential components for every company, but they are often confused even though they have different objectives.
  2. Data engineering pipelines involve data collection, cleaning, integration, and storage, while machine learning pipelines focus on data cleaning, feature engineering, model training, evaluation, registry, deployment, and monitoring.
  3. Both data and ML pipelines require careful consideration of computational needs to handle sudden changes, and understanding the differences between them is important for effective data processing and decision-making.
SeattleDataGuy’s Newsletter 671 implied HN points 23 Apr 23
  1. Data engineering is crucial in today's data-driven landscape, with a growing demand for skilled professionals.
  2. Developing technical skills like architecture, data modeling, coding, testing, and CI/CD is essential for becoming a successful data engineer.
  3. Non-technical skills such as teaching, long-term project planning, and communication are equally important for data engineers to excel and become force multipliers.
SwirlAI Newsletter 432 implied HN points 28 Jun 23
  1. The newsletter provides a Table of Contents with more than 90 topics, making it easier to find the content of interest.
  2. Topics covered include Data Engineering fundamentals, Spark architecture, Kafka use cases, MLOps deployment processes, System Design examples, and more.
  3. If interested, it's recommended to support the author's work by subscribing and sharing the content.
SwirlAI Newsletter 412 implied HN points 18 Jun 23
  1. Vector Databases are essential for working with Vector Embeddings in Machine Learning applications.
  2. Partitioning and Bucketing are important concepts in Spark for efficient data storage and processing.
  3. Vector Databases have various real-life applications, from natural language processing to recommendation systems.
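The core operation a vector database indexes and accelerates is nearest-neighbour search over embeddings. A brute-force sketch of that operation (the vectors and labels below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "embedding store": in a real system these vectors come from an
# embedding model and are indexed for fast approximate search.
corpus = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.85, 0.2, 0.05],
    "truck": [0.0, 0.1, 0.95],
}

def nearest(query):
    # Brute-force scan; vector databases replace this with ANN indexes.
    return max(corpus, key=lambda k: cosine_similarity(query, corpus[k]))

print(nearest([1.0, 0.1, 0.0]))  # cat
```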
Data Engineering Central 393 implied HN points 15 May 23
  1. Working on Machine Learning as a Data Engineer is not as hard as it seems; it falls somewhere in the middle of the difficulty spectrum.
  2. Machine Learning work for Data Engineers focuses on MLOps like feature stores, model prediction, automation, and metadata storage.
  3. The key aspects of MLOps include automating tasks, using tools like Apache Airflow, and managing metadata for a stable ML environment.
SwirlAI Newsletter 373 implied HN points 15 Apr 23
  1. Partitioning and bucketing are two key data distribution techniques in Spark.
  2. Partitioning improves performance by letting Spark skip reading the entire dataset when only part of it is needed.
  3. Bucketing is beneficial for collocating data and avoiding shuffling in operations like joins and groupBys.
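The pruning benefit of partitioning can be illustrated without a Spark cluster. This pure-Python sketch mimics the hive-style `date=YYYY-MM-DD/` directory layout that Spark's `partitionBy()` produces on write: a reader filtering on the partition column only opens the matching directory and never touches the other partitions. Paths and values are illustrative.

```python
import csv
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
events = [("2024-01-01", "login"), ("2024-01-01", "click"), ("2024-01-02", "login")]

# Write: one directory per partition value, like df.write.partitionBy("date").
for date, action in events:
    part_dir = root / f"date={date}"
    part_dir.mkdir(exist_ok=True)
    with open(part_dir / "part-0.csv", "a", newline="") as f:
        csv.writer(f).writerow([action])

def read_partition(date):
    # Partition pruning: only the requested date's directory is read;
    # every other partition on disk is skipped entirely.
    part_dir = root / f"date={date}"
    with open(part_dir / "part-0.csv", newline="") as f:
        return [row[0] for row in csv.reader(f)]

print(read_partition("2024-01-01"))  # ['login', 'click']
```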
Three Data Point Thursday 39 implied HN points 11 Jan 24
  1. Synthetic data is fake data that is becoming increasingly practical and valuable.
  2. Generative AI and the growing gap between data demand and availability are driving forces for the usefulness of synthetic data.
  3. Synthetic data is beneficial in various fields beyond just machine learning, offering opportunities for innovation and improvement.
SwirlAI Newsletter 294 implied HN points 18 Mar 23
  1. Learning to decompose a data system is crucial for reasoning about and understanding large infrastructure.
  2. Decomposing a data system allows for scalability, identification of bottlenecks, and optimization of total event processing latency.
  3. The different layers in a data system include data ingestion, transformation, and serving layers, each with specific functions and technologies.
Data Engineering Central 294 implied HN points 10 Apr 23
  1. Airflow has been a dominant tool for data orchestration, but new tools like Prefect and Mage are challenging its reign.
  2. Prefect focuses on using Python for defining tasks and workflows, but may not offer enough differentiation from Airflow.
  3. Mage stands out for its focus on engineering best practices and providing a smoother developer experience, making it a compelling choice over Airflow for scaling up data pipelines.
SwirlAI Newsletter 255 implied HN points 07 May 23
  1. Watermarks in Stream Processing help handle event lateness and decide when to treat data as 'late data'.
  2. In SQL Query execution, the order is FROM and JOIN, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT.
  3. To optimize SQL Queries, reduce dataset sizes for joins and use subqueries for pre-filtering.
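The watermark idea in the first takeaway can be sketched in a few lines: the watermark trails the maximum event time seen so far by a fixed allowed lateness, and any event whose timestamp falls behind it is classified as late data. The 5-second lateness and the event stream below are made-up values.

```python
ALLOWED_LATENESS = 5  # seconds an event may lag the max event time seen

def process(events):
    max_event_time = 0
    on_time, late = [], []
    for ts, payload in events:
        max_event_time = max(max_event_time, ts)
        # The watermark advances with event time, trailing by the
        # allowed lateness; anything behind it is treated as late data.
        watermark = max_event_time - ALLOWED_LATENESS
        (late if ts < watermark else on_time).append(payload)
    return on_time, late

# Event at t=2 arrives after t=10 has been seen: watermark is 5, so it's late.
stream = [(1, "a"), (10, "b"), (7, "c"), (2, "d")]
on_time, late = process(stream)
print(on_time, late)  # ['a', 'b', 'c'] ['d']
```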
davidj.substack 167 implied HN points 19 Jul 23
  1. The Modern Data Stack (MDS) community has grown significantly over the years with various meetups and events.
  2. Using tools like Snowflake, dbt, and Looker in the Modern Data Stack improves data capabilities and productivity.
  3. Although some criticize the Modern Data Stack and its imperfections, it has greatly enhanced data handling and analytics for many organizations.
Data Engineering Central 137 implied HN points 24 Jul 23
  1. Data Engineers may have a love-hate relationship with AWS Lambdas due to their versatility but occasional limitations.
  2. AWS Lambdas are under-utilized in Data Engineering but offer benefits like low cost, ease of use, and encouraging better practices.
  3. AWS Lambdas are handy for processing small datasets, running data quality checks, and executing quick logic while reducing architecture complexity and cost.
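A sketch of the kind of small job that fits a Lambda: a quick data quality check over a small batch of records. The `(event, context)` signature is Lambda's standard Python handler shape; the record format and check are hypothetical, and here the handler is simply invoked locally.

```python
def handler(event, context=None):
    # Quick-logic data quality check: flag rows with missing or
    # negative amounts. Record shape is invented for illustration.
    records = event["records"]
    bad = [r for r in records if r.get("amount") is None or r["amount"] < 0]
    return {
        "checked": len(records),
        "failed": len(bad),
        "status": "ok" if not bad else "failed",
    }

result = handler({"records": [{"amount": 10.0}, {"amount": -3.0}, {"amount": None}]})
print(result)  # {'checked': 3, 'failed': 2, 'status': 'failed'}
```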
Data Engineering Central 216 implied HN points 13 Feb 23
  1. Data Engineers often struggle with implementing unit tests due to factors like focus on moving fast and historical lack of emphasis on testing.
  2. Unit testable code in data engineering involves keeping functions small, minimizing side effects, and ensuring reusability.
  3. Implementing unit tests can elevate a data team's performance and lead to better software quality and bug control.
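The advice above can be sketched concretely: keep transformation logic in small, pure functions with no side effects, so each one can be unit tested in isolation. The cleaning rule below is invented for illustration.

```python
import unittest

def clean_amount(raw: str) -> float:
    """Parse a currency string like ' $1,200.50 ' into a float.
    Small, pure, and side-effect free, so it is trivial to test."""
    return float(raw.strip().lstrip("$").replace(",", ""))

class CleanAmountTest(unittest.TestCase):
    def test_strips_symbols_and_whitespace(self):
        self.assertEqual(clean_amount(" $1,200.50 "), 1200.50)

    def test_plain_number(self):
        self.assertEqual(clean_amount("42"), 42.0)

suite = unittest.TestLoader().loadTestsFromTestCase(CleanAmountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```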
Data Engineering Central 157 implied HN points 24 Apr 23
  1. Brittleness in data pipelines can lead to various issues like data quality problems, difficult debugging, and slow performance.
  2. To overcome brittle pipelines, focus on addressing data quality issues through monitoring, sanity checks, and using tools like Great Expectations.
  3. Development issues such as lack of tests, poor documentation, and bad code practices contribute to brittle pipelines; implementing best practices like unit testing and Docker can help improve pipeline reliability.
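The sanity checks mentioned above don't require a framework to get started; tools like Great Expectations formalize the same idea. A plain-Python sketch, with invented rules and rows, where each failure produces a message a pipeline stage could act on before loading data:

```python
def check_rows(rows):
    # Simple sanity checks: required fields present, values in range.
    failures = []
    for i, row in enumerate(rows):
        if row.get("user_id") is None:
            failures.append(f"row {i}: missing user_id")
        if not (0 <= row.get("age", -1) <= 130):
            failures.append(f"row {i}: age out of range")
    return failures

rows = [
    {"user_id": 1, "age": 34},
    {"user_id": None, "age": 29},
    {"user_id": 3, "age": 500},
]
for msg in check_rows(rows):
    print(msg)
# row 1: missing user_id
# row 2: age out of range
```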
Data Engineering Central 157 implied HN points 13 Mar 23
  1. Understanding Data Structures and Algorithms is important for becoming a better engineer, even if you may not use them daily.
  2. Linked Lists are a linear data structure where elements are not stored contiguously in memory but are linked using pointers.
  3. Creating a simple Linked List in Rust involves defining nodes with values and pointers to other nodes, creating a LinkedList to hold these nodes, and then linking them to form a chain.
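The post walks through this in Rust; the same structure can be sketched in Python, where object references play the role of pointers: each node stores a value plus a reference to the next node, so elements live wherever the allocator puts them rather than contiguously.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None  # reference to the following node, or None at the tail

class LinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, value):
        # New node takes over the head; old head becomes its successor.
        node = Node(value)
        node.next = self.head
        self.head = node

    def to_list(self):
        # Walk the chain of references from head to tail.
        out, cur = [], self.head
        while cur is not None:
            out.append(cur.value)
            cur = cur.next
        return out

ll = LinkedList()
for v in (3, 2, 1):
    ll.push_front(v)
print(ll.to_list())  # [1, 2, 3]
```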
Data Engineering Central 137 implied HN points 20 Mar 23
  1. Future proof yourself against AI to stay relevant in the changing landscape of software engineering.
  2. There are three types of people when it comes to AI and programming: those who don't use AI and dismiss it, those who use it to enhance their work, and those who rely on it completely and may become less effective engineers.
  3. The impact of AI on software engineering is inevitable and will lead to changes in the field over time.
Counting Stuff 54 implied HN points 11 Jul 23
  1. It is beneficial to have familiarity with running a small server to learn skills and appreciate the work of Ops and SRE professionals.
  2. Consider the value of running a small server for hosting personal projects like a homepage or resume.
  3. Exploring web-based RSS apps can help manage information overload and stay updated with blogs and newsletters.