The hottest Data Engineering Substack posts right now

And their main takeaways
SeattleDataGuy’s Newsletter 317 implied HN points 23 Oct 24
  1. Building your own data orchestration system can lead to many challenges, like handling dependencies and scheduling tasks correctly. Ask whether it's really necessary or whether existing tools would work better.
  2. A custom orchestrator needs to manage various functions like logging, alerting, and integrating with other tools. Without proper features, it can become complex and hard to maintain.
  3. Before you decide to create your own solution, consider what makes it different and better than what's already available. Make sure to also think about how you’ll get people to use your new system.
Data Science Weekly Newsletter 359 implied HN points 17 Mar 23
  1. AI and data science are evolving rapidly, making it challenging for many to keep up. It's common for professionals to feel overwhelmed as they try to understand new advancements.
  2. There's a growing discussion about whether we should slow down AI development. Some people believe we need to pause and figure out the implications of current technologies before moving forward.
  3. Many professionals are exploring career shifts between data science and data engineering. It's important to consider personal interests and skills when deciding which path to take.
Data Science Weekly Newsletter 1 HN point 19 Sep 24
  1. Reading The Data Science Weekly is a great way to stay updated on AI and machine learning topics. It shares links, news, and resources that can help anyone interested in these fields.
  2. There are many useful techniques in data science, like the Hampel Filter for outlier detection, which can help improve data quality. Exploring these methods can really enhance your understanding and skills.
  3. Effective communication is crucial in data science. How you explain your findings can significantly impact your career, so it's important to work on your communication skills.
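The Hampel Filter mentioned in takeaway 2 can be sketched in a few lines. This is a minimal illustration, not code from the newsletter: the function name `hampel_filter`, the `window` and `n_sigmas` parameters, and the sample readings are all assumptions made for the example. Each point is compared with the median of its neighbours, and points that sit too many scaled median absolute deviations (MAD) away are replaced by that median.

```python
from statistics import median


def hampel_filter(values, window=3, n_sigmas=3.0):
    """Return (cleaned, outlier_indices) using a sliding-window Hampel filter.

    `window` is the number of neighbours taken on each side (hypothetical default).
    """
    k = 1.4826  # scales the MAD to a standard-deviation estimate for Gaussian data
    cleaned = list(values)
    outliers = []
    for i in range(len(values)):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        neighbourhood = values[lo:hi]
        med = median(neighbourhood)
        mad = median([abs(v - med) for v in neighbourhood])
        # Flag the point if it deviates from the local median by too much.
        if abs(values[i] - med) > n_sigmas * k * mad:
            cleaned[i] = med
            outliers.append(i)
    return cleaned, outliers


# Made-up sensor readings with one obvious spike at index 4.
readings = [10.0, 10.2, 9.9, 10.1, 55.0, 10.0, 9.8, 10.3]
cleaned, outliers = hampel_filter(readings)
```

Note that when the local MAD is zero (a flat stretch of identical values), any deviation at all is flagged, so real data may need a small tolerance added to the threshold.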
The Orchestra Data Leadership Newsletter 59 implied HN points 20 Mar 24
  1. Apache Iceberg introduces the Bring Your Own Storage (BYOS) concept, which is gaining popularity for efficient and reliable data management in distributed environments.
  2. Key features of Apache Iceberg include Atomic Transactions, Schema Evolution, Partitioning and Sorting, Time Travel, Incremental Data Updates, Metadata Management, and Compatibility with various data processing frameworks.
  3. Platforms like Snowflake are shifting towards supporting Iceberg due to its benefits in handling data efficiently and enabling a Bring Your Own Storage pattern.
VuTrinh. 79 implied HN points 10 Feb 24
  1. Snowflake separates storage and compute, allowing for flexible scaling and improved performance. This means that data storage can grow separately from computing power, making it easier to manage resources.
  2. Data can be stored in a cloud-based format that supports both structured and semi-structured data. This flexibility allows users to easily handle various data types without needing to define a strict schema.
  3. Snowflake implements unique optimization techniques, like data skipping and a push-based query execution model, which enhance performance and efficiency when processing large amounts of data.
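The data skipping idea in takeaway 3 can be illustrated with min/max metadata kept per partition. This is a simplified sketch under stated assumptions, not Snowflake's actual implementation: the partition names, the `build_metadata` and `scan_greater_than` helpers, and the sample values are hypothetical.

```python
def build_metadata(partitions):
    """Record the (min, max) of each partition's values once, at write time."""
    return {name: (min(vals), max(vals)) for name, vals in partitions.items()}


def scan_greater_than(partitions, metadata, threshold):
    """Return (partitions actually read, matching values) for `value > threshold`."""
    scanned, matches = [], []
    for name, (_, hi) in metadata.items():
        if hi <= threshold:
            continue  # no row in this partition can match, so skip reading it
        scanned.append(name)
        matches.extend(v for v in partitions[name] if v > threshold)
    return scanned, matches


# Hypothetical column values split across three partitions.
partitions = {"p0": [1, 5, 9], "p1": [12, 18, 20], "p2": [25, 31, 40]}
metadata = build_metadata(partitions)
scanned, matches = scan_greater_than(partitions, metadata, 19)
```

Here partition `p0` is never read because its maximum is below the filter value. The benefit depends on how well the data is clustered: if every partition spans the full value range, nothing can be skipped.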
Data Science Weekly Newsletter 219 implied HN points 14 Jul 23
  1. Machine learning is making its way into finance, and researchers are identifying practical uses for it. This can help finance professionals learn new tools and statisticians find interesting financial problems to solve.
  2. AI platforms, like social media, are becoming crucial in our lives but can be confusing and unreliable. People are figuring out how to use these platforms effectively despite their unpredictability.
  3. Large language models are changing how data scientists work. These models can automate many tasks, allowing data scientists to focus on managing and assessing the AI's outputs.
Data Engineering Central 157 implied HN points 24 Apr 23
  1. Brittleness in data pipelines can lead to various issues like data quality problems, difficult debugging, and slow performance.
  2. To overcome brittle pipelines, focus on addressing data quality issues through monitoring, sanity checks, and using tools like Great Expectations.
  3. Development issues such as lack of tests, poor documentation, and bad code practices contribute to brittle pipelines; implementing best practices like unit testing and Docker can help improve pipeline reliability.
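The sanity checks described above can start very small. The sketch below uses plain Python rather than the Great Expectations API, and the column names (`order_id`, `amount`), the `check_batch` helper, and the sample batch are assumptions made for illustration.

```python
def check_batch(rows, required=("order_id", "amount")):
    """Return a list of human-readable problems found in a batch of dict rows."""
    problems = []
    if not rows:
        problems.append("batch is empty")
    seen = set()
    for i, row in enumerate(rows):
        # Required columns must be present and non-null.
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: missing {col}")
        # Range check on a numeric column.
        amount = row.get("amount")
        if amount is not None and amount < 0:
            problems.append(f"row {i}: negative amount")
        # Uniqueness check on the key column.
        key = row.get("order_id")
        if key is not None:
            if key in seen:
                problems.append(f"row {i}: duplicate order_id {key}")
            seen.add(key)
    return problems


# Made-up batch with three deliberate defects.
batch = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": -5.0},
    {"order_id": 2, "amount": None},
]
problems = check_batch(batch)
```

Because `check_batch` is a pure function, it is also easy to cover with the unit tests that takeaway 3 recommends.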
Data Engineering Central 157 implied HN points 13 Mar 23
  1. Understanding Data Structures and Algorithms is important for becoming a better engineer, even if you may not use them daily.
  2. Linked Lists are a linear data structure where elements are not stored contiguously in memory but are linked using pointers.
  3. Creating a simple Linked List in Rust involves defining nodes with values and pointers to other nodes, creating a LinkedList to hold these nodes, and then linking them to form a chain.
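The node-and-pointer structure described above is sketched here in Python rather than Rust, so it shows the shape of the idea and not the post's actual code. The class and method names (`Node`, `LinkedList`, `push_front`) are assumptions for the example.

```python
class Node:
    """One element of the chain: a value plus a reference to the next node."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class LinkedList:
    """Holds the head of the chain; nodes are linked, not stored contiguously."""

    def __init__(self):
        self.head = None

    def push_front(self, value):
        # The new node points at the old head, then becomes the head.
        self.head = Node(value, self.head)

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.value
            node = node.next


chain = LinkedList()
for v in (3, 2, 1):
    chain.push_front(v)
items = list(chain)
```

A Rust version has to make ownership of the `next` pointer explicit, which this Python sketch does not capture.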
Data Science Weekly Newsletter 199 implied HN points 28 Jul 23
  1. Large language models use complex methods like word vectors and transformers to understand language, but this can be explained simply without heavy math. They need a lot of data to perform well.
  2. Using AI tools like ChatGPT for real-world programming tasks can streamline the coding process, as it allows for a more focused workflow without switching between different resources.
  3. Building effective data storage systems, like Amazon S3, involves overcoming interesting challenges and nuances, demonstrating the amazing technology behind big data management.
Data Science Weekly Newsletter 299 implied HN points 06 Apr 23
  1. Understanding linear programming can help solve complex problems using Python. It's useful in various fields and can optimize outcomes.
  2. MLOps is closely related to data engineering, showing that managing data for machine learning involves more engineering than initially thought.
  3. The new pandas 2.0 version has exciting features like the Apache Arrow backend, which will enhance its performance and capabilities.
SeattleDataGuy’s Newsletter 1048 implied HN points 11 Apr 23
  1. Data engineering and machine learning pipelines are essential components for every company, but they are often confused even though they have different objectives.
  2. Data engineering pipelines involve data collection, cleaning, integration, and storage, while machine learning pipelines focus on data cleaning, feature engineering, model training, evaluation, registry, deployment, and monitoring.
  3. Both data and ML pipelines require careful consideration of computational needs to handle sudden changes, and understanding the differences between them is important for effective data processing and decision-making.
Data Science Weekly Newsletter 319 implied HN points 09 Mar 23
  1. The newsletter shares interesting links about data science, machine learning, and AI each week. It’s a good way to keep up with new trends and knowledge in the field.
  2. There's a discussion on what databases should do but often don’t. Understanding these gaps can help you improve your data projects by knowing what to build yourself.
  3. AI's impact on jobs and industries is being researched, especially how language models like ChatGPT could change certain occupations. It's important to understand how AI can affect your career choices.
Data Science Weekly Newsletter 219 implied HN points 23 Jun 23
  1. AI technology is advancing quickly and can even cover public meetings, but we need to think carefully about its readiness for everyday use.
  2. Engineers can improve their people skills and interactions by applying the same problem-solving mindset they use in their technical work.
  3. Generative AI is becoming important in data science for creating synthetic data, which helps in privacy and enhances analysis without losing useful information.
Data Science Weekly Newsletter 219 implied HN points 16 Jun 23
  1. Using large language models can help kids learn to ask curious questions by automating the teaching process.
  2. New techniques for 3D space reconstruction can make indoor views on platforms like Google Maps look more realistic and interactive.
  3. There's a growing need to understand the value of personal data in online shopping, especially as new regulations come into play.
Dev Interrupted 14 implied HN points 09 Dec 25
  1. Pre-computing and storing large volumes of derived data wastes money and adds latency because most of it is never used. Shifting to real-time, incremental pipelines means you only compute what users actually need.
  2. Owning the full stack (hardware, training, and cloud) creates a competitive moat and can change leaderboard dynamics quickly. Design your systems to be model-agnostic and flexible so you don’t get locked into one provider.
  3. Typical engineering metrics like velocity or lines of code are often misleading; measure what exposes real friction, bottlenecks, and business outcomes. Use metrics to make the system legible and actionable, not just to produce executive reports.
Data Engineering Central 137 implied HN points 20 Mar 23
  1. Future-proof yourself against AI to stay relevant in the changing landscape of software engineering.
  2. There are three types of people when it comes to AI and programming: those who don't use AI and dismiss it, those who use it to enhance their work, and those who rely on it completely and may become less effective engineers.
  3. The impact of AI on software engineering is inevitable and will lead to changes in the field over time.
Data Engineering Central 137 implied HN points 24 Jul 23
  1. Data Engineers may have a love-hate relationship with AWS Lambda functions: they are versatile but come with limitations.
  2. AWS Lambda functions are under-utilized in Data Engineering but offer benefits like low cost, ease of use, and a push toward better practices.
  3. AWS Lambdas are handy for processing small datasets, running data quality checks, and executing quick logic while reducing architecture complexity and cost.
Data Engineering Central 137 implied HN points 12 Jun 23
  1. Feature Stores are essential in machine learning for managing and serving features.
  2. Feature Stores provide consistency, reusability, efficiency, discoverability, and monitoring benefits.
  3. Popular Feature Store options include Databricks Feature Store, Feast (open-source), Postgres, DynamoDB, and S3.
VuTrinh. 39 implied HN points 09 Apr 24
  1. LedgerStore at Uber can handle trillions of indexes, making it a powerful tool for managing large-scale data efficiently.
  2. Apache Calcite helps build flexible data systems with strong query optimization features, which are vital for many data applications.
  3. Spotify's data platform plays a critical role in their operations, guiding how to build effective data systems in organizations.
Sung’s Substack 79 implied HN points 02 Jan 24
  1. Getting your hands dirty on real projects matters more for growth than focusing only on certifications or theory.
  2. Solving real problems in public and getting your hands dirty in open source can have a significant impact on your career, surpassing the importance of certifications.
  3. Engaging in hands-on experience and collaborating on projects that matter can lead to valuable personal growth and career advancement.
Data Science Weekly Newsletter 179 implied HN points 30 Jun 23
  1. Data scientists are sharing tips on how to make their scientific data more accessible and useful. This helps others to understand and use the data better.
  2. There are many discussions happening about the benefits and drawbacks of large language models (LLMs) like ChatGPT. Some people believe they are amazing, while others think they aren't very helpful.
  3. Naming things in programming can be tough, but there are resources and books that can help. Learning the right naming conventions can improve coding practices.
Gradient Flow 199 implied HN points 23 Feb 23
  1. The blend of artificial intelligence and chatbot interfaces, as seen in ChatGPT, is transforming search applications, with startups emphasizing large language models for better search experiences.
  2. Expectations around user interactions with company websites are changing with the rise of chatbot-equipped search engines, requiring integration of AI and foundation models for improved responses incorporating text, images, videos, and audio.
  3. Data and AI teams are crucial in developing, testing, and maintaining next-generation search applications, with companies likely seeking more control over their data and the potential creation of custom models for enhanced privacy and innovation.
Sung’s Substack 139 implied HN points 14 Mar 23
  1. Data engineering involves many tedious tasks and manual checks, hindering the ability to reach a state of flow.
  2. Software engineers have smoother workflows and better tools compared to data engineers, allowing them to focus on their work and enjoy the process.
  3. There is potential to improve the data engineering workflow by implementing real-time monitoring, interactive previews, and streamlined processes to enhance the experience.
VuTrinh. 39 implied HN points 12 Mar 24
  1. GitHub uses a merge queue system that helps them quickly ship many code changes each day. This makes their deployment process faster and more efficient.
  2. Data governance is becoming really important, especially with the rise of generative AI. Companies need to ensure the data used by these systems is accurate and secure.
  3. The idea of 'Good Enough' data models suggests that it's okay to have models that meet basic needs instead of striving for perfection. This approach can save time and resources.
VuTrinh. 59 implied HN points 13 Jan 24
  1. BigQuery uses a method called definition and repetition levels for efficient storage of nested and repeated data. This allows for reading specific parts of data without needing to access other related data.
  2. In columnar storage, data is organized by columns which can improve performance, especially for analytical queries, because only the needed columns are loaded.
  3. Using this method might increase file sizes due to redundancy, but it helps reduce the input/output operations needed when accessing nested fields.
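Definition and repetition levels are easier to follow on a tiny case. The sketch below covers only a single top-level repeated field, so the levels are just 0 or 1; real nested schemas need deeper levels, and this is not BigQuery's implementation. The `shred` and `assemble` helpers and the sample records are assumptions for the example.

```python
def shred(records):
    """Flatten a repeated field into (value, repetition_level, definition_level).

    repetition 0 = starts a new record, 1 = continues the current list.
    definition 0 = the list is empty (no value), 1 = a value is present.
    """
    column = []
    for tags in records:
        if not tags:
            column.append((None, 0, 0))  # placeholder so the record is not lost
            continue
        for i, tag in enumerate(tags):
            column.append((tag, 0 if i == 0 else 1, 1))
    return column


def assemble(column):
    """Rebuild the original records from the flattened column alone."""
    records = []
    for value, repetition, definition in column:
        if repetition == 0:
            records.append([])
        if definition == 1:
            records[-1].append(value)
    return records


# Three made-up records; the second has an empty list.
records = [["a", "b"], [], ["c"]]
column = shred(records)
rebuilt = assemble(column)
```

The extra level values are the redundancy takeaway 3 refers to: they add bytes to the column, but they let a reader rebuild this field without touching any other column.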
The Orchestra Data Leadership Newsletter 79 implied HN points 26 Nov 23
  1. Data catalogs are not just for enterprises but also benefit startups by driving business value.
  2. Data catalogs help organizations manage and present their data assets in a user-friendly way for better adoption and value extraction.
  3. Using data catalogs can simplify data access, encourage collaboration between technical and business users, and potentially enhance BI functionalities within organizations.
potentialmind 19 implied HN points 18 May 24
  1. The demand for AI Engineers is skyrocketing due to advancements in AI, making it one of the most in-demand engineering jobs of the decade.
  2. To excel in AI Engineering, practical knowledge and hands-on experience are prioritized over traditional academic qualifications like PhDs or specific coursework, such as a PyTorch course.
  3. Modern applied AI is changing the landscape, making it easier for software engineers and product managers to leverage large language models and AI frameworks without extensive data collection.
SeattleDataGuy’s Newsletter 671 implied HN points 23 Apr 23
  1. Data engineering is crucial in today's data-driven landscape, with a growing demand for skilled professionals.
  2. Developing technical skills like architecture, data modeling, coding, testing, and CI/CD is essential for becoming a successful data engineer.
  3. Non-technical skills such as teaching, long-term project planning, and communication are equally important for data engineers to excel and become force multipliers.
VuTrinh. 19 implied HN points 30 Apr 24
  1. Netflix has created a platform called Data Gateway that helps their developers manage data more easily. It simplifies complex database processes so that app developers can focus on coding.
  2. The cloud storage triad talks about balancing latency, cost, and durability when storing data. Choosing the right storage solution can save money while ensuring data is always available.
  3. Managing data ingestion effectively is crucial for companies like RevenueCat. They faced challenges moving their data and found ways to optimize the process for better performance.
The Tech Buffet 79 implied HN points 01 Sep 23
  1. The Tech Buffet is a new newsletter focused on Machine Learning, Data Engineering, and Python Programming. It's designed to help people learn and improve their technical skills.
  2. You can expect weekly updates with practical advice, tutorials, and insights on making machine learning systems more efficient and effective.
  3. The creator wants feedback on what topics readers are interested in, so it's a community-driven project that aims to meet the needs of its audience.
Inside Data by Mikkel Dengsøe 41 implied HN points 04 Jul 25
  1. You can use AI to improve data modeling by cleaning raw data and structuring it effectively with tools like dbt. This makes your data easier to work with and analyze.
  2. Creating a good project structure from the start helps manage your data models better and prevents unnecessary refactoring later on. It's smart to plan how your project might grow.
  3. Using AI can save a lot of time in documenting and describing your data models. It helps automatically add useful descriptions, making it quicker to understand your data and its components.
timo's substack 78 implied HN points 26 Mar 23
  1. Finding a niche involves identifying what you enjoy and what is consistently needed in your projects.
  2. Tracking data is easily understood, but may have a negative reputation due to its association with web tracking practices.
  3. Measurement is a broader term than tracking, and data collection is often overlooked in the data engineering process.
VuTrinh. 19 implied HN points 23 Apr 24
  1. Canva's usage of creator content has skyrocketed, doubling every 18 months. Managing the architecture to track this data is a significant challenge.
  2. Uber has developed strong testing and monitoring processes for its financial accounting data. This ensures accuracy and supports reliable external financial reports.
  3. With the rise of data lakehouses, utilizing tools like Apache Hudi and Paimon can enhance data storage and performance. These tools help build efficient and scalable data solutions.