The hottest Data Engineering Substack posts right now

And their main takeaways
Monthly Python Data Engineering 179 implied HN points 25 Jul 24
  1. The Python Data Engineering newsletter focuses on key updates and tools for building data engineering projects, rather than just data science.
  2. This month showcased rapid development in projects like Narwhals and Polars, with Narwhals making 26 releases and Polars reaching version 1.0.0.
  3. Several other libraries, such as Great Tables and Dask, also had important updates, making it a busy month for Python data engineering tools.
VuTrinh. 199 implied HN points 20 Jul 24
  1. Kafka producers are responsible for sending messages to servers. They prepare the messages, choose where to send them, and then actually send them to the Kafka brokers.
  2. There are different ways to send messages: fire-and-forget, synchronous, and asynchronous. Each method has its pros and cons, depending on whether you want speed or reliability.
  3. Producers can control message acknowledgment with the 'acks' parameter to determine when a message is considered successfully sent. This parameter affects data safety, with options that range from no acknowledgment to full confirmation from all replicas.
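The effect of the 'acks' setting described above can be sketched with a toy model. This is not a Kafka client: `WriteOutcome` and `is_acknowledged` are hypothetical names invented for illustration, and the only detail taken from Kafka itself is the set of values "0", "1", and "all".

```python
from dataclasses import dataclass


@dataclass
class WriteOutcome:
    """Hypothetical record of what happened to one message on the brokers."""
    leader_written: bool
    followers_written: int   # in-sync followers that stored the message
    followers_in_sync: int   # in-sync followers that exist for the partition


def is_acknowledged(acks: str, outcome: WriteOutcome) -> bool:
    """Toy model: would the producer treat this send as successful?"""
    if acks == "0":
        # No acknowledgment requested: the producer never waits for a broker.
        return True
    if acks == "1":
        # Leader-only acknowledgment.
        return outcome.leader_written
    if acks == "all":
        # The leader and every in-sync follower must have the message.
        return (outcome.leader_written
                and outcome.followers_written >= outcome.followers_in_sync)
    raise ValueError(f"unsupported acks value: {acks!r}")


# The leader stored the message, but only one of two followers did.
partial = WriteOutcome(leader_written=True, followers_written=1, followers_in_sync=2)
results = {acks: is_acknowledged(acks, partial) for acks in ("0", "1", "all")}
```

In this sketch the same partial write counts as sent under "0" and "1" but not under "all", which is the speed versus safety trade-off the post describes.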
The Data Jargon Newsletter 138 implied HN points 23 Aug 24
  1. If your data product isn't making money, it's really just an internal tool. It's important to focus on projects that add real value.
  2. Having a good Business Intelligence team can often bring more benefits than trying to make fancy data products. Simple tools can lead to effective data use.
  3. More data engineers can improve your data platform, but just adding analysts might not directly make your data team better. It's all about how the team fits with the organization.
Pea Bee 183 implied HN points 29 Dec 25
  1. PressGuessr is a game that asks players to guess the publication year of Indian Express front pages using visual and textual clues.
  2. The dataset has over 13,000 front pages from 1932–2025 gathered from Google News Archive and PressReader, with publication dates programmatically blurred and many modern full-page ads removed.
  3. Building the game was enjoyable, and playing it is more challenging than expected. You can try it at pressguessr.com.
VuTrinh. 219 implied HN points 02 Jul 24
  1. PayPal operates a massive Kafka system with over 85 clusters and handles around 1.3 trillion messages daily. They manage data growth by using multiple geographical data centers for efficiency.
  2. To improve user experience and security, PayPal developed tools like the Kafka Config Service for easier broker management and added access control lists to restrict who can connect to their Kafka clusters.
  3. PayPal focuses on automation and monitoring, implementing systems to patch vulnerabilities quickly and manage topics, while also tuning metrics so that issues with their Kafka platform are identified early.
VuTrinh. 319 implied HN points 08 Jun 24
  1. LinkedIn processes around 4 trillion events every day, using Apache Beam to unify their streaming and batch data processing. This helps them run pipelines more efficiently and save development time.
  2. By switching to Apache Beam, LinkedIn significantly improved their performance metrics. For example, one pipeline's processing time went from over 7 hours to just 25 minutes.
  3. Their anti-abuse systems became much faster with Beam, reducing the time taken to identify abusive actions from a day to just 5 minutes. This increase in efficiency greatly enhances user safety and experience.
VuTrinh. 119 implied HN points 27 Jul 24
  1. Kafka uses a pull model for consumers, allowing them to control the message retrieval rate. This helps consumers manage workloads without being overwhelmed.
  2. Consumer groups in Kafka let multiple consumers share the load of reading from topics, but within a group each partition is read by only one consumer at a time for efficient processing.
  3. Kafka handles rebalancing when consumers join or leave a group. This can be done eagerly, stopping all consumers, or cooperatively, allowing ongoing consumption from unaffected partitions.
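The one-owner-per-partition rule and the idea behind cooperative rebalancing can be illustrated with a small simulation. It is a sketch under simplifying assumptions: `assign` and `moved` are hypothetical helpers, and round-robin is used only because it is easy to follow, not because it is what a given Kafka assignor does.

```python
def assign(partitions, consumers):
    """Round-robin assignment: every partition gets exactly one owner."""
    members = sorted(consumers)
    return {p: members[i % len(members)] for i, p in enumerate(sorted(partitions))}


def moved(before, after):
    """Partitions whose owner changed between two assignments."""
    return {p for p in after if before.get(p) != after[p]}


# Hypothetical group: six partitions, first two consumers, then a third joins.
before = assign(range(6), ["c1", "c2"])
after = assign(range(6), ["c1", "c2", "c3"])

changed = moved(before, after)        # partitions that must change hands
unaffected = set(after) - changed     # partitions that keep their owner
```

An eager rebalance would pause all six partitions while the new assignment is computed. A cooperative one only needs to revoke the partitions in `changed`, so the consumers that own `unaffected` can keep reading.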
VuTrinh. 339 implied HN points 25 May 24
  1. Twitter processes an incredible 400 billion events daily, using a mix of technologies for handling large data flows. They built special tools to ensure they can keep up with all this information in real-time.
  2. After facing challenges with their old setup, Twitter switched to a new architecture that simplified operations. This new system allows them to handle data much faster and more efficiently.
  3. With the new system, Twitter achieved lower latency and fewer errors in data processing. This means they can get more accurate results and better manage their resources than before.
Monthly Python Data Engineering 59 implied HN points 19 Aug 24
  1. DataFusion Comet was released, making it easier and faster to use Apache Spark for data processing.
  2. Several major data tools like DataFusion, Arrow, and Dask updated their versions, showing ongoing improvements in speed, efficiency, and new features.
  3. New dashboard solutions like Panel and updates in libraries such as cuDF reflect the growing interest in making data access and visualization easier for users.
VuTrinh. 659 implied HN points 23 Mar 24
  1. Uber handles huge amounts of data by processing real-time information from drivers, riders, and restaurants. This helps them make quick decisions, like adjusting prices based on demand.
  2. They use a mix of open-source tools like Apache Kafka for data streaming and Apache Flink for processing, which allow them to scale their operations smoothly as the business grows.
  3. Uber values data consistency, high availability, and quick response times in their infrastructure. This means they need reliable systems that work well even when they're overloaded with data.
Data Science Weekly Newsletter 999 implied HN points 12 Jan 24
  1. Using ChatGPT can help you budget better. It can track and categorize your spending easily.
  2. When coding, it's important to find a balance between moving quickly and keeping your code well-structured. This is a real challenge for many developers.
  3. Language models, like GPT-4, are becoming very advanced, but there are big philosophical questions about what that really means for intelligence and understanding.
VuTrinh. 139 implied HN points 09 Jul 24
  1. Uber recently introduced Kafka Tiered Storage, which allows storage and compute resources to work separately. This means you can add storage without needing to upgrade processing power.
  2. The tiered storage system has two parts: local storage for fast access and remote storage for long-term data. This setup helps manage data efficiently and keeps the local storage less cluttered.
  3. Older data can be read directly from remote storage when it is needed, while applications that need quick access to recent messages are still served from fast local storage.
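The two-tier layout described above can be sketched as a toy log. `TieredLog` is a hypothetical class, and moving one message at a time is a simplification chosen for brevity. It makes no claim about how Kafka or Uber actually offload data.

```python
class TieredLog:
    """Toy log that keeps recent messages locally and older ones remotely."""

    def __init__(self, local_capacity):
        self.local_capacity = local_capacity
        self.local = {}    # offset -> message, the fast tier
        self.remote = {}   # offset -> message, the long-term tier
        self.next_offset = 0

    def append(self, message):
        offset = self.next_offset
        self.local[offset] = message
        self.next_offset += 1
        # Offload the oldest entries once the local tier is over capacity.
        while len(self.local) > self.local_capacity:
            oldest = min(self.local)
            self.remote[oldest] = self.local.pop(oldest)
        return offset

    def read(self, offset):
        """Return (message, tier) so the caller can see where the read was served."""
        if offset in self.local:
            return self.local[offset], "local"
        return self.remote[offset], "remote"


log = TieredLog(local_capacity=2)
for message in ["a", "b", "c", "d"]:
    log.append(message)

old_read = log.read(0)      # served from the remote tier
recent_read = log.read(3)   # served from the local tier
```

Growing `remote` does not require changing `local_capacity`, which is the storage and compute separation the post refers to.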
VuTrinh. 119 implied HN points 16 Jul 24
  1. Meta uses a complex data warehouse to manage millions of tables and keeps data only as long as it's needed. Data is organized into namespaces for efficient querying.
  2. They built tools like iData for data discovery and Scuba for real-time analytics. These tools help engineers find and analyze data quickly.
  3. Data engineers at Meta develop pipelines mainly with SQL and Python, using internal tools for orchestration and monitoring to ensure everything runs smoothly.
Data Science Weekly Newsletter 959 implied HN points 29 Dec 23
  1. This week, there's a focus on using data science techniques for practical decision-making, highlighted by an interview with Steven Levitt, who discusses making tough choices using data.
  2. There's a roundup of AI developments from 2023, showing how the field has evolved over the past year, which can help professionals stay updated.
  3. Understanding data quality is essential, as it directly impacts how useful data is for decision-making and analysis in any organization.
VuTrinh. 179 implied HN points 18 Jun 24
  1. Airbnb focuses on using open-source tools and contributing back to the community. This helps them build a strong and collaborative data infrastructure.
  2. Their data infrastructure prioritizes scalability and uses specific clusters for different types of jobs. This approach ensures that critical tasks run efficiently without overwhelming the system.
  3. Airbnb has improved their data processing performance significantly, reducing costs while increasing speed. This was achieved through careful planning and migration of their Hadoop clusters.
VuTrinh. 159 implied HN points 22 Jun 24
  1. Uber uses a Remote Shuffle Service (RSS) to handle large amounts of Spark shuffle data more efficiently. This means data is sent to a remote server instead of being saved on local disks during processing.
  2. By changing how data is transferred, the new system helps reduce failures and improve the lifespan of hardware. Now, servers can handle more jobs without crashing and SSDs last longer.
  3. RSS also streamlines the process for the reduce tasks, as they now only need to pull data from one server instead of multiple ones. This saves time and resources, making everything run smoother.
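The "one server per reduce task" idea can be sketched in a few lines. This is an illustrative model only: the function names, the byte-sum hash, and the partition-to-server mapping are assumptions made for the example, not details of Uber's service.

```python
from collections import defaultdict


def push_to_shuffle_servers(map_outputs, num_partitions, num_servers):
    """Each map task pushes its records to the server that owns the partition."""
    servers = [defaultdict(list) for _ in range(num_servers)]
    for records in map_outputs:
        for key, value in records:
            # Stable toy hash so the example is deterministic across runs.
            partition = sum(key.encode()) % num_partitions
            servers[partition % num_servers][partition].append((key, value))
    return servers


def reduce_partition(servers, partition):
    """A reduce task pulls all data for its partition from a single server."""
    server = servers[partition % len(servers)]
    totals = defaultdict(int)
    for key, value in server[partition]:
        totals[key] += value
    return dict(totals)


# Two hypothetical map tasks emitting (key, count) records.
map_outputs = [[("a", 1), ("b", 2)], [("a", 3)]]
servers = push_to_shuffle_servers(map_outputs, num_partitions=2, num_servers=2)

partition_0 = reduce_partition(servers, 0)
partition_1 = reduce_partition(servers, 1)
```

Without the remote service, the reduce task for partition 1 would have to fetch from both map tasks. Here it reads from exactly one server.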
Data Science Weekly Newsletter 119 implied HN points 04 Jul 24
  1. Staying updated in data science, AI, and machine learning is essential for improving skills and knowledge. Weekly newsletters provide curated articles and resources that help you keep up with the latest trends.
  2. Effective structuring of data science teams can greatly enhance productivity. Learning from past experiences on team reorganizations can help in clarifying roles and increasing effectiveness.
  3. Building interactive dashboards in Python can make data more accessible. Using tools like PostgreSQL and specific libraries can simplify the process and enhance data visualization.
The Data Ecosystem 159 implied HN points 16 Jun 24
  1. The data lifecycle includes all the steps from when data is created until it is no longer needed. This helps organizations understand how to manage and use their data effectively.
  2. Different people and companies might describe the data lifecycle in slightly different ways, which can be confusing. It's important to have a clear understanding of what each term means in context.
  3. Properly managing data involves stages like storage, analysis, and even disposal or archiving. This ensures data remains useful and complies with regulations.
Data Science Weekly Newsletter 179 implied HN points 07 Jun 24
  1. Curiosity in data science is important. It's essential to critically assess the quality and reliability of the data and models we use, especially when making claims about complex issues like COVID-19.
  2. New fields, like neural systems understanding, are blending different disciplines to explore complex questions. This approach can help unravel how understanding works in both humans and machines.
  3. Understanding AI advancements requires keeping track of evolving resources. It’s helpful to have a well-organized guide to the latest in AI learning resources as the field grows rapidly.
Data Science Weekly Newsletter 99 implied HN points 11 Jul 24
  1. Large language models can sometimes create false or confusing information, a problem known as hallucination. Understanding the cause of these mistakes can help improve their accuracy.
  2. Good data visualizations are important to effectively communicate patterns and insights. Poorly designed visuals can lead to misunderstandings, especially among those not familiar with graphics.
  3. There's an ongoing debate about copyright in the context of generative AI. Many believe it would be better to focus on finding compromises rather than pursuing strict legal battles.
Data Science Weekly Newsletter 139 implied HN points 20 Jun 24
  1. Notebooks are easy to use, but they can encourage lazy coding habits. It's important to follow good practices even when using them.
  2. When handling large datasets, it's crucial to learn how to scale effectively. Knowing how to use resources wisely can help you reach your goals faster.
  3. Retrieval Augmented Generation (RAG) can improve how models generate information. It's complex, but understanding it can boost the performance of your projects.
Data Science Weekly Newsletter 79 implied HN points 18 Jul 24
  1. AI research in China is progressing rapidly, but it hasn't received much attention compared to developments in the US. There are many complexities in understanding the implications of this advancement.
  2. There are new methods to improve large language models (LLMs) using production data, which can enhance their performance over time. A structured approach to analyzing data quality can lead to better outcomes.
  3. Evaluating modern machine learning models can be challenging, leading to some questionable research practices. It's important to understand these issues to ensure more accurate and reproducible results.
VuTrinh. 139 implied HN points 15 Jun 24
  1. Apache Druid is built to handle real-time analytics on large datasets, making it faster and more efficient than Hadoop for certain tasks.
  2. Druid uses a variety of node types—like real-time, historical, broker, and coordinator nodes—to manage data, process queries, and ensure everything runs smoothly.
  3. The architecture allows for quick data retrieval while maintaining high availability and performance, making it a strong choice for applications that need fast, interactive data exploration.
Data Science Weekly Newsletter 159 implied HN points 31 May 24
  1. Mediocre machine learning can be very risky for businesses, as it may lead to significant financial losses. Companies need to ensure their ML products are reliable and efficient.
  2. Understanding logistic regression can be made easier by using predicted probabilities. This approach helps in clearly presenting data analysis results, especially to those who may not be familiar with technical terms.
  3. Data quality management is becoming essential in today's data-driven world. It's important to keep track of how data is tested and monitored to maintain trust and accuracy in business decisions.
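The point about predicted probabilities can be made concrete with the logistic function, which turns a model's log-odds into a probability. The coefficients below are made-up values for illustration, not results from the linked article.

```python
import math


def predicted_probability(intercept, slope, x):
    """Convert the linear predictor (log-odds) of a logistic model into a probability."""
    log_odds = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical fitted coefficients, chosen only for this example.
INTERCEPT, SLOPE = -2.0, 0.5

# Probabilities at a few predictor values are easier to present than raw coefficients.
probabilities = {x: predicted_probability(INTERCEPT, SLOPE, x) for x in (0, 4, 8)}
```

Reporting that the predicted probability rises as x goes from 0 to 8 is usually easier for a non-technical audience to follow than reporting a log-odds slope of 0.5.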
Data Science Weekly Newsletter 99 implied HN points 27 Jun 24
  1. Data visualization can show important patterns, like changes in night and daylight globally. Understanding these trends helps us appreciate our environment better.
  2. In AI engineering, simplifying data preparation is crucial. Many new AI applications can be built without structured data, which might lead to rushed expectations about their effectiveness.
  3. Aquaculture technology is evolving with better methods to track and analyze fish behavior. New approaches like deep learning are making monitoring more accurate and efficient.
Data Science Weekly Newsletter 279 implied HN points 05 Apr 24
  1. AI agents have unique challenges that traditional laws may not effectively solve. New rules and systems are needed to ensure they are managed properly.
  2. JS-Torch is a new JavaScript library that makes deep learning easier for developers familiar with PyTorch. It allows building and training neural networks directly in the browser.
  3. Data acquisition is crucial for AI start-ups to succeed. There are strategies outlined to help these businesses gather the right data efficiently.
VuTrinh. 99 implied HN points 25 Jun 24
  1. Uber is moving its huge amount of data to Google Cloud to keep up with its growth. They want a smooth transition that won't disrupt current users.
  2. They are using existing technologies to make sure the change is easy. This includes tools that will help keep data safe and accessible during the move.
  3. Managing costs is a big concern for Uber. They plan to track and control spending carefully as they switch to cloud services.
Data Science Weekly Newsletter 219 implied HN points 19 Apr 24
  1. Statistical ideas have a big impact on the world. Learning about important papers can help us understand how statistics shape modern research and decision-making.
  2. Machine Learning teams have different roles that face unique challenges. Understanding these personas can help leaders support their teams better.
  3. Using vector embeddings can greatly improve search experiences in apps. They simplify tasks that previously seemed too complex, which shows how broadly useful they are.
VuTrinh. 179 implied HN points 04 May 24
  1. Delta Lake is designed to solve problems with traditional cloud object storage. It provides ACID transactions, making data operations like updates and deletions safe and reliable.
  2. Using Delta Lake, data is stored in Apache Parquet format, allowing for efficient reading and writing. The system tracks changes through a transaction log, which keeps everything organized and easy to manage.
  3. Delta Lake supports advanced features like time travel, allowing users to see and revert to past versions of data. This makes it easier to recover from mistakes and manage data over time.
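How a transaction log enables time travel can be sketched by replaying commits. This is a simplified model, not the Delta Lake format: the commit dictionaries, the file names, and the `snapshot` helper are all hypothetical.

```python
# Each commit records which data files it adds and which it removes.
log = [
    {"add": ["part-0.parquet"], "remove": []},                   # version 0
    {"add": ["part-1.parquet"], "remove": []},                   # version 1
    {"add": ["part-2.parquet"], "remove": ["part-0.parquet"]},   # version 2
]


def snapshot(log, version):
    """Replay commits 0..version to find the files that make up the table."""
    files = set()
    for commit in log[: version + 1]:
        files -= set(commit["remove"])
        files |= set(commit["add"])
    return files


latest = snapshot(log, 2)     # current table contents
previous = snapshot(log, 1)   # "time travel" to the version before the rewrite
```

Because old commits are never edited, reading an earlier version is just a shorter replay of the same log. Reverting amounts to writing a new commit that restores an earlier file set.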
Data Science Weekly Newsletter 139 implied HN points 24 May 24
  1. Good communication is key for statisticians to explain their complex work to non-experts. Finding ways to relate data to everyday situations can make it easier for others to understand.
  2. Using histograms can speed up the training process for gradient boosted machines in data science. This simple technique can improve efficiency significantly.
  3. There are efforts to use machine learning algorithms to detect type 1 diabetes in children earlier. This can help avoid serious health issues by improving recognition of symptoms.
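The histogram technique from the second takeaway can be sketched as follows. It assumes equal-width bins and a squared-sum gain, both picked for simplicity. Real gradient boosting libraries differ in how they bin features and score splits, and `histogram_split` is a hypothetical helper.

```python
def histogram_split(values, gradients, n_bins=4):
    """Find a split threshold by scanning bin boundaries instead of every value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # fall back to 1.0 if all values are equal

    # One pass over the data builds per-bin gradient sums and counts.
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)
        sums[b] += g
        counts[b] += 1

    total_sum, total_count = sum(sums), sum(counts)
    best_threshold, best_gain = None, 0.0
    left_sum, left_count = 0.0, 0

    # Only n_bins - 1 candidate splits are evaluated, however many rows there are.
    for b in range(n_bins - 1):
        left_sum += sums[b]
        left_count += counts[b]
        right_sum = total_sum - left_sum
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        gain = (left_sum ** 2 / left_count
                + right_sum ** 2 / right_count
                - total_sum ** 2 / total_count)
        if gain > best_gain:
            best_threshold, best_gain = lo + (b + 1) * width, gain
    return best_threshold, best_gain


values = [1, 2, 3, 4, 5, 6, 7, 8]
gradients = [-1, -1, -1, -1, 1, 1, 1, 1]
threshold, gain = histogram_split(values, gradients)
```

The speed-up comes from the final loop: its cost depends on the number of bins, not on the number of distinct feature values.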
VuTrinh. 119 implied HN points 04 Jun 24
  1. Uber is upgrading its data system by moving from its huge Hadoop setup to Google Cloud Platform for better efficiency and performance.
  2. Apache Iceberg is an important tool for managing data efficiently, and it can help create a more organized data environment.
  3. Building data products requires a strong foundation in data engineering, which includes understanding the tools and processes involved.
VuTrinh. 79 implied HN points 29 Jun 24
  1. YouTube built Procella to combine different data processing needs into one powerful SQL query engine. This means they can handle many tasks, like analytics and reporting, without needing separate systems for each task.
  2. Procella is designed for high performance and scalability by keeping computing and storage separate. This makes it faster and more efficient, allowing for quick data access and analysis.
  3. The engine uses clever techniques to reduce delays and improve response times, even when many users are querying at once. It constantly optimizes and adapts, making sure users get their data as quickly as possible.
VuTrinh. 139 implied HN points 21 May 24
  1. Working on pet projects is fun, but it's important to have clear learning goals to actually gain knowledge from them.
  2. When using tools like Spark or Airflow, always ask what problem they solve to understand their value better.
  3. To make your projects more effective, think like a user and check if they get what they need from your data systems.
Data Science Weekly Newsletter 259 implied HN points 22 Mar 24
  1. Data storytelling is important for sharing insights, and AI can help people create better stories. The research looks at how different tools assist in each storytelling stage.
  2. Switching from R to Python in data science isn't just about learning new syntax; it's a mindset change. New Python tools can help make this transition smoother for users coming from R's tidyverse.
  3. Emerging technologies often face skepticism, as seen throughout history. New inventions have raised concerns about their impact, but they eventually become part of everyday life.
Data Science Weekly Newsletter 379 implied HN points 02 Feb 24
  1. Forecasting in data science is challenging because time series data can be non-stationary. Using the right evaluation methods can help bridge the gap between traditional and modern forecasting techniques.
  2. It's important to consider the smartness of your data structures. Creating overly complicated dashboards that ultimately just produce simple outputs may not be the best use of time.
  3. There are clear distinctions between well-built data pipelines and amateur setups. Understanding what makes a pipeline production-grade can improve the quality and reliability of data processing.
Monthly Python Data Engineering 2 HN points 26 Sep 24
  1. A new free book called 'How Data Platforms Work' is being created for Python developers. It will explain the inner workings of data platforms in simple terms, with one chapter released each month.
  2. The Ibis library has removed the Pandas backend and now uses DuckDB, which is faster and has fewer dependencies. This change is expected to improve performance and usability.
  3. Several popular Python libraries, such as Great Tables and Shiny, have released updates with new features and improvements, focusing on better usability and integration with modern technologies.
Data Science Weekly Newsletter 339 implied HN points 09 Feb 24
  1. Satellite data is important for machine learning and should be treated as a unique area of research. Recognizing this can help improve how we use this data.
  2. Many data science and machine learning projects fail from the start due to common mistakes. Learning from past experiences can help increase the chances of success.
  3. Open source software plays a crucial role in advancing AI technology. It's important to support and protect open source AI from regulations that could harm its progress.