The hottest Data Engineering Substack posts right now

And their main takeaways
Ju Data Engineering Newsletter 396 implied HN points 28 Oct 24
  1. Improving the user interface is crucial for getting more teams to adopt Iceberg, especially teams that do their data work in Python.
  2. PyIceberg, which is a Python implementation, is evolving quickly and currently supports various catalog and file system types.
  3. While PyIceberg makes it easy to read and write data (a minimal sketch follows below), it still has limitations compared to using Iceberg with Spark, such as handling deletes and managing metadata.
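For a sense of what that Python path looks like, here is a minimal PyIceberg sketch; the catalog name, table identifier, filter, and columns are hypothetical and assume a catalog is already configured (for example via .pyiceberg.yaml).

```python
# Minimal PyIceberg sketch: scan a table into Arrow, then append new rows.
# The catalog "default" and table "db.events" are hypothetical placeholders.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("db.events")

# Read: push a filter and a column projection down into the scan.
clicks = table.scan(
    row_filter="event_type = 'click'",
    selected_fields=("user_id",),
).to_arrow()

# Write: append an Arrow table whose schema matches the Iceberg table.
new_rows = pa.table({
    "user_id": pa.array([1, 2], pa.int64()),
    "event_type": pa.array(["click", "view"]),
})
table.append(new_rows)
```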
SeattleDataGuy’s Newsletter 376 implied HN points 12 Feb 25
  1. Having a clear plan is crucial for successful data migration projects. You need to know what to move and in what order to avoid chaos.
  2. Ownership of the migration process is important. There should be a clear leader or team responsible to keep everything on track.
  3. Testing data after migration is a must. Just moving the data doesn't guarantee that it works the same way, so check for any discrepancies.
SeattleDataGuy’s Newsletter 812 implied HN points 06 Feb 25
  1. Data engineers are often seen as roadblocks, but cutting them out can lead to major problems later on. Without them, the data can become messy and unmanageable.
  2. Initially, removing data engineers may seem like a win because things move quickly. However, this speed can cause chaos as data quality suffers and standards break down.
  3. A solid data strategy needs structure and governance. Rushing without proper planning can lead to a situation where everything collapses under the weight of disorganization.
Ju Data Engineering Newsletter 515 implied HN points 17 Oct 24
  1. The use of Iceberg allows for separate storage and compute, making it easier to connect single-node engines to the data pipeline without needing extra steps.
  2. There are different approaches to integrating single-node engines, including running all processes in one worker or handling each transformation with separate workers.
  3. Partitioning data can improve efficiency by allowing independent processing of smaller chunks, which eases memory constraints and speeds up data handling (see the sketch after this list).
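As one illustration of that pattern, here is a hedged sketch of a single-node engine (DuckDB with its iceberg extension) reading one partition's worth of data straight from an Iceberg table; the S3 path, column names, and date are hypothetical, and httpfs/credential setup is omitted.

```python
# Sketch: a single-node engine (DuckDB) querying an Iceberg table in place.
# The table path and columns are placeholders; S3 credentials are omitted.
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg;")
con.execute("LOAD iceberg;")

# Each worker can read just the slice (e.g. one day) it is responsible for.
rows = con.execute("""
    SELECT event_type, count(*) AS n
    FROM iceberg_scan('s3://my-bucket/warehouse/db/events')
    WHERE event_date = DATE '2024-10-01'
    GROUP BY event_type
""").fetchall()
print(rows)
```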
Data People Etc. 231 implied HN points 11 Feb 25
  1. Data is more powerful when it has a purpose. It should tell a clear story, otherwise it's just clutter.
  2. Building a strong data system is like creating a world. A good structure connects different pieces and helps everyone understand the bigger picture.
  3. Data engineering is important because it helps manage and present large amounts of information, making sure everything works smoothly and accurately.
VuTrinh. 1658 implied HN points 24 Aug 24
  1. Parquet is a special file format that organizes data in columns. This makes it easier and faster to access specific data when you don't need everything at once.
  2. The structure of Parquet involves grouping data into row groups and column chunks. This helps balance the performance of reading and writing data, allowing users to manage large datasets efficiently.
  3. Parquet uses smart techniques like dictionary and run-length encoding to save space. These methods reduce the amount of data stored and speed up reading by minimizing the data that needs to be scanned (see the pyarrow sketch below).
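A quick sketch of those ideas with pyarrow, writing explicit row groups with dictionary encoding and then reading back a single column; the file name and data are illustrative.

```python
# Sketch: Parquet row groups, dictionary encoding, and column pruning with pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "country": ["US", "US", "DE", "DE"] * 250_000,  # low cardinality: dictionary-encodes well
    "amount": list(range(1_000_000)),
})

# Each row group is an independently readable chunk of rows.
pq.write_table(
    table,
    "sales.parquet",
    row_group_size=100_000,
    use_dictionary=True,     # dictionary encoding for repeated values
    compression="snappy",
)

# Columnar layout: read one column without scanning the rest of the file.
countries = pq.read_table("sales.parquet", columns=["country"])
print(pq.ParquetFile("sales.parquet").metadata.num_row_groups)  # 10 row groups
```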
VuTrinh. 399 implied HN points 17 Sep 24
  1. Metadata is really important because it helps organize and access data efficiently. It tells systems where files are and which ones can be ignored during processing.
  2. Google's BigQuery uses a unique system to manage metadata that allows for quick access and analysis of huge datasets. Instead of putting metadata with the data, it keeps them separate but organized in a smart way.
  3. The way BigQuery handles metadata improves performance by making sure that only the relevant information is accessed when running queries. This helps save time and resources, especially with very large data sets.
VuTrinh. 859 implied HN points 03 Sep 24
  1. Kubernetes is a powerful tool for managing containers, which are bundles of apps and their dependencies. It helps you run and scale many containers across different servers smoothly.
  2. Understanding how Kubernetes works is key. It compares the actual state of your application with the desired state to make adjustments, ensuring everything runs as expected.
  3. To start with Kubernetes, begin small and simple. Use local tools for practice, and learn step-by-step to avoid feeling overwhelmed by its many components.
VuTrinh. 139 implied HN points 24 Sep 24
  1. Google's BigLake allows users to access and manage data across different storage solutions like BigQuery and object storage. This makes it easier to work with big data without needing to move it around.
  2. The Storage API enhances BigQuery by letting external tools like Apache Spark and Trino directly access its stored data, speeding up data processing and analysis (a small read sketch follows this list).
  3. BigLake tables offer strong security features and better performance for querying open-source data formats, making it a more robust option for businesses that need efficient data management.
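As a rough illustration of reading through the Storage API from Python, here is a hedged sketch with the google-cloud-bigquery client; the project and table names are hypothetical, and the google-cloud-bigquery-storage package is assumed to be installed.

```python
# Sketch: streaming a BigQuery table into a DataFrame via the Storage Read API
# instead of the slower row-based REST path. Project and table are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

rows = client.list_rows("my-project.analytics.events")
# create_bqstorage_client=True pulls the rows over the BigQuery Storage API.
df = rows.to_dataframe(create_bqstorage_client=True)
print(df.head())
```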
VuTrinh. 279 implied HN points 14 Sep 24
  1. Uber evolved from simple data management with MySQL to a more complex system using Hadoop to handle huge amounts of data efficiently.
  2. They faced challenges with data reliability and latency, which slowed down their ability to make quick decisions.
  3. Uber introduced a system called Hudi that allowed for faster updates and better data management, helping them keep their data fresh and accurate.


davidj.substack 59 implied HN points 12 Feb 25
  1. SDF and SQLMesh are alternatives to dbt for data transformation. They are both built with modern tech and aim to provide better ease of use and performance.
  2. SDF has a built-in local database, allowing developers to test queries without costs from a cloud data warehouse. This can speed up development and reduce costs.
  3. Both tools offer column-level lineage to track changes, but SQLMesh provides a better workflow for managing breaking changes. SQLMesh also has unique features like Virtual Data Environments that enhance developer experience.
VuTrinh. 519 implied HN points 27 Aug 24
  1. AutoMQ enables Kafka to run entirely on object storage, which improves efficiency and scalability. This design removes the need for tightly-coupled compute and storage, allowing more flexible resource management.
  2. AutoMQ uses a unique caching system to handle data, which helps maintain fast performance for both recent and historical data. It has separate caches for immediate and long-term data needs, enhancing read and write speeds.
  3. Reliability in AutoMQ is ensured through a Write Ahead Log system using AWS EBS, which helps recover data after crashes. This setup allows for fast failover and data persistence, so no messages get lost.
VuTrinh. 799 implied HN points 10 Aug 24
  1. Apache Iceberg is a table format that helps manage data in a data lake. It makes it easier to organize files and allows users to interact with data without worrying about how it's stored.
  2. Iceberg has a three-layer architecture: data, metadata, and catalog, which work together to track and manage the actual data and its details. This structure allows for efficient querying and data operations.
  3. One cool feature of Iceberg is its ability to time travel, meaning you can access previous versions of your data. This lets you see changes and retrieve earlier data as needed (sketch below).
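Here is a hedged PyIceberg sketch of that time-travel idea, reading the table as of an older snapshot; the catalog and table names are hypothetical.

```python
# Sketch: Iceberg time travel with PyIceberg. "default" and "db.orders" are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("db.orders")

# Every commit adds an entry to the table's snapshot log.
first_commit = table.history()[0]

current_rows = table.scan().to_arrow()
rows_back_then = table.scan(snapshot_id=first_commit.snapshot_id).to_arrow()
print(len(current_rows), "rows now vs", len(rows_back_then), "rows at the first snapshot")
```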
SeattleDataGuy’s Newsletter 800 implied HN points 20 Dec 24
  1. Being proactive means solving problems before they become bigger issues. If you see something that can be improved, go ahead and make that change instead of waiting for someone else to do it.
  2. Make sure your contributions are visible, so people recognize your work. Share your successes and updates with your team and leadership to build a stronger reputation.
  3. Become the go-to person for a specific area in your company. Focus on something valuable that can help others succeed, and make sure to share your knowledge and support with your team.
VuTrinh. 339 implied HN points 31 Aug 24
  1. Apache Iceberg organizes data into a data layer and a metadata layer, making it easier to manage large datasets. The data layer holds the actual records, while the metadata layer keeps track of those records and their changes.
  2. Iceberg's manifest files help improve read performance by storing statistics for multiple data files in one place. This means the reader can access all needed statistics without opening each individual data file.
  3. Hidden partitioning in Iceberg allows users to filter data without needing extra columns, saving space. It records column transformations (such as day(ts)) in table metadata instead, which streamlines queries and lets the engine prune files automatically (see the sketch below).
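A hedged sketch of declaring a hidden partition with PyIceberg; the catalog, table name, schema, and field ids are illustrative only.

```python
# Sketch: hidden partitioning, i.e. partition by day(ts) with no separate date column.
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, LongType, StringType, TimestampType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform

schema = Schema(
    NestedField(field_id=1, name="user_id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="event_type", field_type=StringType(), required=False),
    NestedField(field_id=3, name="ts", field_type=TimestampType(), required=False),
)

# The day() transform lives in table metadata; queries that filter on ts can
# prune files without ever referencing a physical "ts_day" column.
spec = PartitionSpec(
    PartitionField(source_id=3, field_id=1000, transform=DayTransform(), name="ts_day")
)

catalog = load_catalog("default")
catalog.create_table("db.events", schema=schema, partition_spec=spec)
```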
SeattleDataGuy’s Newsletter 847 implied HN points 14 Dec 24
  1. Working in big tech offers many advantages like better tools and a strong focus on data. This environment makes it easier to get work done quickly and efficiently.
  2. Many companies outside big tech struggle with data because it's not their main focus. They often use a mix of different tools that don't work well together, leading to confusion.
  3. Without a strong data leader, companies may find it hard to prioritize data spending. If data isn't tied to profits, it's tougher to justify investing time and money into it.
VuTrinh. 399 implied HN points 20 Aug 24
  1. Discord started with its own tool called Derived to manage data, but it found this system limited as it grew. They needed a better way to handle complex data tasks.
  2. They switched to using popular tools like Dagster and dbt. This helped them automate and better manage their data processes (a small Dagster sketch follows this list).
  3. With the new setup, Discord can now make changes quickly and safely, which improves how they analyze and use their vast amounts of data.
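To make the shape of such a setup concrete, here is a hedged sketch of two dependent Dagster assets; the asset names and logic are invented, and in a real project dbt models can additionally be loaded as assets through the dagster-dbt integration.

```python
# Sketch: two dependent Dagster software-defined assets. Names and logic are illustrative.
from dagster import asset, materialize

@asset
def raw_messages() -> list[dict]:
    # In a real pipeline this would pull from a warehouse, queue, or API.
    return [{"id": 1, "text": "hello"}, {"id": 2, "text": ""}]

@asset
def cleaned_messages(raw_messages: list[dict]) -> list[dict]:
    # Dagster infers the dependency from the parameter name.
    return [m for m in raw_messages if m["text"]]

if __name__ == "__main__":
    materialize([raw_messages, cleaned_messages])
```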
VuTrinh. 519 implied HN points 06 Aug 24
  1. Notion uses a flexible block system, letting users customize how they organize their notes and projects. Each block can be changed and moved around, making it easy to create what you need.
  2. To manage the huge amount of data, Notion shifted from a single database to a more complex setup with multiple shards and instances. This change helps them handle growing user demand and analytics needs more efficiently.
  3. By creating an in-house data lake, Notion saved a lot of money and improved data processing speed. This new system allows them to quickly get data from their main database for analytics and support new features like AI.
Data Science Weekly Newsletter 139 implied HN points 05 Sep 24
  1. AI prompt engineering is becoming more important, and experts share helpful tips on how to improve your skill in this area.
  2. Researchers in AI should focus on making an impact through their work by creating open-source resources and better benchmarks.
  3. Data quality is a common concern in many organizations, yet many leaders struggle to prioritize it properly and invest in solutions.
SeattleDataGuy’s Newsletter 730 implied HN points 21 Nov 24
  1. It's important to avoid building complex systems just for the sake of it. Focus on creating infrastructure that actually helps your team and the business.
  2. If you don’t plan your data model, you’ll end up with a messy one. Always take the time to design it properly to make future work easier.
  3. Good communication is really powerful. Being able to share your ideas clearly can help you get support and make a bigger impact in your projects.
Data Science Weekly Newsletter 179 implied HN points 29 Aug 24
  1. Distributed systems are changing a lot. This affects how we operate and program these systems, making them more secure and easier to manage.
  2. Statistics are really important in everyday life, even if we don't see it. Talks this year aim to inspire students to understand and appreciate statistics better.
  3. Understanding how AI models work internally is a growing field. Many AI systems are complex, and researchers want to learn how they make decisions and produce outputs.
VuTrinh. 279 implied HN points 17 Aug 24
  1. Facebook's real-time data processing system needs to handle huge amounts of data quickly, with only a few seconds of wait time. This helps in keeping things running smoothly for users.
  2. Their system uses a message bus called Scribe to connect different parts, making it easier to manage data flow and recover from errors. This setup improves how they deal with issues when they arise.
  3. Different tools like Puma and Stylus allow developers to build applications in different ways, depending on their needs. This means teams can quickly create and improve their applications over time.
VuTrinh. 299 implied HN points 13 Aug 24
  1. LinkedIn uses Apache Kafka to manage a massive flow of information, handling around 7 trillion messages every day. They set up a complex system of clusters and brokers to ensure everything runs smoothly.
  2. To keep everything organized, LinkedIn has a tiered system where data is processed locally in each data center, then sent to an aggregate cluster. This helps them avoid issues from moving data across different locations.
  3. LinkedIn has an auditing tool to make sure all messages are tracked and nothing gets lost during transmission. This helps them quickly identify any problems and fix them efficiently.
VuTrinh. 359 implied HN points 30 Jul 24
  1. Netflix's data engineering stack uses tools like Apache Iceberg and Spark for building batch data pipelines. This helps them transform and manage large amounts of data efficiently.
  2. For real-time data processing, Netflix relies on Apache Flink and a tool called Keystone. This setup makes it easier to handle streaming data and send it where it needs to go.
  3. To ensure data quality and scheduling, Netflix has developed tools like the WAP (Write-Audit-Publish) pattern for auditing data and Maestro for managing workflows. These tools help keep the data process organized and reliable.
VuTrinh. 299 implied HN points 03 Aug 24
  1. LinkedIn's data infrastructure is organized into three main tiers: data, service, and display. This setup helps the system to scale easily without moving data around.
  2. Voldemort is LinkedIn's key-value store that efficiently handles high-traffic queries and allows easy scaling by adding new nodes without downtime.
  3. Databus is a change data capture system that keeps LinkedIn's databases synchronized across applications, allowing for quick updates and consistent data flow.
VuTrinh. 539 implied HN points 06 Jul 24
  1. Apache Kafka is a system for handling large amounts of data messages, making it easier for companies like LinkedIn to manage and analyze user activity and other important metrics.
  2. In Kafka, messages are organized into topics and divided into partitions, allowing for better performance and scalability. This way, different servers can handle parts of the data at once (see the sketch after this list).
  3. Kafka uses a pull model for consumers, meaning they can request data as they need it. This helps prevent overwhelming the consumers with too much data at once.
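A hedged sketch of creating such a partitioned topic with confluent-kafka's admin client; the broker address, topic name, and counts are placeholders.

```python
# Sketch: create a topic with several partitions so brokers and consumers share the load.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

futures = admin.create_topics(
    [NewTopic("page-views", num_partitions=6, replication_factor=3)]
)
for topic, future in futures.items():
    future.result()  # raises if the broker rejected the request
    print(f"created topic {topic}")
```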
SeattleDataGuy’s Newsletter 541 implied HN points 14 Nov 24
  1. Use the 100-Day Data Engineering Crash Course to start learning the basics of data engineering. It covers important topics like SQL, programming, and Cloud technologies.
  2. Creating your own data projects can help you stand out. The Data Engineering Side Project Idea Template will guide you in planning unique projects that add value.
  3. Prepare well before job interviews with the Data Engineer Interview Study Guide. Always check with the recruiter about what to study so you can be ready.
VuTrinh. 339 implied HN points 23 Jul 24
  1. AWS offers a variety of tools for data engineering like S3, Lambda, and Step Functions, which can help anyone build scalable projects. These tools are often underused compared to newer options but are still very effective.
  2. Services like SNS and SQS can help manage data flow and processing. SNS allows for publishing messages while SQS aids in handling high event volumes asynchronously (a boto3 sketch follows this list).
  3. Using AWS for data engineering is often simpler than switching to modern tools. It's easier to add new AWS services to your existing workflow than to migrate to something completely new.
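A hedged boto3 sketch of that SNS-to-SQS pattern; the topic ARN, queue URL, and message contents are hypothetical placeholders.

```python
# Sketch: publish an event to SNS, then drain a subscribed SQS queue asynchronously.
# ARNs, URLs, and payloads are placeholders.
import json
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/new-files-queue"

# Publish once; every subscribed queue gets its own copy of the message.
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:new-files",
    Message=json.dumps({"bucket": "raw-data", "key": "2024/10/28/events.json"}),
)

# A worker pulls messages at its own pace (long polling).
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    print(msg["Body"])  # the SNS envelope, unless raw message delivery is enabled
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```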
Data Science Weekly Newsletter 139 implied HN points 22 Aug 24
  1. When building web applications, using Postgres for data storage is a good default choice. It's reliable and widely used.
  2. A new study shows that agents can learn useful skills without rewards or guidance. They can explore and develop abilities just from observing a goal.
  3. The list of important books and resources in Bayesian statistics is being compiled. It's a way to recognize influential ideas in this field.
SeattleDataGuy’s Newsletter 400 implied HN points 31 Oct 24
  1. SFTP stands for SSH File Transfer Protocol (often called Secure File Transfer Protocol), and it's a popular method for companies to send and receive data securely, like sending packages in the digital world. Many businesses, even big tech ones, still rely on SFTP instead of newer methods.
  2. Setting up SFTP jobs requires careful planning, especially for user authentication and file encryption. Using SSH keys and methods like PGP encryption helps ensure the data remains safe during transfers (a small upload sketch follows this list).
  3. Although there are more advanced data-sharing technologies emerging, SFTP isn't going away anytime soon. Many companies still rely on SFTP for their data needs, showing its continued importance in the industry.
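For a sense of the authentication side, here is a hedged paramiko sketch of an SSH-key-based upload; the host, user, and paths are placeholders, and any PGP encryption of the payload would happen before the upload.

```python
# Sketch: upload a file over SFTP with SSH-key authentication using paramiko.
# Host, username, and paths are placeholders.
import os
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # verify host keys properly in production
client.connect(
    "sftp.partner.example.com",
    username="acme_feed",
    key_filename=os.path.expanduser("~/.ssh/id_ed25519"),
)

sftp = client.open_sftp()
sftp.put("exports/orders_2024-10-31.csv.pgp", "/inbound/orders_2024-10-31.csv.pgp")
sftp.close()
client.close()
```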
VuTrinh. 259 implied HN points 13 Jul 24
  1. Kafka uses the operating system's filesystem to store data, which helps it run faster by leveraging the page cache. This avoids the need to keep too much data in memory, making it simpler to manage.
  2. The way Kafka reads and writes data is done in a sequential order, which is more efficient than random access. This design improves performance, as accessing data in a sequence reduces delays.
  3. Kafka groups messages together before sending them, which helps reduce the number of requests made to the system. This batching process improves performance by allowing larger, more efficient data transfers (see the producer-config sketch below).
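Batching is mostly a matter of producer configuration; here is a hedged confluent-kafka sketch where the broker address, topic, and values are placeholders.

```python
# Sketch: letting the producer accumulate messages into larger batches before sending.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "linger.ms": 50,            # wait up to 50 ms so a batch can fill up
    "batch.size": 262144,       # target batch size in bytes (256 KiB)
    "compression.type": "lz4",  # compress whole batches, not single messages
})

for i in range(10_000):
    producer.produce("page-views", value=f"event-{i}".encode())
producer.flush()
```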
Monthly Python Data Engineering 179 implied HN points 25 Jul 24
  1. The Python Data Engineering newsletter focuses on key updates and tools for building data engineering projects, rather than just data science.
  2. This month showcased rapid development in projects like Narwhals and Polars, with Narwhals making 26 releases and Polars reaching version 1.0.0.
  3. Several other libraries, such as Great Tables and Dask, also had important updates, making it a busy month for Python data engineering tools.
VuTrinh. 199 implied HN points 20 Jul 24
  1. Kafka producers are responsible for sending messages to servers. They prepare the messages, choose where to send them, and then actually send them to the Kafka brokers.
  2. There are different ways to send messages: fire-and-forget, synchronous, and asynchronous. Each method has its pros and cons, depending on whether you want speed or reliability.
  3. Producers can control message acknowledgment with the 'acks' parameter to determine when a message is considered successfully sent. This parameter affects data safety, with options ranging from no acknowledgment to full confirmation from all replicas (see the sketch below).
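A hedged confluent-kafka sketch of the acks setting and the asynchronous delivery callback; the broker, topic, and payload are placeholders.

```python
# Sketch: acks and send styles with confluent-kafka. Broker and topic are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",  # 0 = fire-and-forget, 1 = leader only, "all" = every in-sync replica
})

def on_delivery(err, msg):
    # Asynchronous send: this callback fires once the broker acknowledges (or rejects) the message.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}] at offset {msg.offset()}")

producer.produce("orders", key=b"order-42", value=b'{"amount": 99.5}', callback=on_delivery)

# Blocking until everything queued so far is acknowledged approximates a synchronous send.
producer.flush()
```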
The Data Jargon Newsletter 138 implied HN points 23 Aug 24
  1. If your data product isn't making money, it's really just an internal tool. It's important to focus on projects that add real value.
  2. Having a good Business Intelligence team can often bring more benefits than trying to make fancy data products. Simple tools can lead to effective data use.
  3. More data engineers can improve your data platform, but just adding analysts might not directly make your data team better. It's all about how the team fits with the organization.
SeattleDataGuy’s Newsletter 317 implied HN points 23 Oct 24
  1. Building your own data orchestration system can lead to many challenges, like handling dependencies and scheduling tasks correctly. It's important to think if it's really necessary or if existing tools will work better.
  2. A custom orchestrator needs to manage various functions like logging, alerting, and integrating with other tools. Without proper features, it can become complex and hard to maintain.
  3. Before you decide to create your own solution, consider what makes it different and better than what's already available. Make sure to also think about how you’ll get people to use your new system.
VuTrinh. 219 implied HN points 02 Jul 24
  1. PayPal operates a massive Kafka system with over 85 clusters and handles around 1.3 trillion messages daily. They manage data growth by using multiple geographical data centers for efficiency.
  2. To improve user experience and security, PayPal developed tools like the Kafka Config Service for easier broker management and added access control lists to restrict who can connect to their Kafka clusters.
  3. PayPal focuses on automation and monitoring, implementing systems to quickly patch vulnerabilities and manage topics, while also optimizing metrics to quickly identify issues with their Kafka platform.
VuTrinh. 319 implied HN points 08 Jun 24
  1. LinkedIn processes around 4 trillion events every day, using Apache Beam to unify their streaming and batch data processing. This helps them run pipelines more efficiently and saves development time (a minimal Beam sketch follows this list).
  2. By switching to Apache Beam, LinkedIn significantly improved their performance metrics. For example, one pipeline's processing time went from over 7 hours to just 25 minutes.
  3. Their anti-abuse systems became much faster with Beam, reducing the time taken to identify abusive actions from a day to just 5 minutes. This increase in efficiency greatly enhances user safety and experience.
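To show the unified model these takeaways refer to, here is a minimal, hedged Apache Beam sketch; the in-memory source and the default local runner are illustrative, and the same transforms could be pointed at a streaming source such as Kafka.

```python
# Sketch: a tiny Beam pipeline; the same code shape serves batch and streaming sources.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read events" >> beam.Create(["view", "click", "view", "purchase"])
        | "Pair with 1" >> beam.Map(lambda event: (event, 1))
        | "Count"       >> beam.CombinePerKey(sum)
        | "Print"       >> beam.Map(print)
    )
```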
VuTrinh. 119 implied HN points 27 Jul 24
  1. Kafka uses a pull model for consumers, allowing them to control the message retrieval rate. This helps consumers manage workloads without being overwhelmed.
  2. Consumer groups in Kafka let multiple consumers share the load of reading from topics, but each partition is only read by one consumer at a time for efficient processing (see the sketch after this list).
  3. Kafka handles rebalancing when consumers join or leave a group. This can be done eagerly, stopping all consumers, or cooperatively, allowing ongoing consumption from unaffected partitions.
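A hedged confluent-kafka sketch of one consumer-group worker; the broker, group, and topic names are placeholders. Running a second copy of this process with the same group.id would trigger a rebalance and split the partitions between the two.

```python
# Sketch: one worker in a consumer group, pulling messages at its own pace.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "billing-workers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(1.0)  # pull model: the consumer asks for data when ready
        if msg is None:
            continue
        if msg.error():
            print(msg.error())
            continue
        print(f"partition {msg.partition()} offset {msg.offset()}: {msg.value()}")
finally:
    consumer.close()
```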
VuTrinh. 339 implied HN points 25 May 24
  1. Twitter processes an incredible 400 billion events daily, using a mix of technologies for handling large data flows. They built special tools to ensure they can keep up with all this information in real-time.
  2. After facing challenges with their old setup, Twitter switched to a new architecture that simplified operations. This new system allows them to handle data much faster and more efficiently.
  3. With the new system, Twitter achieved lower latency and fewer errors in data processing. This means they can get more accurate results and better manage their resources than before.
Monthly Python Data Engineering 59 implied HN points 19 Aug 24
  1. Datafusion Comet was released, making it easier and faster to use Apache Spark for data processing, which is great for improving performance.
  2. Several major data tools like Datafusion, Arrow, and Dask updated their versions, showing ongoing improvements in speed, efficiency, and new features.
  3. New dashboard solutions like Panel and updates in libraries such as CUDF reflect the growing interest in making data access and visualization easier for users.