The hottest Data Quality Substack posts right now

And their main takeaways
Data Products · 3 implied HN points · 04 Dec 23
  1. Producers need to move toward consumer-defined data contracts to improve data quality and alignment with user needs (a minimal contract sketch follows this list).
  2. A phased approach of awareness, collaboration, and contract ownership helps in successful data contract adoption.
  3. Starting with consumer-defined contracts drives communication, awareness, and problem visibility, leading to long-term benefits.
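Not from the post itself, but a minimal sketch of what a consumer-defined contract can look like when it is expressed in code and checked against what the producer emits; the `orders` field names and types are hypothetical.

```python
# Illustrative only: a consumer-defined data contract expressed as code.
# The contract lives with the consumer and is checked against records the
# producer emits; field names and types here are made up for the example.
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FieldSpec:
    name: str
    dtype: type
    nullable: bool = False


# The consumer declares exactly the shape it depends on.
ORDERS_CONTRACT = [
    FieldSpec("order_id", str),
    FieldSpec("amount_usd", float),
    FieldSpec("placed_at", str),            # ISO-8601 timestamp expected
    FieldSpec("coupon_code", str, nullable=True),
]


def validate(record: dict[str, Any], contract: list[FieldSpec]) -> list[str]:
    """Return a list of violations; an empty list means the record honors the contract."""
    violations = []
    for spec in contract:
        if spec.name not in record:
            violations.append(f"missing field: {spec.name}")
            continue
        value = record[spec.name]
        if value is None:
            if not spec.nullable:
                violations.append(f"null not allowed: {spec.name}")
        elif not isinstance(value, spec.dtype):
            violations.append(
                f"{spec.name}: expected {spec.dtype.__name__}, got {type(value).__name__}"
            )
    return violations


if __name__ == "__main__":
    bad = {"order_id": "A-17", "amount_usd": "12.50", "coupon_code": None}
    print(validate(bad, ORDERS_CONTRACT))
    # ['amount_usd: expected float, got str', 'missing field: placed_at']
```

The point of the pattern is that the consumer, not the producer, declares the shape it depends on, so breakages surface as explicit, attributable violations rather than silent downstream data quality issues.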
Gradient Flow · 19 implied HN points · 03 Dec 20
  1. Adversarial attacks on NLP and computer vision models have been a growing concern, prompting research on generating adversarial examples and defenses against them.
  2. Tools like the SDV library from MIT can generate synthetic data for testing a range of applications beyond machine learning models (see the sketch after this list).
  3. Companies and startups are increasingly addressing the importance of high-quality data through projects like Apache Griffin and Deequ.
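The SDV takeaway above is easier to picture with a concrete call sequence. A minimal sketch, assuming the SDV 1.x single-table API (earlier releases used a different module layout, e.g. `sdv.tabular.GaussianCopula`); the sample table is made up.

```python
# Sketch: fit a synthesizer on a small real table, then sample synthetic rows.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real = pd.DataFrame({
    "age": [23, 35, 41, 29, 52],
    "plan": ["free", "pro", "pro", "free", "enterprise"],
    "monthly_spend": [0.0, 29.0, 29.0, 0.0, 499.0],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)          # infer column types from the sample

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real)

synthetic = synthesizer.sample(num_rows=100)  # statistically similar, not real rows
print(synthetic.head())
```

The synthetic rows preserve the marginal distributions and correlations of the original table, which is what makes them useful for testing pipelines without exposing real records.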
Data Products · 2 implied HN points · 27 Feb 24
  1. Chad Sanderson announced an upcoming O'Reilly book on Data Contracts, covering what data contracts are, how they work, implementation, examples, and future implications. The book will also delve into Data Quality and Governance.
  2. The first two chapters of the book are available for free on the O'Reilly website. They cover the importance of data contracts and the real goals of data quality initiatives, totaling about 45 pages of content.
  3. Chad Sanderson is currently selecting technical reviewers for the book. Interested individuals can reach out to him to share their thoughts on an advance copy.
East Wind · 2 HN points · 25 Oct 23
  1. The quality and percentage of human-generated data on the internet may have reached a peak, affecting the efficacy of future AI models.
  2. Models may face challenges with outdated training data and lack of relevant information for solving newer problems.
  3. Potential solutions include leveraging retrieval-augmented generation (RAG; sketched after this list), proactive data contribution by platform vendors, and maintaining incentives for human contributions on user-generated content platforms.
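As a rough illustration of the RAG idea in the last takeaway: retrieve the most relevant documents for a query and prepend them to the prompt, so the model works from fresh context rather than stale training data. Everything here (the corpus, the bag-of-words "embedding", the prompt template) is a toy stand-in, not anything from the post.

```python
# Toy retrieval-augmented generation: rank documents against the query,
# then build a context-augmented prompt for whatever LLM you use.
from collections import Counter
import math

CORPUS = [
    "Polars builds a lazy query plan and materializes it on collect().",
    "Data contracts define the schema a consumer depends on.",
    "Server-side tracking moves event collection off the browser.",
]


def embed(text: str) -> Counter:
    """Bag-of-words 'embedding'; a real system would use a learned encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(CORPUS, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query: str) -> str:
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return f"Answer using the context below.\nContext:\n{context}\n\nQuestion: {query}"


print(build_prompt("What do data contracts define?"))
# The resulting prompt is then sent to the generation model.
```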
timo's substack · 1 HN point · 16 May 23
  1. Take control of event data by implementing server-side tracking for better data quality and faster implementation.
  2. Incorporate the development team in tracking projects from the start to achieve more effective server-side tracking implementations.
  3. Consider the different places server-side tracking can live: close to the API layer, in a stream, in the database, in third-party applications, or in application code (a minimal API-layer sketch follows this list).
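A minimal sketch of the "close to the API layer" option: a WSGI middleware that emits one analytics event per request, so tracking happens server-side without client JavaScript. The event fields and the `emit()` destination are illustrative assumptions, not the author's implementation.

```python
# Sketch: server-side event tracking implemented as WSGI middleware.
import json
import time
from wsgiref.simple_server import make_server


def emit(event: dict) -> None:
    # In production this would publish to a stream (e.g. Kafka) or a collector;
    # here we just print the event.
    print(json.dumps(event))


class TrackingMiddleware:
    """Wraps any WSGI app and records one event per handled request."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        started = time.time()

        def tracking_start_response(status, headers, exc_info=None):
            emit({
                "event": "http_request",
                "path": environ.get("PATH_INFO", ""),
                "method": environ.get("REQUEST_METHOD", ""),
                "status": status.split(" ", 1)[0],
                "duration_ms": round((time.time() - started) * 1000, 2),
            })
            return start_response(status, headers, exc_info)

        return self.app(environ, tracking_start_response)


def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


if __name__ == "__main__":
    make_server("127.0.0.1", 8000, TrackingMiddleware(app)).serve_forever()
```

Because the event is built from the server's own request context, it is not affected by ad blockers or client-side script failures, which is the data quality argument the post makes for server-side tracking.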
Gradient Flow · 0 implied HN points · 08 Apr 21
  1. Data quality is essential for great AI products and services, which underscores the need for tools like Great Expectations for data validation and testing (see the sketch after this list).
  2. Demand for data engineers is rising, illustrated by funding announcements from Streamlit, Flatfile, and Snorkel.
  3. Exploitation of machine learning pickle files is a security concern; an open-source tool for reverse engineering and testing such files is discussed.
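To make the Great Expectations takeaway concrete, here is a minimal validation sketch using the legacy pandas-style convenience API (`ge.from_pandas`), which matches the era of the post; Great Expectations 1.x has since reorganized around data contexts, suites, and checkpoints, so treat the exact calls as version-dependent.

```python
# Sketch: declarative expectations checked against a small DataFrame
# (legacy Great Expectations pandas API; newer versions differ).
import pandas as pd
import great_expectations as ge

df = pd.DataFrame({
    "user_id": [1, 2, 3, None],
    "signup_date": ["2021-04-01", "2021-04-02", "not-a-date", "2021-04-03"],
    "plan_price": [0, 29, 29, -5],
})

ge_df = ge.from_pandas(df)

ge_df.expect_column_values_to_not_be_null("user_id")
ge_df.expect_column_values_to_match_strftime_format("signup_date", "%Y-%m-%d")
ge_df.expect_column_values_to_be_between("plan_price", min_value=0, max_value=1000)

results = ge_df.validate()
print(results)  # overall "success" is False: all three expectations are violated
```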
VuTrinh. · 0 implied HN points · 10 Oct 23
  1. Polars and pandas are both data-processing tools, but their performance characteristics differ; knowing when to use each helps when working with large datasets (see the sketch after this list).
  2. Data quality is crucial for successful data engineering. Companies like Google and Uber have strategies in place to ensure their data is accurate and reliable.
  3. Understanding SQL's logical execution order (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT) helps with data tasks, since it describes the steps a query goes through and is key to optimizing database interactions.
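A small side-by-side of the pandas-vs-Polars trade-off from the first takeaway: pandas computes eagerly in memory, while Polars builds a lazy query plan and materializes it on `collect()`. Column names and values are made up, and `group_by` is the spelling in recent Polars releases (older ones used `groupby`).

```python
# Sketch: the same filter + group-by aggregation in pandas and in Polars.
import pandas as pd
import polars as pl

data = {
    "country": ["DE", "DE", "US", "US"],
    "status": ["ok", "error", "ok", "ok"],
    "latency_ms": [120, 300, 95, 110],
}

# pandas: eager, everything in memory, computed step by step.
pdf = pd.DataFrame(data)
pandas_result = (
    pdf[pdf["status"] == "ok"]
    .groupby("country", as_index=False)["latency_ms"]
    .mean()
)

# Polars: a lazy query plan, optimized and materialized only on collect();
# with scan_csv/scan_parquet this also avoids loading the full file up front.
polars_result = (
    pl.DataFrame(data)
    .lazy()
    .filter(pl.col("status") == "ok")
    .group_by("country")
    .agg(pl.col("latency_ms").mean())
    .collect()
)

print(pandas_result)
print(polars_result)
```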