The hottest Data Quality Substack posts right now

When stakeholders surface data quality issues, managers must balance responsiveness against the risk of burnout.
Prioritize issues by determining urgency, impact, and potential solutions.
Communicate clearly with technical stakeholders, implement fixes cautiously, and maintain trust through thorough communication.

Producers need to move towards consumer-defined data contracts to improve data quality and alignment with user needs.
A phased approach of awareness, collaboration, and contract ownership helps in successful data contract adoption.
Starting with consumer-defined contracts drives communication, awareness, and problem visibility, leading to long-term benefits.
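As a rough illustration of what a consumer-defined contract might look like in practice, the sketch below lets a data consumer declare the fields it depends on and checks a producer's record against them. The `FieldRule` class, the `validate` function, and the `ORDERS_CONTRACT` fields are all hypothetical names chosen for this example. They do not come from the post or from any particular data contract tool.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class FieldRule:
    """One consumer expectation about a single field (hypothetical structure)."""
    name: str
    expected_type: type
    required: bool = True
    check: Optional[Callable[[Any], bool]] = None


def validate(record: dict, contract: list) -> list:
    """Return human-readable violations; an empty list means the record conforms."""
    violations = []
    for rule in contract:
        value = record.get(rule.name)
        if value is None:
            # Absent (or null) fields only matter when the consumer requires them.
            if rule.required:
                violations.append(f"{rule.name}: missing required field")
            continue
        if not isinstance(value, rule.expected_type):
            violations.append(
                f"{rule.name}: expected {rule.expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if rule.check is not None and not rule.check(value):
            violations.append(f"{rule.name}: failed value check")
    return violations


# Hypothetical contract written by the team that consumes an "orders" dataset.
ORDERS_CONTRACT = [
    FieldRule("order_id", str),
    FieldRule("amount_cents", int, check=lambda v: v >= 0),
    FieldRule("coupon_code", str, required=False),
]

good = {"order_id": "A-1", "amount_cents": 1250}
bad = {"order_id": "A-2", "amount_cents": -5, "coupon_code": 42}

print(validate(good, ORDERS_CONTRACT))
print(validate(bad, ORDERS_CONTRACT))
```

Returning a list of violations rather than raising on the first failure is a deliberate choice here: it makes every problem visible to the producer at once, which fits the awareness phase described above.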

Adversarial attacks on NLP and computer vision models are a growing concern, prompting research into generating adversarial examples and defences.
Tools like the SDV library from MIT can generate synthetic data for testing various applications beyond just machine learning models.
Companies and startups are increasingly addressing the importance of high-quality data through projects like Apache Griffin and Deequ.

Chad Sanderson announced an upcoming book on Data Contracts with O'Reilly, covering topics like what data contracts are, how they work, implementation, examples, and the future implications. The book will delve into Data Quality and Governance.
The first two chapters of the book are available for free on the O'Reilly website. They cover the importance of data contracts and the real goals of data quality initiatives, totaling about 45 pages of content.
Chad Sanderson is currently selecting technical reviewers for the book. Interested individuals can reach out to him to share their thoughts on an advance copy.

The quality and percentage of human-generated data on the internet may have reached a peak, affecting the efficacy of future AI models.
Models may face challenges with outdated training data and lack of relevant information for solving newer problems.
Potential solutions include leveraging RAG models, proactive data contribution by platform vendors, and maintaining incentives for human contributions on user-generated content platforms.

Models need to generate data by themselves for self-improvement, seen in examples like AlphaZero.
Models should adapt to new domains without requiring vast existing data, like the CLIP model.
Improving the efficiency of models, for example in autoregressive sampling, is crucial for advancing AI development.

Take control of event data by implementing server-side tracking for better data quality and faster implementation.
Incorporate the development team in tracking projects from the start to achieve more effective server-side tracking implementations.
Consider different strategies for implementing server-side tracking, such as close to the API layer, stream, database, third-party applications, or application code.
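To make the application-code strategy concrete, here is a minimal sketch in which the server builds and validates the tracking event at the point where the business action happens. The event name, the `REQUIRED_PROPERTIES` schema, the `build_event` and `complete_order` functions, and the in-memory `sink` list are all assumptions made for illustration. A real setup would send the event to a queue, stream, or analytics endpoint instead.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical event schema: event name -> property names that must be present.
REQUIRED_PROPERTIES = {
    "order_completed": {"order_id", "amount_cents"},
}


def build_event(name, user_id, properties):
    """Build a tracking event on the server, rejecting unknown or incomplete events."""
    if name not in REQUIRED_PROPERTIES:
        raise ValueError(f"unknown event: {name}")
    missing = REQUIRED_PROPERTIES[name] - set(properties)
    if missing:
        raise ValueError(f"{name} is missing properties: {sorted(missing)}")
    return {
        "event_id": str(uuid.uuid4()),  # lets downstream consumers de-duplicate
        "name": name,
        "user_id": user_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "properties": dict(properties),
    }


def complete_order(order, sink):
    """Hypothetical application code that emits the event next to the business logic."""
    # ... persisting the order would happen here ...
    event = build_event(
        "order_completed",
        order["user_id"],
        {"order_id": order["id"], "amount_cents": order["amount_cents"]},
    )
    sink.append(json.dumps(event))  # stand-in for a queue, stream, or HTTP call
    return event


sink = []
event = complete_order({"id": "A-1", "user_id": "u-7", "amount_cents": 1250}, sink)
print(event["name"], len(sink))
```

Because the event is assembled from the same values the server just processed, it cannot drift from the recorded order the way a separately maintained browser-side tag can.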

Healthy working agreements are essential for data quality.
Quicker feedback loops close to data sources improve data quality validation.
Celebrating the efforts of all team members contributes to maintaining clean data warehouses.

Data quality is essential for great AI products and services, which underscores the need for tools like Great Expectations for validation and testing.
There is a rising demand for data engineers, illustrated by the funding announcements of Streamlit, Flatfile, and Snorkel.
Exploits that target machine learning pickle files are a concern, and the post discusses an open source tool for reverse engineering and testing these files.

Polars and Pandas are tools for data processing, but they have different performance levels. Understanding when to use each can help manage large datasets better.
Data quality is crucial for successful data engineering. Companies like Google and Uber have strategies in place to ensure their data is accurate and reliable.
Learning SQL's execution order helps with data tasks. The order describes the steps a database takes to process a query, which is key to optimizing database interactions.
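One way to see execution order in action is to compare WHERE and HAVING. In the commonly described logical order (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY), WHERE filters individual rows before grouping, while HAVING filters whole groups after aggregation. The sketch below uses Python's built-in `sqlite3` module with a made-up `orders` table. The table and its data are assumptions for illustration, not taken from the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 50), ("ann", 70), ("bob", 200), ("bob", 10), ("cy", 90)],
)

# WHERE runs before GROUP BY: rows under 60 are dropped, then the rest are summed.
where_rows = conn.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "WHERE amount >= 60 GROUP BY customer ORDER BY customer"
).fetchall()

# HAVING runs after GROUP BY: all rows are summed, then groups under 100 are dropped.
having_rows = conn.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer HAVING SUM(amount) >= 100 ORDER BY customer"
).fetchall()

print(where_rows)   # [('ann', 70), ('bob', 200), ('cy', 90)]
print(having_rows)  # [('ann', 120), ('bob', 210)]
conn.close()
```

The same threshold logic gives different totals depending on which clause carries it, which is why knowing the order matters when a filter could plausibly go in either place.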