The hottest Data Quality Substack posts right now

And their main takeaways

Issue #16 - The Data Quality Conundrum (Part 2 – Solving)

The Data Ecosystem • 439 implied HN points • 28 Jul 24

Data quality isn't just a simple fix; it's a complex issue that requires a deep understanding of the entire data landscape. You can't just throw money at it and expect it to get better.
It's crucial to identify and prioritize your most important data assets instead of trying to fix everything at once. Focusing on what truly matters will help you allocate resources effectively.
Implementing tools for data quality is important but should come after you've set clear standards and strategies. Just using technology won’t solve problems if you don’t understand your data and its needs.

Issue #15 – The Data Quality Conundrum (Part 1 – Root Causes)

The Data Ecosystem • 399 implied HN points • 21 Jul 24

💼 Business Data Quality Management Analytics Governance Process Improvement

Poor data quality is a big problem for organizations, but it's often misunderstood. It's not just about fixing bad data; you need to figure out what's causing the issues.
Data quality has many aspects, like accuracy and completeness. Good data helps businesses make better decisions, while bad data can cost a lot of money.
To solve data quality issues, you need a complete approach that looks at different root causes. Simply fixing one part won't fix everything, and different sources might create new problems.

Does data quality matter?

benn.substack • 1099 implied HN points • 22 Nov 24

🕹 Technology Data Quality AI Models Software Development Business strategy Analytics

Data quality is important for making both strategic and operational decisions, as inaccurate data can lead to poor outcomes. Good data helps companies know what customers want and improve their services.
AI models can tolerate some bad data better than traditional methods because they average out inaccuracies. This means these models might not break as easily if some of the input data isn’t perfect.
Businesses now care more about AI than they used to about regular data reporting. This shift in focus might make data quality feel more important, even if it doesn’t technically impact AI model performance as much.

VC's Framework for Evaluating an AI Startup's Tech Stack. | VC Jobs

Venture Curator • 419 implied HN points • 06 Jun 24

🕹 Technology AI Startups Tech stack Data Quality Decision-making

The value proposition of AI companies now lies not just within models but predominantly in underpinning datasets, emphasizing the importance of data quality.
When evaluating AI startups, VCs use frameworks to assess data quality, considering relevance, accuracy, coverage, and bias in the datasets used to train the AI models.
To avoid investing in ineffectual AI startups, VCs focus on evaluating the processes behind data generation by asking questions about data automation, storage, access, processing, governance, and management.

Issue #12 – The Three Biggest Data Problems Companies Face

The Data Ecosystem • 239 implied HN points • 30 Jun 24

💼 Business Data Management Organizational Structure Process Improvement Data Quality Technology Integration

Companies often struggle with a data operating model that doesn't connect well with their other teams. This leads to isolation among data specialists, making it hard to work effectively.
Data models, which are important for understanding and using data correctly, are often overlooked. When organizations don’t reference these models, they can drift further away from their goals.
Many data quality issues come from deeper problems within the organization, like poor data governance and inconsistent processes. Fixing just the visible data quality issues won't solve the bigger problems.

OpenAI’s Got 9.9 Problems, and Twitch Ain’t One

Marcus on AI • 3833 implied HN points • 27 Jan 24

🕹 Technology AI Legal Challenges Profits Data Quality

Lawsuits against OpenAI are likely to increase due to copyright infringement issues.
Battles over copyrighted materials may hinder OpenAI's profits and lead to ethical concerns.
OpenAI faces challenges with profitability, public confidence, and regulatory scrutiny following its internal issues.

Are Data Contracts For Real?

Data Engineering Central • 294 implied HN points • 05 Feb 24

🕹 Technology Data Engineering Data Contracts APIs Data Quality Data Tools

Data Contracts may not be widely adopted in the data engineering community.
The idea behind Data Contracts is to enforce trustworthiness and consistency in data.
The challenge with Data Contracts seems to be their complexity and the need to adopt specific technologies.

How To Measure Data Quality - Issue 185

Data Analysis Journal • 235 implied HN points • 07 Feb 24

🕹 Technology Data science Analytics Data Quality Data Governance Metrics

Data quality metrics are essential for measuring data governance and analytics success.
There is no industry standard for defining poor-quality data; it varies based on context.
Having specific KPIs for data quality is crucial to scale data governance initiatives and improve the state of data quality.

Why Alerting is Key for enabling Generative AI and Machine Learning

The Orchestra Data Leadership Newsletter • 79 implied HN points • 23 Apr 24

🕹 Technology Data Operations AI Data Quality Data Pipelines

Alerting and governance are crucial for the success of Data and AI initiatives, as highlighted by the high failure rates of AI projects and Data Science projects not making it to production.
Building trust between Data Teams and Business Stakeholders is essential, and alerting plays a key role in this by ensuring effective communication and collaboration during data pipeline failures.
Effective alerting systems should be proactive, asset-based, and granular, allowing for quick detection and communication of issues to build trust and reliability in Data and AI products.

Navigating the Pitfalls of Data Projects

SeattleDataGuy’s Newsletter • 423 implied HN points • 02 Feb 24

🕹 Technology Project management Data Quality

Start your projects by planning backward from the end goal to identify dependencies and avoid delays.
Ensure there is clear ownership in the project to keep things moving and avoid stagnation.
Define clear completion criteria for tasks to avoid rework and keep the project progressing efficiently.

Lessons from customer evaluations of an AI product

The AI Frontier • 59 implied HN points • 18 Apr 24

🕹 Technology AI Product Development Data Quality Market Trends

Customers who have experience with AI products often have a better understanding of what to look for. They know what works and what doesn't, so they can more easily evaluate new AI tools.
The quality of data is super important for AI performance. If the data is good, the answers will be better, so paying attention to data quality is key.
Expectations around AI products can be tricky. Some people think AI is not useful, while others expect it to know everything. It's important to set clear expectations about what AI can do.

Will we ever have clean data?

benn.substack • 511 implied HN points • 28 Jul 23

🕹 Technology Data Quality Data Infrastructure Data Teams

Data quality involves a tradeoff between stability and agility.
Data resiliency tools like SDF focus on tracing data lineage to improve debugging and fixing issues.
Managing messy data often requires making choices between stability and adaptability in data infrastructure.

Shift-Left Analytics

SUP! Hubert’s Substack • 50 implied HN points • 22 Nov 24

🕹 Technology Data Analytics Real-Time Processing Machine Learning Data Quality Database Management

Shift-left analytics means doing analysis early in the data process. This helps in getting insights faster and making quick decisions.
It focuses on checking data quality right away, so only reliable data is used. This leads to more accurate insights and avoids problems caused by bad data.
Collaboration between teams is encouraged in this approach. By working together from the start, everyone can ensure their analyses are useful and aligned with business goals.

The State of Data Testing in 2024: Why We Still Have Broken Windows

Inside Data by Mikkel Dengsøe • 49 implied HN points • 18 Nov 24

🕹 Technology Data Quality Analytics Business Intelligence Software Development

Data teams are overwhelmed by too many alerts from test failures. This leads to important issues being overlooked.
It's crucial to focus on the right tests that have significant business impact rather than just mechanical tests. This means deeper insights into the data are needed.
Sharing the responsibility for data quality across teams can improve the situation. When everyone understands their role, issues are resolved faster.

Revisiting A/B tests in SaaS 💯

The SaaS Baton • 117 implied HN points • 26 Apr 23

🕹 Technology SaaS A/B Testing Data Quality Network Effects

Running A/B tests on SaaS products has unique challenges beyond just having enough users for statistically significant results.
Incorporating minimal clear constraints in projects can drive creativity and productivity, as seen in Buffer's Build Week.
Establishing indirect growth channels, like Gusto did with accounting firms, can create network effects and be a win-win for both parties.

Choosing a Data Quality Tool

Sarah's Newsletter • 359 implied HN points • 22 Feb 22

🕹 Technology Data Quality Data Tools

Data quality tools are essential for maintaining trust in data and preventing stakeholders from resorting to workaround solutions.
Choosing the right data quality tool involves understanding the specific needs of your organization and considering factors like budget, technical resources, and overall data quality goals.
There are different types of data quality tools available, including auto-profiling data tools, pipeline testing tools, infrastructure monitoring tools, and integrated solutions, each with unique characteristics and considerations for selection.

More than 30 unique tracking events will cause you problems

timo's substack • 78 implied HN points • 12 Feb 23

🕹 Technology Data Tracking Data Quality Data Analysis Documentation

Having more than 30 unique tracking events can lead to problems in data adoption and productivity.
Too many unique events also make data exploration harder and reduce analyst productivity.
Implementing a lean event approach with a focus on good event design and ownership can help prevent issues caused by high event volumes.
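The 30-event threshold above is easy to check against an existing event log. The following is a minimal sketch, not taken from the post: `audit_events` and its return shape are hypothetical, and treating events seen only once as candidates for consolidation is an assumption made purely for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

MAX_UNIQUE_EVENTS = 30  # threshold taken from the post's title


def audit_events(
    event_names: Iterable[str], limit: int = MAX_UNIQUE_EVENTS
) -> Tuple[int, bool, List[str]]:
    """Return (unique event count, over-limit flag, events seen only once)."""
    counts = Counter(event_names)
    # Assumption: events that fired only once are worth reviewing first.
    rare = sorted(name for name, n in counts.items() if n == 1)
    return len(counts), len(counts) > limit, rare


# Hypothetical event log for illustration only.
unique, over_limit, rare = audit_events(["signup", "login", "login"])
print(unique, over_limit, rare)
```

Run periodically against a tracking plan, a check like this could flag event sprawl before it starts to hurt analysts.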

Joe's Nerdy Rants #4

Joe Reis • 78 implied HN points • 10 Jun 23

🕹 Technology AI Data Management Data Quality

Encourage kids and others to interact more in real life, consider alternatives to college, find careers that can't be easily automated, and learn to coexist with AI.
Embrace lifelong learning and be open to change in order to adapt to evolving technologies and industries.
Read up on interesting articles about tech, AI, data, and business topics for insights and inspiration.

The Making of Whisper: An In-Depth Exploration of its Training Data and Process

Amgad’s Substack • 39 implied HN points • 22 Dec 23

🕹 Technology Data Collection Data Quality

OpenAI's Whisper ASR model stands out for its accuracy, made possible by releasing both its architecture and checkpoints under an open-source license, setting a new standard of innovation in the field.
The training of AI models can be divided into supervised and unsupervised approaches, each with its unique strengths and limitations, with significant implications for achieving high-quality results.
Data curation is a critical aspect of model training, with OpenAI showcasing the importance of maintaining data integrity through a meticulous process of automated filtering, manual inspection, and guarding against data leakage.

What is SaaS Observability?

Sarah's Newsletter • 139 implied HN points • 30 Aug 22

🕹 Technology SaaS Observability Data Quality Automation Business Operations

SaaS Observability sheds light on the health of all data and automations in SaaS tools.
Business teams should not need to rely on technical-heavy tools to ensure their systems are working correctly.
Poor data quality and anomalies in automations can significantly impact business operations, so they require constant monitoring.

The Froyo Data Shop

Sarah's Newsletter • 159 implied HN points • 22 Mar 22

🕹 Technology Data Analysis Data Quality Data Literacy

Self-service is about making choices with clear explanations and options.
Raw data without context can lead to misinterpretation and flawed analysis.
Data democratization needs testing, context building, and ongoing data literacy.

Who Cares if Big Data Is Dead!

Machine Learning for Developers • 39 implied HN points • 23 Feb 23

🕹 Technology Data Analytics Data Quality Data science Machine Learning Big Data

Data quality and the motives behind data analytics matter more than the size of the data.
Big data may not be as prevalent as believed, with most workloads processing only a small amount of data.
Too much data can lead to legal and privacy issues, making data quality paramount.

The (Not So Subtle) Art of Not Giving A Fuck About Data

Three Data Point Thursday • 39 implied HN points • 21 Sep 23

🕹 Technology Data Data Quality Data Teams

Don't focus too much on best practices and what other companies are doing in data.
Deal with challenges and adversity in data, but focus on doing the right thing.
Prioritize data quality when your company truly becomes data-driven.

The business-critical data warehouse

Inside Data by Mikkel Dengsøe • 41 implied HN points • 29 Jan 24

🕹 Technology AI ML Data Quality Data Management

The data warehouse market potential is growing significantly.
AI and ML are playing a major role in the evolution of data warehouses.
Teams are addressing the complexity of data stacks by focusing on data quality and treating data as a product.

Should Data Teams care about Data Contracts?

The Orchestra Data Leadership Newsletter • 19 implied HN points • 05 Nov 23

🕹 Technology Data Teams Data Contracts Data Engineering Software Engineering Data Quality

Consider data contracts if your internal data changes often to ensure collaboration between software engineering and data engineering teams.
If you have important metrics that depend on software engineering actions, like defining 'Active Users,' data contracts can help maintain data quality.
In cases where software engineering and data engineering roles overlap, implementing data contracts can streamline data ingestion processes and improve data quality.
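One way to picture a data contract is as an explicit, checkable schema agreed between the team producing the data and the team consuming it. The following is a minimal sketch under that assumption, not a description of any particular tool; `CONTRACT`, `violations`, and the field names are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical contract: field name -> expected Python type.
CONTRACT: Dict[str, type] = {
    "user_id": int,
    "event": str,
    "active": bool,
}


def violations(record: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """Return human-readable contract violations for one record."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


# A record where the producer changed user_id to a string and dropped a field.
print(violations({"user_id": "1", "event": "login"}, CONTRACT))
```

A producer team could run such a check before shipping a schema change, so that a breaking change surfaces before it reaches a downstream metric such as 'Active Users'.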

Why Snowflake’s Clone command changes the game for CI/CD in Data

The Orchestra Data Leadership Newsletter • 19 implied HN points • 26 Oct 23

🕹 Technology DataOps CI/CD Data Engineering Data Quality

The Snowflake Clone command allows for cheap and quick testing of data during Continuous Integration flows, showing significant cost and time improvement compared to traditional create table commands.
Continuous Deployment can be facilitated through Snowflake clones, as they are relatively inexpensive, removing cost barriers for Data Teams in implementing effective CD processes.
The Clone command, available since at least 2016, is invaluable for Data Teams as it enables CI/CD pipelines, essential for deploying data into production reliably and efficiently via a data release pipeline.

A Digital Cohort on Personalized Nutrition

Digital Epidemiology • 19 implied HN points • 02 Jun 23

🔬 Science Nutrition Data Quality Recruitment Cognitive Performance

The study focused on personalized nutrition with a digital cohort of 1,000+ participants tracking various data for glucose level management.
Developing a digital cohort requires intricate digital infrastructure and investment in user-friendly applications for high retention rates.
Data quality assessment is crucial for multi-modal data collection, and the study achieved high completion rates with a focus on improving nutrition tracking.

Gradient Flow #46: Smarter Language Models; Data Engineering Trends

Gradient Flow • 99 implied HN points • 04 Nov 21

🕹 Technology Data science Data Engineering Artificial Intelligence Machine Learning Data Quality

Data scientists should become social scientists as well as computer scientists.
The report presents insights from a global online survey of 372 respondents on data engineering trends and challenges.
Information on improvements in large language models, modernizing data integration, and the importance of data quality is shared in the podcast.

Data Quality

Data Thoughts • 39 implied HN points • 21 Jan 23

🕹 Technology Data Management Data Quality Data Analysis Data Integrity Information Systems

Data quality is all about how useful the data is for the specific task at hand. What is considered high quality in one situation might not be in another.
There are several key aspects of data quality, including accuracy, completeness, consistency, and uniqueness. Each of these factors helps to determine how reliable the data is.
Improving data quality involves preventing errors, detecting them when they occur, and repairing them. It's about making sure the data is accurate and useful over time.
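Two of the dimensions named above, completeness and uniqueness, can be expressed as simple ratios. This is a minimal sketch rather than anything from the post; the function names, the definition of "missing" (None or an empty string), and the sample rows are assumptions for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def completeness(rows: List[Row], field: str) -> float:
    """Share of rows where `field` is present and not None or empty."""
    if not rows:
        return 1.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows: List[Row], field: str) -> float:
    """Share of non-missing values of `field` that are distinct."""
    values = [r[field] for r in rows if r.get(field) not in (None, "")]
    if not values:
        return 1.0
    return len(set(values)) / len(values)


# Hypothetical sample data for illustration only.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "a@example.com"},
    {"id": 4, "email": "b@example.com"},
]
print(completeness(rows, "email"), uniqueness(rows, "email"))
```

Whether a given score is acceptable depends on the task, which matches the post's point that quality is relative to use.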

The Future of Vector Search

Gradient Flow • 2 HN points • 13 Jun 24

🕹 Technology AI Data Quality Data Transparency Open Source

When choosing a vector search system, focus on features like deployment scalability and performance efficiency to meet specific needs.
To ensure reliability and security, opt for systems that offer built-in embedding pipelines and integrate with data governance tools.
Prioritize data quality and transparency in AI applications, emphasizing reproducibility through sharing code, data, and detailed documentation.

Understanding Isn't Enough

Sarah's Newsletter • 59 implied HN points • 08 Feb 22

🕹 Technology Data Analytics Product Development Data Quality Process Optimization

Value in data products comes from taking action, not just providing information.
Vendors and data tools add significant value by influencing processes and saving time for users.
Analytics products should aim to change behaviors by answering critical questions, prioritizing effectively, and continuously refining to ensure effectiveness.

March, Etc.

Data People Etc. • 53 implied HN points • 15 Mar 23

🕹 Technology Data Modeling Data Management Data Quality

Intermediate data modeling that follows Kimball design principles can be valuable.
Attending events like Data Council can provide insights and networking opportunities.
Engaging in ongoing discussions and being part of a community can enhance the writing and learning experience.

Data & AI Strategy: Four Horsemen of Data Apocalypse

TeamCraft • 26 implied HN points • 18 Sep 23

🕹 Technology Data Strategy AI Data Quality Data Infrastructure Data Governance

Data quality is crucial for successful AI implementation.
Robust data infrastructure is needed to support effective data management.
Establishing a data governance framework and a positive data culture are essential for long-term success.

Sort Feature Overview: Better Data Through Collaboration

Database Engineering by Sort • 15 implied HN points • 01 Mar 24

🕹 Technology Data Quality Collaboration Database Management Workflow Transparency

Data quality is crucial for businesses as it influences customer experience, decision-making, and AI outcomes.
Collaboration is key for improving data quality, as automated tools can only address a portion of data issues.
Sort provides a platform for transparent collaboration on databases, allowing for public and private database sharing, issue tracking, and proposing and reviewing database changes.

Gradient Flow #42: Data Quality; Oscilloscope for Deep Learning; Feature Stores

Gradient Flow • 39 implied HN points • 26 Aug 21

🕹 Technology Data Quality Deep Learning Machine Learning Cloud Computing

Data quality is crucial in machine learning and new tools like feature stores are emerging to improve data management.
Experts are working on auditing machine learning models to address issues like discrimination and bias.
Large deep learning models such as Jurassic-1 Jumbo with 178B parameters are being made available for developers.

Improving data quality in online studies

UX Psychology • 19 implied HN points • 23 Nov 21

🔬 Science Data Quality Data Collection Quality Assurance

In online studies, factors like distractions, poor equipment, and cheating can impact data quality.
Engagement levels, accuracy, outliers, and speed of responses are key indicators to assess data quality in online studies.
Strategies like consistency measures, attention checks, bot detection, and serious response checks can help improve data quality in online studies.
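Two of the indicators above, attention checks and response speed, lend themselves to a simple screening pass. The following is a minimal sketch, not the article's method: `flag_low_quality`, the record fields, and the rule of flagging anyone faster than half the median completion time are all assumptions chosen for illustration.

```python
from statistics import median
from typing import Any, Dict, List


def flag_low_quality(
    responses: List[Dict[str, Any]], speed_fraction: float = 0.5
) -> List[str]:
    """Return ids of participants who failed the attention check or finished unusually fast."""
    if not responses:
        return []
    # Assumed heuristic: "too fast" means below a fraction of the median time.
    cutoff = speed_fraction * median(r["seconds"] for r in responses)
    return [
        r["id"]
        for r in responses
        if not r["attention_ok"] or r["seconds"] < cutoff
    ]


# Hypothetical responses for illustration only.
responses = [
    {"id": "p1", "seconds": 300, "attention_ok": True},
    {"id": "p2", "seconds": 60, "attention_ok": True},
    {"id": "p3", "seconds": 280, "attention_ok": False},
    {"id": "p4", "seconds": 320, "attention_ok": True},
]
print(flag_low_quality(responses))
```

Flagged responses would normally be reviewed rather than dropped automatically, since a fast completion is not always a careless one.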

Tackling Data's Biggest Culture Problem

Data Products • 8 implied HN points • 12 Sep 23

🕹 Technology Data Management Data Quality Communication Collaboration Change Management

Modern software teams often treat data as a by-product of their application.
Data teams take dependencies on upstream sources without a clear contract.
Data contracts are a way to bring data producers and data consumers together to manage constraints on data assets.

Why Data Quality Is More Important Than Ever in an AI-Driven World

Data Products • 5 implied HN points • 08 Jan 24

🕹 Technology Data Quality Artificial Intelligence Machine Learning Data Engineering Generative AI

Data quality is crucial for machine learning projects and can have negative impacts on both society and individuals.
Advances in Generative AI highlight the importance of high-quality data and the potential shortage of such data.
Data quality affects the machine learning product development cycle, including ongoing maintenance costs of ML pipelines.

The True Cost of Data Debt

Data Products • 5 implied HN points • 11 Oct 23

🕹 Technology Data Management Data Quality Data science Cloud Computing Data Governance

Data should be seen as an asset, not just a resource.
Data debt can lead to serious consequences like trust issues and organizational chaos.
Data developers need to focus on data quality tools like data contracts to prevent and manage data debt.

Gradient Flow #31: AI in Healthcare, Data Quality, Understanding Neural Networks

Gradient Flow • 19 implied HN points • 25 Mar 21

🕹 Technology AI Data Quality Neural Networks Machine Learning Data Tools

Podcast on Mathematics of Data Integration and Data Quality with Ryan Wisnesky from Conexus
Survey on AI and Machine Learning in Healthcare, Biotech, and Pharmaceutical industries
Various tools and infrastructure updates in Data & Machine Learning, like Apache Airflow and Evidently