Troubleshoot Data Quality

Troubleshoot Data Quality Issues

ML models rely on upstream data for training and making predictions. The data for these models is typically collected from multiple systems or vendors, or it may be owned by another team. This makes it challenging to ensure the data is always of high quality.

Because poor-quality data produces poor-performing models, use data quality monitors to detect shifts in upstream data and alert on the underlying changes.

How To Understand Data Quality Metrics

Data quality monitors typically feed other troubleshooting workflows, such as performance tracing and drift tracing, and point to problems along the model building and deployment pipeline.

Common Root Causes

Data quality metrics help narrow down the root cause of an issue. The most common causes of data quality issues in production are:

Vendor Data: Purchasing third-party data
Feature Generation: Creating, transforming, extracting, selecting features
Data Pipelines: Data engineering and training
Latency: Delays in your data pipeline

Data Quality Metrics and Their Common Root Causes

Count

An increase in count indicates duplicate data and an issue with your data pipeline. A decrease in count typically indicates latency issues or a broken data pipeline.
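The count reasoning above can be sketched in a few lines of Python. This is a minimal illustration, not a product API: the function names, the `tolerance` threshold, and the idea of keying duplicates on a prediction ID are all assumptions made for the example.

```python
from collections import Counter


def diagnose_count(baseline_count, current_count, tolerance=0.2):
    """Compare the current row count against a baseline row count.

    `tolerance` is a hypothetical relative threshold chosen for this
    sketch; it is not a documented default.
    """
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    change = (current_count - baseline_count) / baseline_count
    if change > tolerance:
        return "increase: check for duplicate rows"
    if change < -tolerance:
        return "decrease: check for latency or a broken pipeline"
    return "ok"


def duplicate_ids(prediction_ids):
    """Return the IDs that appear more than once, in sorted order."""
    counts = Counter(prediction_ids)
    return sorted(pid for pid, n in counts.items() if n > 1)
```

If the count rises, running something like `duplicate_ids` over the affected window is one way to confirm whether duplicates are the cause before digging into the pipeline itself.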

Percent Empty

The percent of nulls in your model features. For the list of strings data type, percent empty counts both empty lists and NULL values.

An increase in percent empty indicates either a data pipeline issue or a feature generation issue (for example, you create a feature but it is null in production).
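As a rough sketch of the definition above, percent empty could be computed as follows. The function name is hypothetical, and the example assumes NULL is represented as Python `None`; other null encodings (such as NaN) are not handled here.

```python
def percent_empty(values):
    """Percent of values that are None or, for list features, empty lists."""
    if not values:
        return 0.0
    empty = sum(
        1
        for v in values
        if v is None or (isinstance(v, list) and len(v) == 0)
    )
    return 100.0 * empty / len(values)
```

For a list-of-strings feature such as `[["a"], [], None, ["b", "c"]]`, both the empty list and the `None` count as empty, so two of the four rows are empty.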

Cardinality

An increase in missing values or new values typically indicates feature drift and a need to change your feature generation process.
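One way to picture the cardinality check is as a set comparison between a baseline and production. The sketch below assumes that "new values" are categories seen in production but not in the baseline, and "missing values" are categories seen in the baseline but absent from production; the function and key names are illustrative only.

```python
def cardinality_changes(baseline_values, production_values):
    """Compare the distinct values of a categorical feature across two datasets."""
    baseline = set(baseline_values)
    production = set(production_values)
    return {
        "baseline_cardinality": len(baseline),
        "production_cardinality": len(production),
        # Assumed meaning: present in production, absent from baseline.
        "new_values": sorted(production - baseline),
        # Assumed meaning: present in baseline, absent from production.
        "missing_values": sorted(baseline - production),
    }
```

A growing `new_values` list suggests the feature generation step is emitting categories the model never saw in training.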

Statistical Metrics

A change in statistical metrics indicates a change in the underlying data distribution.
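The source does not list which statistics are tracked, so the following is only a sketch under assumptions: it summarizes a numeric feature with mean, population standard deviation, min, and max, and expresses the production mean shift in units of the baseline standard deviation. All names are hypothetical.

```python
import statistics


def summarize(values):
    """Return basic summary statistics for a numeric feature."""
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def mean_shift_in_stdevs(baseline, production):
    """Shift of the production mean, measured in baseline standard deviations."""
    base = summarize(baseline)
    if base["stdev"] == 0:
        raise ValueError("baseline has zero variance")
    return (statistics.fmean(production) - base["mean"]) / base["stdev"]
```

A large absolute shift is a hint that the underlying distribution has moved; what counts as "large" depends on the feature and is left to the reader.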

Average List Length / Average Vector Length

This metric is the average list length across all rows and is available only for the list of strings data type.

Note: This metric omits empty lists and NULL values, because missing values are captured in the percent empty metric.
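The definition and note above can be sketched as follows, assuming NULL is represented as `None`. The function name is hypothetical, and returning `None` when no non-empty lists exist is a choice made for this example.

```python
def average_list_length(rows):
    """Average length of non-empty lists; empty lists and None are skipped."""
    lengths = [len(r) for r in rows if r is not None and len(r) > 0]
    if not lengths:
        return None  # No non-empty lists to average over.
    return sum(lengths) / len(lengths)
```

For `[["a", "b"], [], None, ["c", "d", "e", "f"]]`, only the lists of length 2 and 4 contribute, so the average is 3.0 rather than 1.5.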