Troubleshoot Data Quality Issues
ML models rely on upstream data for training and making predictions. The data for these models is typically collected from multiple systems or vendors, or it may be owned by another team. This makes it challenging to ensure the data is always of high quality.
Since poor-quality data results in poor-performing models, use data quality monitors to detect shifts in upstream data and alert you to underlying changes.
How To Understand Data Quality Metrics
Data quality monitors inform several troubleshooting avenues, such as performance tracing and drift tracing, and can point to problems along the model building and deployment pipeline.
Common Root Causes
Data quality metrics help you narrow down the root cause of an issue. The most common causes of data quality issues in production are:
Vendor Data: Purchasing third-party data
Feature Generation: Creating, transforming, extracting, selecting features
Data Pipelines: Data engineering and training
Latency: Delays in your data pipeline
Count
An increase in count typically indicates duplicate data and an issue with your data pipeline. A decrease in count typically indicates latency issues or a broken data pipeline.
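As an illustration of the count check, here is a minimal sketch that compares the current row count with a baseline and counts repeated keys. The function name `count_check`, the `prediction_id` key, and the 20% tolerance are assumptions made for this example, not part of any product API.

```python
from collections import Counter

def count_check(rows, baseline_count, key, tolerance=0.2):
    """Compare the current row count with a baseline and count duplicated keys.

    `tolerance` is a hypothetical relative threshold (20% by default).
    """
    key_counts = Counter(row[key] for row in rows)
    # Every occurrence of a key beyond the first is treated as a duplicate row.
    duplicates = sum(n - 1 for n in key_counts.values() if n > 1)
    change = (len(rows) - baseline_count) / baseline_count
    if change > tolerance:
        status = "increase"   # possible duplicate data
    elif change < -tolerance:
        status = "decrease"   # possible latency or a broken pipeline
    else:
        status = "ok"
    return {"count": len(rows), "change": change,
            "duplicates": duplicates, "status": status}

# Hypothetical production rows; "a2" and "a3" were written twice.
rows = [
    {"prediction_id": "a1"}, {"prediction_id": "a2"},
    {"prediction_id": "a2"}, {"prediction_id": "a3"},
    {"prediction_id": "a3"}, {"prediction_id": "a4"},
]
result = count_check(rows, baseline_count=4, key="prediction_id")
print(result["status"], result["duplicates"])
```

In this example the count rises from 4 to 6 and two rows repeat an existing key, which matches the duplicate-data pattern described above.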
Percent Empty
The percent of nulls in your model features. For the list of strings data type, percent empty counts both empty lists and NULL values.
An increase in percent empty indicates either a data pipeline issue or a feature generation issue (e.g., you create a feature but it's null in production).
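The percent empty definition can be sketched as follows. This is an illustrative implementation only; it assumes NULL is represented as Python `None`, and the function name `percent_empty` is hypothetical.

```python
def percent_empty(values):
    """Percent of values that are None or, for list features, an empty list."""
    if not values:
        return 0.0
    empty = sum(
        1 for v in values
        if v is None or (isinstance(v, list) and len(v) == 0)
    )
    return 100.0 * empty / len(values)

# Scalar feature: two of four values are NULL.
age = [34, None, 51, None]
# List-of-strings feature: one empty list and one NULL both count as empty.
tags = [["a", "b"], [], None, ["c"]]

age_pct = percent_empty(age)
tags_pct = percent_empty(tags)
print(age_pct, tags_pct)
```

Tracking this value per feature over time makes it easier to tell whether a single feature broke (feature generation) or many features went empty at once (data pipeline).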
Cardinality
An increase in missing values or new values typically indicates feature drift and a need to change your feature generation process.
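One way to inspect cardinality changes is to compare the set of distinct values in production with a baseline. The sketch below assumes "new values" means values seen in production but not in the baseline, and "missing values" means baseline values absent from production; the function name and sample data are hypothetical.

```python
def cardinality_diff(baseline_values, production_values):
    """Compare distinct non-null values between a baseline and production."""
    baseline = {v for v in baseline_values if v is not None}
    production = {v for v in production_values if v is not None}
    return {
        "baseline_cardinality": len(baseline),
        "production_cardinality": len(production),
        "new_values": sorted(production - baseline),
        "missing_values": sorted(baseline - production),
    }

# Hypothetical categorical feature: "TX" disappears, and "WA" arrives in
# two different casings, which hints at a feature generation change.
baseline_states = ["CA", "NY", "TX", "CA"]
production_states = ["CA", "NY", "WA", "wa"]

diff = cardinality_diff(baseline_states, production_states)
print(diff)
```

Listing the actual new and missing values, not just the counts, often points directly at the cause, such as inconsistent casing or a renamed category.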
Statistical Metrics
A change in statistical metrics indicates a change in the underlying data distribution.
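To make this concrete, the sketch below computes a few summary statistics for a numeric feature and expresses the shift in the production mean in units of the baseline standard deviation. The choice of statistics, the function names, and the sample values are assumptions for illustration; a real monitor may track different metrics.

```python
import statistics

def summarize(values):
    """Basic summary statistics for a numeric feature, ignoring None."""
    clean = [v for v in values if v is not None]
    return {
        "mean": statistics.fmean(clean),
        "std": statistics.pstdev(clean),   # population standard deviation
        "median": statistics.median(clean),
        "min": min(clean),
        "max": max(clean),
    }

def mean_shift_in_std(baseline, production):
    """Absolute shift of the production mean, in baseline standard deviations."""
    b, p = summarize(baseline), summarize(production)
    if b["std"] == 0:
        return 0.0 if p["mean"] == b["mean"] else float("inf")
    return abs(p["mean"] - b["mean"]) / b["std"]

# Hypothetical feature values before and after an upstream change.
baseline = [10, 12, 11, 9, 13]
production = [15, 17, 16, 14, 18]

shift = mean_shift_in_std(baseline, production)
print(round(shift, 2))
```

A large shift like this one suggests the distribution of the feature has moved, which is worth tracing back to the upstream source.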
Average List Length / Average Vector Length
This metric calculates the average of the list lengths across all rows and is available only for the list of strings data type.
Note: This metric omits empty lists or NULL values as missing values are captured in the percent empty metric.
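The calculation described above, including the omission of empty lists and NULL values, can be sketched as follows. The function name is hypothetical, and NULL is assumed to be represented as `None`.

```python
def average_list_length(values):
    """Average length of non-empty lists; None and [] are skipped."""
    lengths = [len(v) for v in values if v]  # falsy entries: None and []
    if not lengths:
        return None  # no non-empty lists to average
    return sum(lengths) / len(lengths)

# Three non-empty lists with lengths 2, 1, and 3; the empty list and the
# NULL value are left out of the average.
tags = [["a", "b"], [], None, ["c"], ["d", "e", "f"]]
avg = average_list_length(tags)
print(avg)
```

Because empty and NULL rows are excluded here, read this metric alongside percent empty to get the full picture for a list feature.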
To measure data consistency metrics, learn more here.