Troubleshoot Data Quality
Troubleshoot Data Quality Issues
ML models rely on upstream data for training and making predictions. The data for these models is typically collected from multiple systems or vendors, or it may be owned by another team. This makes it challenging to ensure the data is always of high quality.
Since poor-quality data results in poor-performing models, use data quality monitors to detect shifts in upstream data and alert on underlying changes.
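As a rough illustration of what such a monitor does, the sketch below compares a current metric value against a baseline and returns an alert message when the relative change exceeds a tolerance. The function name `check_monitor`, the `tolerance_pct` parameter, and the 10% default are hypothetical choices for this example, not part of any product API.

```python
def check_monitor(metric_name, baseline, current, tolerance_pct=10.0):
    """Return an alert string if `current` deviates from `baseline`
    by more than `tolerance_pct` percent; otherwise return None.

    Hypothetical helper for illustration only.
    """
    if baseline == 0:
        # Relative change is undefined at a zero baseline: treat any
        # non-zero current value as an unbounded change.
        change_pct = 0.0 if current == 0 else float("inf")
    else:
        change_pct = abs(current - baseline) / abs(baseline) * 100.0

    if change_pct > tolerance_pct:
        return (
            f"ALERT: {metric_name} changed by {change_pct:.1f}% "
            f"(baseline={baseline}, current={current})"
        )
    return None


# Example: a daily row count that drops from 1000 to 700 triggers an alert,
# while a small fluctuation to 1050 does not.
print(check_monitor("count", 1000, 700))
print(check_monitor("count", 1000, 1050))
```

A real monitor would usually derive the baseline from a reference window (for example, training data or a trailing production period) rather than a single number.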
How To Understand Data Quality Metrics
Data quality monitors typically inform various troubleshooting avenues, such as performance/drift tracing, and indicate problems along the model building and deployment pipeline.
Common Root Causes
Data quality metrics help narrow down root causes. The most common causes of data quality issues in production are:
Vendor Data: Purchasing 3rd party data
Feature Generation: Creating, transforming, extracting, selecting features
Data Pipelines: Data engineering and training
Latency: Delays in your data pipeline
Data Quality Metric | Common Root Cause |
---|---|
Count | An increase in count indicates duplicate data and an issue with your data pipeline. A decrease in count typically indicates latency issues or a broken data pipeline |
Percent Empty | The percent of nulls in your model features. Percent empty for An increase in the percent empty indicates either a data pipeline issue or a feature generation issue (i.e. you create a feature but it's null in production) |
Cardinality | An increase in missing values/new values indicates a need to change your feature generation process and typically indicates feature drift |
Statistical Metrics | A change in statistical metrics indicates a change in the underlying data distribution |
Average List Length / Average Vector Length | This metric calculates the average of all the list lengths from each row and is available only for the Note: This metric omits empty lists or NULL values as missing values are captured in the percent empty metric. |
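To make the metrics in the table concrete, here is a minimal sketch that computes them for one feature over a batch of rows using only the Python standard library. The function `data_quality_metrics` and the row layout (a list of dicts) are assumptions made for illustration, and this is not how any particular monitoring platform implements them. In this sketch only `None` counts as empty, and the average list length skips empty lists and `None` values, as described in the table.

```python
from statistics import mean


def data_quality_metrics(rows, feature):
    """Compute basic data quality metrics for `feature` over a list of dict rows.

    Hypothetical helper for illustration; assumes categorical values are hashable.
    """
    values = [row.get(feature) for row in rows]
    present = [v for v in values if v is not None]

    metrics = {
        "count": len(values),
        # Share of rows where the feature is null (None), as a percentage.
        "percent_empty": 100.0 * (len(values) - len(present)) / len(values)
        if values
        else 0.0,
    }

    if present and all(isinstance(v, list) for v in present):
        # List features: average length over non-empty lists only.
        non_empty = [v for v in present if v]
        metrics["avg_list_length"] = (
            mean(len(v) for v in non_empty) if non_empty else None
        )
    elif present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        # Numeric features: simple statistical metrics.
        metrics["mean"] = mean(present)
        metrics["min"] = min(present)
        metrics["max"] = max(present)
    else:
        # Categorical features: number of distinct non-null values.
        metrics["cardinality"] = len(set(present))

    return metrics


rows = [
    {"state": "CA", "age": 30, "tags": ["a", "b"]},
    {"state": "NY", "age": None, "tags": []},
    {"state": "CA", "age": 50, "tags": ["c"]},
    {"state": None, "age": 40, "tags": None},
]

for name in ("state", "age", "tags"):
    print(name, data_quality_metrics(rows, name))
```

Computing the same metrics on a baseline window and on recent production data, then comparing the two, is one way to spot the changes listed in the table.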
To measure data consistency metrics, learn more here.