Data Quality Troubleshooting
ML models rely on upstream data for training and making predictions. The data for these models is typically collected from multiple systems or vendors, or it may be owned by another team. This makes it challenging to ensure the data is always of high quality.
Since poor-quality data results in poor-performing models, use data quality monitors to detect shifts in upstream data and alert on underlying changes.
Data quality monitors typically feed other troubleshooting workflows, such as performance and drift tracing, and point to problems along the model building and deployment pipeline.
Data quality metrics help identify root causes. The most common causes of data quality issues in production are:
- Vendor Data: Purchasing third-party data
- Feature Generation: Creating, transforming, extracting, selecting features
- Data Pipelines: Data engineering and training
- Latency: Delays in your data pipeline
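To make the idea of a data quality monitor concrete, here is a minimal sketch in plain Python. It is not the API of any particular monitoring product. The function names (`missing_rate`, `unseen_categories`, `check_feature`), the row-as-dict layout, and the 5% threshold are all assumptions chosen for illustration. It compares a production batch against a baseline for one categorical feature and reports two common symptoms: a rise in missing values and category values never seen in the baseline.

```python
from typing import Any

Row = dict[str, Any]


def missing_rate(rows: list[Row], feature: str) -> float:
    """Fraction of rows where `feature` is absent or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(feature) is None)
    return missing / len(rows)


def unseen_categories(baseline: list[Row], production: list[Row], feature: str) -> set:
    """Values present in production but never observed in the baseline."""
    known = {row[feature] for row in baseline if row.get(feature) is not None}
    seen = {row[feature] for row in production if row.get(feature) is not None}
    return seen - known


def check_feature(
    baseline: list[Row],
    production: list[Row],
    feature: str,
    max_missing_increase: float = 0.05,  # hypothetical threshold, tune per feature
) -> list[str]:
    """Return human-readable alerts for one categorical feature."""
    alerts = []
    increase = missing_rate(production, feature) - missing_rate(baseline, feature)
    if increase > max_missing_increase:
        alerts.append(f"{feature}: missing rate rose by {increase:.0%}")
    new_values = unseen_categories(baseline, production, feature)
    if new_values:
        alerts.append(f"{feature}: unseen values {sorted(map(str, new_values))}")
    return alerts


# Illustrative data only: a vendor starts sending nulls and lowercase codes.
baseline = [{"state": "CA"}, {"state": "NY"}, {"state": "CA"}, {"state": "TX"}]
production = [{"state": "CA"}, {"state": None}, {"state": "ca"}, {"state": "NY"}]

for alert in check_feature(baseline, production, "state"):
    print(alert)
```

With this sample data the check reports both the jump in missing values and the unexpected `ca` value, the kind of change that might originate in vendor data or in a feature generation step. A real monitor would likely also cover numeric ranges, type mismatches, and data freshness for the latency case.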