Data Quality Troubleshooting
ML models rely on upstream data to train and make predictions. Data is commonly collected from multiple systems or vendors, or is owned by another team, making it difficult to ensure you always have high-quality data. Since poor-quality data can lead to poor-performing models, use data quality monitors to detect shifts in upstream data and alert you to underlying changes.
Data quality information typically informs various troubleshooting avenues, such as performance tracing and drift detection. Here are the common culprits of data quality issues.
Vendor data is a common cause of data quality issues. If you purchase data from a 3rd party, use data quality monitors to alert you when the data provided by your vendor changes.
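As a rough illustration of what a data quality monitor on vendor data might check, the sketch below compares a baseline batch with a newer batch for one feature. It does not use the Arize SDK; the function names, the `state` feature, and the 5% threshold are all assumptions made for the example.

```python
def quality_summary(rows, feature):
    """Summarise one feature: share of missing values and observed categories."""
    values = [row.get(feature) for row in rows]
    missing = sum(1 for v in values if v is None)
    return {
        "missing_rate": missing / len(values) if values else 0.0,
        "categories": {v for v in values if v is not None},
    }


def detect_changes(baseline_rows, current_rows, feature, max_missing_increase=0.05):
    """Return a list of alert names describing how the feature changed."""
    base = quality_summary(baseline_rows, feature)
    curr = quality_summary(current_rows, feature)
    alerts = []
    # Alert when the share of missing values grows beyond the allowed margin.
    if curr["missing_rate"] - base["missing_rate"] > max_missing_increase:
        alerts.append("missing_rate_increase")
    # Alert when the vendor starts sending values never seen in the baseline.
    if curr["categories"] - base["categories"]:
        alerts.append("new_categories")
    return alerts


# Hypothetical vendor batches: the newer one has a null and a new spelling.
baseline = [{"state": "CA"}, {"state": "NY"}, {"state": "CA"}, {"state": "TX"}]
current = [{"state": "CA"}, {"state": None}, {"state": "California"}, {"state": "NY"}]
alerts = detect_changes(baseline, current, "state")
print(alerts)
```

In practice the baseline would usually be a trailing window of vendor deliveries rather than a single fixed batch.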
To identify where data quality breaks down, we'll use the 'Data Consistency' tab to surface potential problems in feature generation (i.e., materializing features), data pipelines, and even latency when fetching from online stores.
Data Consistency allows you to monitor differences between your offline features and online features. This can be found under the 'Projects' tab.
Data Consistency Metrics across features
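To make the idea of an offline/online comparison concrete, here is a minimal sketch of a per-feature match rate computed over records that share a prediction ID. It is only an illustration of the concept, not how Arize computes its metrics; the data layout and names are assumptions.

```python
def feature_match_rates(offline, online):
    """Compute the share of matching values per feature.

    Both arguments map prediction_id -> {feature_name: value}.
    Only prediction IDs and features present on both sides are compared.
    """
    matches, totals = {}, {}
    for pid in offline.keys() & online.keys():
        for name, off_value in offline[pid].items():
            if name not in online[pid]:
                continue
            totals[name] = totals.get(name, 0) + 1
            if off_value == online[pid][name]:
                matches[name] = matches.get(name, 0) + 1
    return {name: matches.get(name, 0) / totals[name] for name in totals}


# Hypothetical records: "balance" differs for one of four predictions.
offline = {
    "p1": {"age": 30, "balance": 100.0},
    "p2": {"age": 41, "balance": 250.0},
    "p3": {"age": 25, "balance": 75.0},
    "p4": {"age": 52, "balance": 10.0},
}
online = {
    "p1": {"age": 30, "balance": 100.0},
    "p2": {"age": 41, "balance": 250.0},
    "p3": {"age": 25, "balance": 80.0},
    "p4": {"age": 52, "balance": 10.0},
}
rates = feature_match_rates(offline, online)
print(rates)
```

Exact equality is used here for simplicity; floating-point features would typically need a tolerance.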
To monitor data consistency, you don't have to change anything in your production (i.e., online features) workflow. You only need to log your offline features using arize.log_validation_records, then set up a match environment in the next step.
To set up the data consistency environment properly on Arize, make sure that all of the following conditions are met:
- 1. You sent in the same prediction_ids when logging to production and validation
- 2. You sent in the same prediction_timestamps when logging to production and validation
- 3. You sent all your offline data to the same
You can always add to the same offline environment by calling arize.log_validation_records with the same
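Before logging, it can help to verify the first two conditions locally. The sketch below assumes you can export the prediction IDs and timestamps you intend to log for each environment; the function name and dictionary layout are hypothetical, and nothing here calls the Arize SDK.

```python
def check_match_prerequisites(production, validation):
    """Compare what will be logged to production and validation.

    Each argument maps prediction_id -> prediction_timestamp (epoch seconds).
    Returns the prediction IDs that violate each condition.
    """
    prod_ids, val_ids = set(production), set(validation)
    return {
        "missing_in_validation": sorted(prod_ids - val_ids),
        "missing_in_production": sorted(val_ids - prod_ids),
        "timestamp_mismatch": sorted(
            pid for pid in prod_ids & val_ids if production[pid] != validation[pid]
        ),
    }


# Hypothetical exports from the online and offline pipelines.
production = {"a": 1700000000, "b": 1700000060, "c": 1700000120}
validation = {"a": 1700000000, "b": 1700000090, "d": 1700000180}
report = check_match_prerequisites(production, validation)
print(report)
```

An empty list under every key suggests the two environments line up on IDs and timestamps.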
First, create a new project containing the models that share the same match environment. For example, all models with the same feature schema (such as one instance of a model deployed per store, city, or state) can use the same Projects page.
Then, under Config, set the match environment to the batch_id you chose for measuring offline data consistency. In this example, we logged to
Data Consistency details may not show up immediately after the initial setup. If you have logged your data to Arize correctly, you should see visualizations for your match environment the next day.
By clicking on the mismatched features, you can see how the feature distributions differ between the offline and online environments using our heat map and distribution widgets.
In this particular example, you can see that the offline features seem to experience a one-sided delay, signifying potential latency problems.
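One way to probe a suspected one-sided delay outside the UI is to shift one series against the other and see which shift gives the best agreement. The sketch below assumes two time-ordered lists of values for a single feature, sampled at the same intervals; the function name and the sample values are made up for illustration.

```python
def best_lag(online, offline, max_lag=3):
    """Return (lag, match_rate) where offline[t] best matches online[t - lag].

    Both inputs are time-ordered values of one feature. A best lag above
    zero suggests the offline values trail the online ones by that many steps.
    """
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        # Pair online[i] with offline[i + lag].
        pairs = list(zip(online, offline[lag:]))
        if not pairs:
            continue
        rate = sum(1 for on, off in pairs if on == off) / len(pairs)
        if rate > best[1]:
            best = (lag, rate)
    return best


# Hypothetical series: the offline values repeat the online ones one step later.
online_values = [1, 2, 3, 4, 5, 6]
offline_values = [0, 1, 2, 3, 4, 5]
lag, rate = best_lag(online_values, offline_values)
print(lag, rate)
```

If no shift improves on a lag of zero, the mismatch is more likely a transformation or pipeline difference than a delay.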