Data Quality Troubleshooting

How To Troubleshoot Data Quality Issues

ML models rely on upstream data to train and make predictions. That data is commonly collected from multiple systems or vendors, or is owned by another team, which makes it difficult to ensure it is always high quality. Since poor-quality data can lead to poor-performing models, use data quality monitors to detect shifts in upstream data and alert you to underlying changes.
Data quality information typically informs various troubleshooting avenues, such as performance tracing and drift detection. Here are the common culprits of data quality issues.

Vendor Data

Vendor data is a common cause of data quality issues. If you purchase data from a third party, use data quality monitors to alert you when the data provided by your vendor changes.
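As an illustration of the kind of change such a monitor looks for, here is a minimal sketch that compares a new vendor batch against a baseline batch. It is not Arize's monitor implementation; the function names, the column-to-values layout, and the 5% threshold are all assumptions made for this example.

```python
def missing_rate(values):
    """Fraction of entries that are None (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def vendor_feed_alerts(baseline, current, max_missing_increase=0.05):
    """Compare a new vendor batch against a baseline batch, column by column.

    Both arguments map column name -> list of values (hypothetical layout).
    Returns a list of alert strings, empty when nothing notable changed.
    """
    alerts = []
    for column in sorted(set(baseline) | set(current)):
        if column not in current:
            alerts.append(f"{column}: column missing from new batch")
            continue
        if column not in baseline:
            alerts.append(f"{column}: unexpected new column")
            continue
        increase = missing_rate(current[column]) - missing_rate(baseline[column])
        if increase > max_missing_increase:
            alerts.append(f"{column}: missing rate rose by {increase:.0%}")
    return alerts


baseline = {"income": [50, 60, 70, 80], "zip": ["a", "b", "c", "d"]}
current = {"income": [50, None, None, 80], "region": ["x", "x", "x", "x"]}
for alert in vendor_feed_alerts(baseline, current):
    print(alert)
```

In practice a hosted monitor would track more signals (cardinality, type changes, value ranges); the sketch only shows schema changes and a rise in missing values.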

Data Consistency

To identify where data quality breaks down, use the 'Data Consistency' tab to surface potential problems in feature generation (i.e., materializing features), data pipelines, and even latency when fetching from online stores.
Data Consistency lets you monitor differences between your offline features and online features. It can be found under the 'Projects' tab.
Figure: Data Consistency metrics across features
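To make the idea of an offline/online comparison concrete, the sketch below computes a per-feature match rate over predictions that exist in both environments. This is an illustrative calculation only, not necessarily how Arize derives its metrics; the function name and the prediction_id-to-features dictionaries are assumptions.

```python
def feature_match_rates(offline, online):
    """Per-feature share of predictions whose offline and online values agree.

    Both arguments map prediction_id -> {feature_name: value}.
    Only prediction_ids present in both environments are compared.
    """
    shared_ids = offline.keys() & online.keys()
    matches, totals = {}, {}
    for pid in shared_ids:
        for feature, offline_value in offline[pid].items():
            if feature not in online[pid]:
                continue  # feature was not logged online for this prediction
            totals[feature] = totals.get(feature, 0) + 1
            if online[pid][feature] == offline_value:
                matches[feature] = matches.get(feature, 0) + 1
    return {f: matches.get(f, 0) / totals[f] for f in totals}


offline = {
    "p1": {"age": 30, "spend": 10.0},
    "p2": {"age": 41, "spend": 7.5},
}
online = {
    "p1": {"age": 30, "spend": 10.0},
    "p2": {"age": 41, "spend": 0.0},   # stale or default value served online
    "p3": {"age": 25, "spend": 3.0},   # no offline counterpart, so ignored
}
print(feature_match_rates(offline, online))
```

A feature with a match rate well below 1.0 is a candidate for the mismatch investigation described later on this page.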

Logging Offline Features

To monitor data consistency, you don't have to change anything in your production (i.e., online features) workflow. You only need to log your offline features using arize.log_validation_records, then set up a match environment in the next step.
To have the proper data consistency environment set up on Arize, make sure that all of the following are met:
  1. You sent in the same prediction_ids when logging to production and validation
  2. You sent in the same prediction_timestamps when logging to production and validation
  3. You sent all your offline data to the same batch_id, such as batch_id="offline"
You can always add to the same offline environment by calling arize.log_validation_records with the same batch_id.
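Before logging, the three requirements above can be checked locally. The following is a minimal sketch under assumed data shapes (a dictionary of production timestamps keyed by prediction_id, and a list of validation records); the helper name and record layout are hypothetical and not part of the Arize SDK.

```python
def find_alignment_problems(production, validation, batch_id="offline"):
    """Check offline (validation) records against what was logged to production.

    production: dict mapping prediction_id -> prediction_timestamp
    validation: list of dicts with keys prediction_id, prediction_timestamp, batch_id
    Returns a list of problem descriptions, empty when everything lines up.
    """
    problems = []
    for record in validation:
        pid = record["prediction_id"]
        if pid not in production:
            problems.append(f"{pid}: prediction_id not found in production")
        elif production[pid] != record["prediction_timestamp"]:
            problems.append(f"{pid}: prediction_timestamp differs from production")
        if record["batch_id"] != batch_id:
            problems.append(
                f"{pid}: batch_id is {record['batch_id']!r}, expected {batch_id!r}"
            )
    return problems


production = {"p1": 1650000000, "p2": 1650000060}
validation = [
    {"prediction_id": "p1", "prediction_timestamp": 1650000000, "batch_id": "offline"},
    {"prediction_id": "p2", "prediction_timestamp": 1650003600, "batch_id": "offline"},
    {"prediction_id": "p9", "prediction_timestamp": 1650000120, "batch_id": "backfill"},
]
for problem in find_alignment_problems(production, validation):
    print(problem)
```

Records that pass this check can then be sent with arize.log_validation_records as described above.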

Setting up Data Consistency on Arize

First, create a new project containing the models that share the same match environment. For example, all models with the same feature schema (such as one instance of a model deployed per store, city, or state) can use the same Projects page.
Then, under Config, set the match environment to the batch_id you chose for your offline data. In this example, we logged to offline.
Data Consistency details may not show up immediately after the initial setup. If you have logged your data to Arize correctly, you should see visualizations for your match environment the next day.

Troubleshooting Data Consistency

By clicking on a mismatched feature, you can see how its distribution differs between the offline and online environments using the heat map and distribution widgets.
In this particular example, you can see that the offline features seem to experience a one-sided delay, signifying potential latency problems.
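One way to check for this kind of pattern outside the UI is to look at the direction of the mismatches for a numeric feature. The sketch below treats a mismatch as "one-sided" when nearly all differing pairs deviate in the same direction; that reading of the term, the function name, and the 0.9 threshold are assumptions for illustration, not Arize's definition.

```python
def is_one_sided(pairs, threshold=0.9):
    """Return True if mismatched pairs almost always deviate in one direction.

    pairs: iterable of (offline_value, online_value) numeric pairs.
    threshold: minimum share of mismatches on the dominant side (assumed value).
    """
    pairs = list(pairs)  # allow generators, since we scan the data twice
    offline_lower = sum(1 for off, on in pairs if off < on)
    offline_higher = sum(1 for off, on in pairs if off > on)
    mismatched = offline_lower + offline_higher
    if mismatched == 0:
        return False  # nothing differs, so there is no skew to report
    return max(offline_lower, offline_higher) / mismatched >= threshold


# A running counter where one environment consistently lags behind the other.
lagging = [(3, 3), (5, 6), (8, 9), (2, 2)]
# Mismatches scattered in both directions, which points away from pure latency.
scattered = [(3, 4), (5, 4), (8, 9), (2, 1)]
print(is_one_sided(lagging), is_one_sided(scattered))
```

A consistently lagging value suggests a freshness or latency issue in one pipeline, whereas two-sided scatter is more likely a logic difference in feature generation.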