Data Quality Monitors

Data quality monitoring reference guide

When To Monitor Data Quality

High-quality data is fundamental to building reliable, accurate machine learning models, and poor data quality can significantly compromise the value of your predictions.
Easily root cause model issues by monitoring key data quality metrics to identify cardinality shifts, data type mismatches, missing data, and more.
🏃 Common Questions:
📜 How should I monitor if I'm concerned about data pipeline issues?
Your data pipeline may occasionally fail or inadvertently drop features. Use count and percent empty monitors to catch these issues.
🛍️ How should I monitor 3rd party/purchased data?
3rd party data is a common culprit of many model performance problems. Use data quality monitors to keep track of quantiles and sum or average values of your 3rd party data.
🚅 How should I monitor my features if I frequently retrain my model?
Every model retrain has the possibility of introducing inadvertent changes to features. Use data quality monitors to compare new values and missing values between your production and your training or validation datasets.
🚂 How should I monitor my pipeline of ground truth data?
Monitor your actuals with percent empty and count to capture any failures or errors in your ground truth pipeline.
🔔 My data quality alerts are too noisy/not noisy enough
Edit your threshold value above or below the default standard deviation value to temper your alerts.

Data Quality Metrics

Percent Empty
The percent of nulls in your model features. A high percentage can significantly influence model performance and a sudden increase in null values could indicate a problem in the data collection or preprocessing stages.
Cardinality (Count Distinct)
The cardinality of your categorical features. Changes in your feature cardinality could indicate a change in the feature pipeline, or a new or deprecated product feature that your model has not adapted to yet.
New Values
Count of new unique values that appear in production but not in the baseline. Use this to identify concept drift or changes in the data distribution over time. These new unique values may not have been accounted for during model training and could therefore lead to unreliable predictions.
Missing Values
Count of unique values that appear in the baseline but not in production. This can indicate changes in data generation processes or an issue with data collection in the production environment.
Quantiles (p99.9, p99, p95, p50)
Provide a detailed understanding of the underlying statistical properties of the data and its spread. Any significant shift in these quantiles could indicate a change in the data distribution and a need for retraining.
Sum
The sum of your numeric data over the evaluation window. Use this to detect anomalies or shifts in the data distribution. Significant changes in the sum might indicate data errors, outliers, or systemic changes in the process of generating the data.
Count
Traffic count of predictions, features, etc. Can be used with filters. Use this to ensure there aren't any unexpected surges or drops in traffic that could affect performance. It also provides insight into usage patterns for better resource management and planning.
Average
Average of your numeric data over the evaluation window. A change may indicate a systematic bias, a change in the data collection process, or an introduction of anomalies, any of which can adversely impact performance and signal that your model may need retraining.
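To make the metric definitions above concrete, here is a minimal sketch in plain Python of how each one could be computed over a single feature. It is illustrative only and is not how Arize computes these metrics internally: the function names, the treatment of `None` as null, and the nearest-rank quantile method are all assumptions chosen for the example.

```python
import math

def non_null(values):
    """Drop None entries (None stands in for a null/empty value here)."""
    return [v for v in values if v is not None]

def percent_empty(values):
    """Percentage of entries that are null."""
    if not values:
        return 0.0
    return 100.0 * sum(v is None for v in values) / len(values)

def cardinality(values):
    """Count of distinct non-null values."""
    return len(set(non_null(values)))

def new_values(production, baseline):
    """Distinct values seen in production but not in the baseline."""
    return set(non_null(production)) - set(non_null(baseline))

def missing_values(production, baseline):
    """Distinct values seen in the baseline but not in production."""
    return set(non_null(baseline)) - set(non_null(production))

def quantile(values, q):
    """Nearest-rank quantile of the non-null values, for 0 < q <= 1."""
    data = sorted(non_null(values))
    if not data:
        raise ValueError("no non-null values")
    rank = max(1, math.ceil(q * len(data)))
    return data[rank - 1]

# Hypothetical example data: a categorical and a numeric feature.
baseline_state = ["CA", "NY", "TX", "CA"]
production_state = ["CA", "WA", None, "NY", "WA"]
production_amount = [10.0, None, 30.0, 20.0, 40.0]

amounts = non_null(production_amount)
print("percent empty:", percent_empty(production_state))
print("cardinality:", cardinality(production_state))
print("new values:", new_values(production_state, baseline_state))
print("missing values:", missing_values(production_state, baseline_state))
print("p50:", quantile(production_amount, 0.50))
print("sum:", sum(amounts))
print("count:", len(production_amount))
print("average:", sum(amounts) / len(amounts))
```

In this example the new value "WA" and the missing value "TX" are the kind of change that might follow a retrain or a pipeline change.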

How To Monitor Data Quality

Step 1: Enable Data Quality Monitors

Monitor your data quality based on various metrics for your model use case.
You can enable managed data quality monitors automatically and tailor them to your needs or fully customize your data quality monitors.
Managed Monitors
Monitors configured by Arize with default settings for your threshold and evaluation window. These are meant to be simple to enable and understand, with sensible defaults.
Custom Monitors
Fully customizable monitors based on various dimensions such as features, tags, evaluation windows, baselines, etc.

Managed monitors are configured by Arize with default settings.

Using Managed Monitors

Use managed monitors if this is your first time monitoring your model, you want to try a new metric, or you want to simplify your setup workflow.
From the 'Setup Monitors' tab, enable the applicable data quality monitors based on various data quality metrics.
Managed data quality monitors will create a separate data quality monitor based on your desired metric across all applicable features.
Enabled monitors appear on the monitors listing page.

Using Custom Monitors

Since managed monitors create data quality monitors for all applicable features with default settings, use custom monitors if you want to monitor a specific feature, tag, or model dimension that matters the most to you.
From the 'Setup Monitors' or 'Monitor Listing' tab, click 'Create Custom Monitor' to get started.
From there, select the dimension category and dimension to monitor in Step 1: Define the Metric.

Step 2: Configure Evaluation Window

An evaluation window defines the period of time your metric is calculated on (e.g., the previous 30 days). Increase this window to smooth out spiky or seasonal data. Decrease it if you want your monitors to react faster to sudden changes.
A delay window is the gap between the evaluation time and the window of data used for the evaluation. It tells Arize how long to delay an evaluation. Change this if you have delayed actuals or predictions, so that you evaluate your model on the most up-to-date data.
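The relationship between the evaluation time, the delay window, and the evaluation window can be sketched as simple date arithmetic. This is an illustration under stated assumptions, not Arize's implementation: `evaluation_range` and `in_window` are hypothetical helper names, and treating the window as half-open (start included, end excluded) is an assumption made for the example.

```python
from datetime import datetime, timedelta

def evaluation_range(evaluation_time, window, delay=timedelta(0)):
    """Return (start, end) of the data used for one evaluation.

    The end of the range is pushed back from the evaluation time by the
    delay window; the start lies one evaluation window before that end.
    """
    end = evaluation_time - delay
    start = end - window
    return start, end

def in_window(timestamp, start, end):
    """True if a record's timestamp falls inside the range (assumed half-open)."""
    return start <= timestamp < end

# Example: a 72-hour evaluation window with a 24-hour delay, which might
# suit actuals that arrive up to a day after their predictions.
evaluated_at = datetime(2024, 1, 10, 12, 0)
start, end = evaluation_range(
    evaluated_at, window=timedelta(hours=72), delay=timedelta(hours=24)
)
print(start, "->", end)
```

With a delay of 0 the range ends at the evaluation time itself. A longer delay shifts the whole range into the past without changing its length.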
Managed monitors create monitors for all applicable features for a given metric with preset basic configurations. To change these settings, edit the details of the managed monitor for the metric you want to adjust. These settings apply to all managed monitors of the same type.

Managed Monitors Default Configurations:

  • Evaluation Window: 72 hours of production data
  • Delay Window: 0 hours
From the 'Monitors' tab, edit the monitor configurations in the 'Managed Data Quality Monitors' card.
Define the various settings that go into calculating and monitoring your metric. Within monitor settings, configure the evaluation window within Step 2: Define the Data.

Custom Monitor Dimensions

Setting name
Evaluation window
Default: last 24 hours
Increase this to smooth out spikes or seasonality. Decrease this to react faster to potential incidents.
Evaluation delay
Default: delayed by 0 seconds. This setting is the gap between the evaluation time and the window of data used for the evaluation. Use this if your predictions or actuals have an ingestion lag.
Model version
Filter your metric to only use certain model versions. This defaults to include all model versions.
You can filter using a variety of operators on any dimension in your model. The dimension can be a prediction, actuals, features, or tags.
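As a rough illustration of filtering a metric's input data by dimension, the sketch below applies a list of (dimension, operator, operand) filters to records held as dictionaries. Everything here is hypothetical: the operator names, the record layout, and the rule that a null value never matches a filter are assumptions for the example and do not describe the operators Arize actually supports.

```python
import operator

# Hypothetical operator table; the real set of supported operators may differ.
OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda value, options: value in options,
}

def matches(record, dimension, op, operand):
    """True if the record's value for the dimension satisfies the filter."""
    value = record.get(dimension)
    if value is None:
        return False  # assumption: null or absent values never match
    return OPERATORS[op](value, operand)

def apply_filters(records, filters):
    """Keep only the records that satisfy every filter."""
    return [r for r in records if all(matches(r, *f) for f in filters)]

# Hypothetical records with a model version, one feature, and a prediction.
records = [
    {"model_version": "v1", "state": "CA", "prediction": 0.91},
    {"model_version": "v2", "state": "NY", "prediction": 0.42},
    {"model_version": "v2", "state": "CA", "prediction": 0.77},
    {"model_version": "v2", "state": None, "prediction": 0.65},
]

v2_confident = apply_filters(
    records, [("model_version", "=", "v2"), ("prediction", ">", 0.5)]
)
v2_confident_west = apply_filters(
    records,
    [("model_version", "=", "v2"), ("prediction", ">", 0.5), ("state", "in", {"CA", "TX"})],
)
print(len(v2_confident), len(v2_confident_west))
```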

Step 3: Calibrate Alerting Threshold

Arize monitors trigger an alert when your monitor crosses a threshold. You can use our dynamic automatic threshold or create a custom threshold. Thresholds trigger notifications, so you can adjust your threshold to be more or less noisy depending on your needs.
Automatic Threshold
Automatic thresholds set a dynamic value for each data point. Arize generates an auto threshold when there are at least 14 days of production data to determine a trend.
Custom Threshold
Set the threshold to any value for additional flexibility. The threshold in the monitor preview will update as you change this value, so you can backtest the threshold value against your metric history.
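Backtesting a custom threshold amounts to asking which past metric values would have triggered an alert. The sketch below shows one possible way to do that outside the product. The function name `backtest_threshold`, the `trigger_when` parameter, and the sample history are hypothetical and chosen only for illustration.

```python
def backtest_threshold(history, threshold, trigger_when="above"):
    """Return the indices of historical metric values that cross the threshold.

    trigger_when="above" flags values greater than the threshold;
    trigger_when="below" flags values less than the threshold.
    """
    if trigger_when == "above":
        return [i for i, value in enumerate(history) if value > threshold]
    if trigger_when == "below":
        return [i for i, value in enumerate(history) if value < threshold]
    raise ValueError("trigger_when must be 'above' or 'below'")

# Hypothetical daily percent-empty values for one feature.
percent_empty_history = [1.0, 1.5, 0.8, 6.2, 1.1, 7.5, 0.9]

loose = backtest_threshold(percent_empty_history, threshold=5.0)
tight = backtest_threshold(percent_empty_history, threshold=1.2)
print("threshold 5.0 triggers on days:", loose)
print("threshold 1.2 triggers on days:", tight)
```

Comparing the two results shows the trade-off: the tighter threshold would also have alerted on a small fluctuation that the looser one ignores.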
Learn more here about how an auto threshold value is calculated.
Managed monitors create monitors for all applicable features for a given metric with an automatic threshold. If you've had issues in the past, we suggest you review the threshold to make sure it is relevant to your needs.

How To Edit Managed Monitor's Threshold In Bulk

To edit all of your managed monitor auto thresholds in bulk, adjust the number of standard deviations used in the calculation. You can find this setting in the 'Managed Data Quality Monitors' card in the 'Config' tab on the Monitors page. Changing it changes the tolerance of the existing automatic thresholds.
Note: this will override any individual managed data quality monitor auto threshold config, but will not change any manual thresholds configured for monitors.

How To Edit Managed Monitor's Threshold Per Monitor

Edit an individual managed monitor's threshold by referencing the 'Custom Monitor' tab.
Define the threshold value that will trigger an alert within Step 3: Define the Alerting.
This section allows you to:
  • Set a specific (custom) threshold if you already know the precise threshold value to use
  • Automatically create a dynamic threshold. You can edit your auto threshold sensitivity by changing the standard deviation number. Lowering the number of standard deviations will increase the sensitivity, and increasing the number of standard deviations will decrease the sensitivity.
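The effect of the standard deviation number on sensitivity can be illustrated with a simplified band of mean plus or minus k standard deviations. This is a sketch under assumptions, not the calculation Arize uses for automatic thresholds: it derives one static band from a fixed history, and the helper names and sample numbers are hypothetical.

```python
from statistics import mean, stdev

def std_dev_bounds(history, num_std):
    """Lower and upper bounds at num_std sample standard deviations from the mean."""
    center = mean(history)
    spread = stdev(history)
    return center - num_std * spread, center + num_std * spread

def count_alerts(history, new_points, num_std):
    """Count how many new points fall outside the band derived from history."""
    lower, upper = std_dev_bounds(history, num_std)
    return sum(1 for value in new_points if value < lower or value > upper)

# Hypothetical metric history and newly observed values.
history = [10, 12, 11, 9, 10, 11, 12, 9, 10, 11]
new_points = [10.8, 12.0, 13.5, 15.0, 7.0]

for k in (1, 2, 3):
    print(f"{k} std dev(s): {count_alerts(history, new_points, k)} alert(s)")
```

A narrower band (fewer standard deviations) flags more of the new points, and a wider band flags fewer. That is the trade-off to weigh when alerts are too noisy or not noisy enough.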

Step 4: Set Notifications

Your Monitor Status provides an indication of your model health. Your monitor will either be:
  • Healthy: Sit back and relax! No action is needed
  • No Data: When the monitor does not have recent data
  • Triggered: When your monitor crosses the threshold value, indicating a model issue
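The three statuses can be thought of as the result of a small decision rule. The sketch below is a hypothetical illustration of that logic only: the function name, the `trigger_when` parameter, and the use of `None` to represent a missing recent metric value are assumptions, not part of the product.

```python
def monitor_status(metric_value, threshold, trigger_when="above"):
    """Classify a monitor as 'No Data', 'Triggered', or 'Healthy'.

    metric_value is None when no recent data is available for the metric.
    """
    if trigger_when not in ("above", "below"):
        raise ValueError("trigger_when must be 'above' or 'below'")
    if metric_value is None:
        return "No Data"
    if trigger_when == "above":
        crossed = metric_value > threshold
    else:
        crossed = metric_value < threshold
    return "Triggered" if crossed else "Healthy"

# Example: a percent-empty monitor that alerts above 5%.
print(monitor_status(1.2, threshold=5.0))
print(monitor_status(7.5, threshold=5.0))
print(monitor_status(None, threshold=5.0))
```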
When a monitor is triggered, you are notified that your model has deviated from your threshold. You can send notifications via e-mail, PagerDuty, OpsGenie, or Slack. Learn more about notifications and integrations here.
All managed monitors will be set with the default configuration of 'No Contacts Selected'. To get the most out of Arize, set notifications so you are automatically notified when your monitor is triggered. You can edit notifications in bulk or edit notifications per monitor for enhanced customizability.

How To Set Managed Monitors Notifications In Bulk

Configure data quality monitor notifications for all managed monitors at once in the 'Config' tab on the Monitors page. This is an easy way to fully set up monitors in Arize.

How To Edit Managed Monitor's Notifications Per Monitor

Set notifications per monitor to limit notifications, change alerting providers, or add individual emails to the alert. Within each monitor, you can add a note and edit the monitor name to better suit naming conventions you may already have.
Edit an individual managed monitor's notification setting by referencing the 'Custom Monitor' tab.
Define where your alerts are sent in Step 4: Define the Alerting.
Setting name
Monitor Name
The monitor name is used to identify the monitor and will be used in the notification.
Send Notifications to
Choose your notification contacts. You can select multiple contacts to receive notifications. Learn more here.
Add notes to your monitor to help the alert recipient understand the monitor and quickly debug any issues.