ML Monitor Types

Learn about the different monitor options in Arize

Performance Monitor Metrics by Model Type

Model performance metrics measure how well your model performs in production. Monitor model performance with daily or hourly checks using an evaluation metric. Your model type determines your performance metric.

Model Type | Metrics

Classification

Accuracy, Recall, Precision, FPR, FNR, F1, Sensitivity, Specificity

Regression

MAPE, MAE, RMSE, MSE, R-Squared, Mean Error

Ranking

NDCG@k, AUC@k

Ranking Labels

MAP@k, MRR

AUC/LogLoss

AUC, PR-AUC, Log Loss

Computer Vision / Object Detection

Accuracy, MAP, IoU
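
As a rough illustration of how several of the classification metrics listed above relate to one another, the sketch below derives them from confusion-matrix counts. The function name `classification_metrics` and the sample labels are hypothetical, and this is not Arize's implementation; it only shows the standard definitions for a binary model.

```python
def classification_metrics(actuals, predictions, positive=1):
    """Compute common binary classification metrics from paired labels."""
    pairs = list(zip(actuals, predictions))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)

    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # recall is the same quantity as sensitivity
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical labels for one evaluation window.
actuals = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(actuals, predictions)
print(metrics)
```

Note that FPR equals 1 - Specificity and FNR equals 1 - Recall, so monitoring one of each pair is usually enough.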

Custom Metrics

Not seeing what you're looking for? Create a metric yourself!

Drift Monitor Metrics

Arize offers various distributional drift metrics to choose from when setting up a monitor. Each metric is tailored to a specific use case; refer to this guide to help choose the appropriate metric for various ML use cases.

Metric | Description

PSI (Population Stability Index)

A metric that is less influenced by sample size and produces fewer false positives than the Kolmogorov-Smirnov test or Earth Mover's Distance, making it suitable for datasets with expected fluctuations. However, PSI can be affected by the chosen binning strategy. PSI is also symmetric, so it behaves like a statistical 'distance' between the two distributions.

  • Sample size has less of an effect on PSI

  • Less sensitive, but will have fewer false positives when compared to KS or EMD (use PSI if you expect fluctuations in your data and don’t want too many false alarms)

  • Binning strategy can affect the calculation of PSI

  • Symmetric, so it behaves like a statistical ‘distance’

    • PSI(A -> B) == PSI(B -> A)
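
The two properties above (symmetry and dependence on binning) can be seen in a minimal sketch. It assumes the common definition PSI = sum over bins of (p - q) * ln(p / q), with empty bins floored at a small `eps`; the helper names, sample values, and bin edges are hypothetical, and Arize's own binning may differ.

```python
import math


def bin_proportions(values, edges, eps=1e-6):
    """Share of values per bin; eps keeps empty bins from producing log(0)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            upper_ok = v < edges[i + 1] or (is_last and v == edges[-1])
            if edges[i] <= v and upper_ok:
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [max(c / total, eps) for c in counts]


def psi(p, q):
    """Population Stability Index between two binned distributions."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical feature values.
baseline = [0.1, 0.2, 0.25, 0.4, 0.5, 0.55, 0.7, 0.8, 0.9, 0.95]
production = [0.3, 0.45, 0.5, 0.6, 0.65, 0.7, 0.8, 0.85, 0.9, 0.99]

coarse_edges = [0.0, 0.5, 1.0]
fine_edges = [0.0, 0.25, 0.5, 0.75, 1.0]

p_coarse = bin_proportions(baseline, coarse_edges)
q_coarse = bin_proportions(production, coarse_edges)
psi_coarse = psi(p_coarse, q_coarse)
psi_coarse_reversed = psi(q_coarse, p_coarse)  # same value: PSI is symmetric

psi_fine = psi(bin_proportions(baseline, fine_edges),
               bin_proportions(production, fine_edges))

print(psi_coarse, psi_coarse_reversed, psi_fine)
```

With the finer edges, one production bin is empty, which inflates the score considerably. This is the kind of binning sensitivity the bullets describe.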

Euclidean Distance

The Euclidean distance check determines whether the average centroid of the production data has moved away from that of the baseline group. For unstructured data types, learn more here.
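
A minimal sketch of the centroid comparison described above, assuming each row is a fixed-length numeric vector (for example, an embedding). The helper names and the toy vectors are hypothetical.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dims)]


def centroid_distance(baseline_vectors, production_vectors):
    """Euclidean distance between the baseline and production centroids."""
    return math.dist(centroid(baseline_vectors), centroid(production_vectors))


# Toy 2-D "embeddings": the production cluster is shifted away from baseline.
baseline_vectors = [[0, 0], [2, 0], [0, 2], [2, 2]]      # centroid (1, 1)
production_vectors = [[3, 4], [5, 4], [3, 6], [5, 6]]    # centroid (4, 5)

drift = centroid_distance(baseline_vectors, production_vectors)
print(drift)
```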

KL Divergence (Kullback-Leibler)

A metric that is less sensitive than others such as the Kolmogorov-Smirnov statistic, so it produces fewer false positives and suits datasets with expected fluctuations. Its calculation can be influenced by the chosen binning strategy, but it is less affected by sample size. Unlike PSI, KL divergence is non-symmetric: the divergence from dataset A to B is not the same as from B to A.

  • Less sensitive than other metrics (such as KS statistic) and will have fewer False positives when compared to KS

  • Use KL if you expect fluctuations in your data

  • Sample size has less of an effect on KL

  • Binning Strategy can affect results

  • The non-symmetric version of PSI

    • KL(A -> B) != KL(B->A)
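
The asymmetry, and the relationship to PSI, can be checked with a short sketch. It assumes already-binned distributions with no empty bins in `q`, and uses the common identity that PSI equals KL(A -> B) + KL(B -> A); the distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p -> q) in nats; assumes q has no zero entries where p is non-zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def psi(p, q):
    """PSI under the common definition: sum of (p - q) * ln(p / q)."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical binned distributions (each sums to 1).
a = [0.7, 0.2, 0.1]
b = [0.4, 0.4, 0.2]

kl_ab = kl_divergence(a, b)
kl_ba = kl_divergence(b, a)

print(kl_ab, kl_ba)            # the two directions differ
print(psi(a, b), kl_ab + kl_ba)  # PSI matches the sum of both directions
```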

JS Distance (Jensen-Shannon)

Similar to Kullback-Leibler divergence, with two distinct advantages: it is always finite and it is symmetric. It offers an interpretable score ranging from 0, indicating identical distributions, to 1, indicating completely different distributions with no overlap. It is moderately sensitive relative to PSI and KL and less sensitive than KS, and its results can still be influenced by the chosen binning strategy.

  • Similar to KL except in two areas: JS is always finite and symmetric

  • Interpretable from 0 --> 1 (PSI doesn't have this property as it's evaluated from 0 --> infinity)

  • 0 = identical distributions

  • 1 = completely different with no overlap

  • Mildly sensitive compared to PSI and KL, but not as sensitive as KS

  • Binning strategy can affect results
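
The 0-to-1 range can be illustrated with a small sketch. It assumes base-2 logarithms (which is what bounds the score at 1) and takes the square root of the JS divergence to obtain a distance; whether Arize uses exactly this variant is an assumption, and the helper names are hypothetical.

```python
import math


def kl_bits(p, q):
    """KL divergence in bits; terms with p == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)


def js_distance(p, q):
    """Square root of the JS divergence; clamped to guard against rounding."""
    return math.sqrt(max(0.0, js_divergence(p, q)))


identical = js_distance([0.5, 0.5], [0.5, 0.5])   # no drift
disjoint = js_distance([1.0, 0.0], [0.0, 1.0])    # no overlap at all
partial = js_distance([0.7, 0.3], [0.4, 0.6])     # somewhere in between

print(identical, partial, disjoint)
```

Because each distribution is compared to the midpoint `m`, which is non-zero wherever `p` or `q` is non-zero, the score stays finite even when one distribution has empty bins.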

KS Statistic (Kolmogorov-Smirnov)

A non-parametric metric that requires neither assumptions about the underlying data nor binning for its calculation, making it a sensitive tool for detecting drift. A smaller p-value from KS signifies more confident drift detection. KS can detect even slight differences in data distribution, and it becomes more sensitive as datasets grow, so it may also produce more false positives.

  • Non-parametric, so it doesn't make assumptions about the underlying data

  • It doesn't require binning to calculate, so binning strategy doesn't affect this metric

  • A smaller P-value means more confident drift detection

  • KS Statistic returns P-value

  • KS is the most sensitive metric among all the drift metrics

  • Larger datasets make KS increasingly more sensitive

  • Will produce more false positives

  • Detects very slight differences
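
To make the sample-size effect concrete, here is a stdlib-only sketch of the two-sample KS statistic with an approximate p-value. The p-value uses the standard asymptotic Kolmogorov series with a small-sample correction; this is an assumption for illustration, not necessarily how Arize computes it, and the function names are hypothetical.

```python
import math


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs (no binning needed)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_p_value(d, n, m, terms=100):
    """Approximate two-sided p-value from the asymptotic Kolmogorov series."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        return 1.0  # the series is numerically unreliable here; p is ~1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# The same CDF gap (D = 0.1) at two different sample sizes.
p_small = ks_p_value(0.1, 100, 100)
p_large = ks_p_value(0.1, 10_000, 10_000)
print(p_small, p_large)
```

The same gap of 0.1 is unremarkable with 100 rows per window but yields a p-value near zero with 10,000 rows, which is why KS tends to raise more alerts on high-volume models.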

Data Quality Monitor Metrics

Metric | Description

Percent Empty

The percent of nulls in your model features. Percent empty for a list of strings will count both empty lists and NULL values. A high percentage can significantly influence model performance, and a sudden increase in null values could indicate a problem in the data collection or preprocessing stages.

Cardinality (Count Distinct)

The cardinality of your categorical features. Changes in your feature cardinality could indicate a change in the feature pipeline, or a new or deprecated product feature that your model has not adapted to yet.

New Values

Count of new unique values that appear in production but not in the baseline. Use it to identify concept drift or changes in the data distribution over time. These new values may not have been accounted for during model training and could therefore lead to unreliable predictions.

Missing Values

Count of unique values that appear in the baseline but not in production. Can indicate changes in data generation processes or an issue with data collection in the production environment.

Quantiles (p99.9, p99, p95, p50)

Provide a detailed understanding of the underlying statistical properties of the data and its spread. Any significant shift in these quantiles could indicate a change in the data distribution and a need for retraining.

Sum

The sum of your numeric data over the evaluation window. Detect anomalies or shifts in the data distribution. Significant changes in the sum might indicate data errors, outliers, or systemic changes in the process of generating the data.

Count

Traffic count of predictions, features, etc. Can be used with filters. Use it to ensure there aren't any unexpected surges or drops in traffic that could affect performance, and to gain insight into usage patterns for better resource management and planning.

Average

Average of your numeric data over the evaluation window. A shift may indicate a systematic bias, a change in the data collection process, or the introduction of anomalies, any of which can adversely impact model performance and signal that your model may need retraining.

Average List Length / Average Vector Length

This metric calculates the average of the list lengths across rows and is available only for the list-of-strings data type.

Note: This metric omits empty lists or NULL values as missing values are captured in the percent empty metric.
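
Several of the data quality metrics above reduce to simple counting. The sketch below shows one plausible reading of Percent Empty, Cardinality, New Values, Missing Values, and Average List Length; all function names and sample data are hypothetical, and Arize's exact handling of edge cases may differ.

```python
def percent_empty(values):
    """Percent of rows that are NULL (None) or an empty list."""
    empty = sum(1 for v in values if v is None or v == [])
    return 100.0 * empty / len(values)


def cardinality(values):
    """Count of distinct non-null values."""
    return len({v for v in values if v is not None})


def new_values(baseline, production):
    """Unique values seen in production but not in the baseline."""
    return set(production) - set(baseline)


def missing_values(baseline, production):
    """Unique values seen in the baseline but not in production."""
    return set(baseline) - set(production)


def average_list_length(lists):
    """Mean list length, skipping empty lists and NULLs as the note describes."""
    non_empty = [item for item in lists if item]
    if not non_empty:
        return 0.0
    return sum(len(item) for item in non_empty) / len(non_empty)


# Hypothetical categorical feature in each environment.
baseline_states = ["CA", "NY", "TX", "CA"]
production_states = ["CA", "NY", "WA", "WA"]

# Hypothetical list-of-strings feature.
tags = [["a", "b"], [], None, ["c"]]

print(cardinality(production_states))                    # distinct states
print(new_values(baseline_states, production_states))    # in production only
print(missing_values(baseline_states, production_states))  # in baseline only
print(percent_empty(tags), average_list_length(tags))
```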

Custom Metrics

Couldn't find your metric above? Arize supports the ability to monitor custom metrics using SQL. Here is an example of a custom metric for the percent of a loan that is outstanding:

SELECT
SUM(loan_amount - repayment_amount) / SUM(loan_amount)
FROM model
WHERE state = 'CA'
AND loan_amount > 1000
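
To sanity-check what the query above computes, the same logic can be reproduced on a few in-memory rows. This is only an illustrative sketch: the rows are made up, the column names are taken from the query, and the function name `outstanding_share` is hypothetical.

```python
# Hypothetical rows standing in for the model's inference data.
rows = [
    {"state": "CA", "loan_amount": 5000.0, "repayment_amount": 2000.0},
    {"state": "CA", "loan_amount": 3000.0, "repayment_amount": 3000.0},
    {"state": "CA", "loan_amount": 800.0, "repayment_amount": 0.0},     # filtered: loan_amount <= 1000
    {"state": "NY", "loan_amount": 9000.0, "repayment_amount": 1000.0},  # filtered: state != 'CA'
]


def outstanding_share(rows):
    """Mirror of the SQL: SUM(loan - repayment) / SUM(loan) over filtered rows."""
    selected = [r for r in rows
                if r["state"] == "CA" and r["loan_amount"] > 1000]
    total_loaned = sum(r["loan_amount"] for r in selected)
    if total_loaned == 0:
        return None  # no qualifying rows, so the ratio is undefined
    outstanding = sum(r["loan_amount"] - r["repayment_amount"] for r in selected)
    return outstanding / total_loaned


share = outstanding_share(rows)
print(share)
```

As written, the expression yields a fraction between 0 and 1; multiply by 100 if the monitor threshold should be expressed as a percentage.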

Learn how to create custom metrics here.

Copyright © 2023 Arize AI, Inc