ML Monitor Types
Learn about the different monitor options in Arize
Performance Monitor Metrics by Model Type
Model performance metrics measure how well your model performs in production. Monitor model performance with daily or hourly checks using an evaluation metric. Your model type determines your performance metric.
Classification
Accuracy, Recall, Precision, FPR, FNR, F1, Sensitivity, Specificity
Regression
MAPE, MAE, RMSE, MSE, R-Squared, Mean Error
Ranking
NDCG@k, AUC@k
Ranking Labels
MAP@k, MRR
AUC/LogLoss
AUC, PR-AUC, Log Loss
Computer Vision / Object Detection
Accuracy, MAP, IoU
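As a rough illustration of how several of the classification metrics listed above relate to one another, the sketch below derives them from confusion-matrix counts. The function name `classification_metrics` and the sample labels are hypothetical, and this is not a description of how Arize computes these values internally.

```python
def classification_metrics(actuals, predictions, positive=1):
    """Compute common binary classification metrics from paired labels."""
    pairs = list(zip(actuals, predictions))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)

    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,  # also called sensitivity
        "specificity": ratio(tn, tn + fp),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical actual labels and predictions for six rows.
metrics = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

Note that FPR is 1 minus specificity and FNR is 1 minus recall, so a monitor on one of each pair is usually sufficient.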
Custom Metrics
Not seeing what you're looking for? Create a metric yourself!
Drift Monitor Metrics
Arize offers various distributional drift metrics to choose from when setting up a monitor. Each metric is tailored to a specific use case; refer to this guide to help choose the appropriate metric for various ML use cases.
PSI (Population Stability Index)
A metric that is less influenced by sample size and produces fewer false positives than the Kolmogorov-Smirnov test or Earth Mover's Distance, making it suitable for datasets with expected fluctuations. However, PSI can be affected by the chosen binning strategy. PSI is also symmetric, a property it shares with true statistical distances.
Sample size has less of an effect on PSI
Less sensitive than KS or EMD, so it produces fewer false positives (use PSI if you expect fluctuations in your data and don't want too many false alarms)
Binning Strategy can affect the calculation of PSI
Symmetric, like a true statistical 'distance':
PSI(A -> B) == PSI(B -> A)
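The PSI properties above (symmetry, and dependence on binning) can be checked with a small sketch. The helper names `bin_fractions` and `psi`, the bin edges, and the `eps` floor for empty bins are all assumptions made for illustration; Arize's own binning and smoothing may differ.

```python
import math


def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values per bin; eps keeps empty bins from producing log(0).

    Values outside the edges are ignored in this sketch.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [max(c / total, eps) for c in counts]


def psi(a, b, edges):
    """Population Stability Index between two samples over shared bins."""
    pa, pb = bin_fractions(a, edges), bin_fractions(b, edges)
    return sum((x - y) * math.log(x / y) for x, y in zip(pa, pb))


# Hypothetical baseline and production samples of one numeric feature.
baseline = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
production = [2, 3, 4, 4, 5, 5, 5, 6, 6, 7]

fine_bins = psi(baseline, production, [0, 2, 4, 6, 8])
coarse_bins = psi(baseline, production, [0, 4, 8])  # same data, different bins
swapped = psi(production, baseline, [0, 2, 4, 6, 8])
```

Here `fine_bins` and `swapped` agree, which is the symmetry property, while `coarse_bins` gives a different score for the same data, which is the binning sensitivity.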
Euclidean Distance
The Euclidean distance check determines whether the average centroid of the production data has moved away from that of the baseline group. For unstructured data types, learn more here.
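A minimal sketch of the centroid comparison described above, assuming each row is an embedding vector of equal length. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def centroid_distance(baseline, production):
    """Euclidean distance between the two groups' centroids."""
    return math.dist(centroid(baseline), centroid(production))


# Hypothetical 2-D embeddings; real embeddings have many more dimensions.
baseline_embeddings = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
production_embeddings = [[3.0, 4.0], [5.0, 4.0], [4.0, 7.0]]

drift = centroid_distance(baseline_embeddings, production_embeddings)
```

In this example the baseline centroid is (1, 1) and the production centroid is (4, 5), so the distance is 5.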
KL Divergence
A metric that's less sensitive than others such as the Kolmogorov-Smirnov statistic, so it produces fewer false positives and is appropriate for datasets with expected fluctuations. While its calculation can be influenced by the chosen binning strategy, it's less affected by sample size. Unlike PSI, KL divergence is non-symmetric, meaning the divergence from dataset A to B is not the same as from B to A.
Less sensitive than other metrics (such as the KS statistic), so it produces fewer false positives
Use KL if you expect fluctuations in your data
Sample size has less of an effect on KL
Binning Strategy can affect results
The non-symmetric version of PSI
KL(A -> B) != KL(B -> A)
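The asymmetry of KL divergence, and its relationship to PSI, can be seen on two small binned distributions. This is a sketch under the assumption that both inputs are already bin fractions with no empty bins in the second argument; the name `kl_divergence` and the example fractions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(P -> Q) for discrete distributions given as bin fractions.

    Assumes q has no zero entries wherever p is positive.
    """
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


# Hypothetical bin fractions for a baseline (a) and production (b) window.
a = [0.5, 0.4, 0.1]
b = [0.2, 0.3, 0.5]

kl_ab = kl_divergence(a, b)
kl_ba = kl_divergence(b, a)

# Adding both directions gives the symmetric PSI formula over the same bins.
psi_ab = sum((x - y) * math.log(x / y) for x, y in zip(a, b))
```

The two directions give different scores, and their sum matches the PSI formula, which is why KL can be described as the non-symmetric version of PSI.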
JS Distance
Similar to Kullback-Leibler divergence but with two distinct advantages: it is always finite and it is symmetric. It offers an interpretable score ranging from 0, indicating identical distributions, to 1, indicating completely different distributions with no overlap. It is slightly more sensitive than PSI and KL but less sensitive than KS, and its results can still be influenced by the chosen binning strategy.
Similar to KL except in two areas: JS is always finite and symmetric
Interpretable on a scale from 0 to 1 (PSI doesn't have this property, as it ranges from 0 to infinity)
0 = identical distributions
1 = completely different with no overlap
Mildly sensitive compared to PSI and KL, but not as sensitive as KS
Binning strategy can affect results
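The 0-to-1 range can be illustrated with a short sketch of Jensen-Shannon divergence, assuming base-2 logarithms (the choice that bounds the score at 1). Whether the reported score is this divergence or its square root (the JS distance) is not stated here, so treat the exact scaling as an assumption; both share the same 0 and 1 endpoints.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sets of bin fractions."""
    mixture = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(u, v):
        # The mixture is positive wherever u is, so this is always finite.
        return sum(x * math.log2(x / y) for x, y in zip(u, v) if x > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
partial = js_divergence([0.5, 0.4, 0.1], [0.2, 0.3, 0.5])
partial_swapped = js_divergence([0.2, 0.3, 0.5], [0.5, 0.4, 0.1])
```

Identical inputs score 0, non-overlapping inputs score 1, and swapping the arguments does not change the result.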
KS Statistic
A non-parametric metric that requires neither assumptions about the underlying data nor binning, making it a sensitive tool for detecting drift, even in large datasets. A smaller p-value from the KS test signifies more confident drift detection. This sensitivity lets it detect even slight differences in data distribution, though it can also result in more false positives.
Non-parametric, so it doesn't make assumptions about the underlying data
It doesn't require binning to calculate, so binning strategy doesn't affect this metric
The KS test returns a p-value
A smaller p-value means more confident drift detection
KS is the most sensitive metric among all the drift metrics
Larger datasets make KS increasingly more sensitive
Will produce more false positives
Detects very slight differences
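To see why larger datasets make KS more sensitive, the sketch below computes the two-sample KS statistic without binning and converts it to a p-value using the standard asymptotic (Kolmogorov distribution) approximation. The function names and the synthetic, evenly spaced samples are hypothetical, and the approximation is an assumption for illustration rather than a description of Arize's implementation.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs (no binning needed)."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


def ks_p_value(d, n, m, terms=100):
    """Asymptotic two-sided p-value for a KS statistic d with sample sizes n, m."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the series converges slowly here; p is effectively 1
    series = sum(
        (-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2 * series))


def shifted_samples(n, shift=0.05):
    """Evenly spaced baseline on [0, 1) and a copy shifted by a small amount."""
    baseline = [i / n for i in range(n)]
    return baseline, [x + shift for x in baseline]


small_a, small_b = shifted_samples(100)
large_a, large_b = shifted_samples(10_000)

d_small = ks_statistic(small_a, small_b)
d_large = ks_statistic(large_a, large_b)
p_small = ks_p_value(d_small, len(small_a), len(small_b))
p_large = ks_p_value(d_large, len(large_a), len(large_b))
```

The same small shift yields nearly the same statistic at both sizes, but the p-value is close to 1 for 100 rows and close to 0 for 10,000 rows, which is how a slight difference becomes a confident alert on large datasets.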
Data Quality Monitor Metrics
Percent Empty
The percent of nulls in your model features. Percent empty for list of strings features counts both empty lists and NULL values.
A high percentage can significantly influence model performance and a sudden increase in null values could indicate a problem in the data collection or preprocessing stages.
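As a minimal sketch of the percent empty rule described above, the function below counts both `None` (standing in for NULL) and empty lists as empty. The function name and sample rows are hypothetical.

```python
def percent_empty(values):
    """Percent of rows that are None (NULL) or an empty list."""
    if not values:
        return 0.0
    empty = sum(1 for v in values if v is None or v == [])
    return 100.0 * empty / len(values)


# Hypothetical feature columns: a list-of-strings feature and a numeric one.
tags = [["a", "b"], [], None, ["c"]]
amounts = [1.0, None, 3.0, 4.0]

tags_empty = percent_empty(tags)
amounts_empty = percent_empty(amounts)
```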
Cardinality (Count Distinct)
The cardinality of your categorical features. Changes in your feature cardinality could indicate a change in the feature pipeline, or a new or deprecated product feature that your model has not adapted to yet.
New Values
Count of new unique values that appear in production but not in the baseline. Use it to identify concept drift or changes in the data distribution over time. These new values may not have been accounted for during model training and could therefore lead to unreliable predictions.
Missing Values
Count of unique values that appear in the baseline but not in production. This can indicate changes in data generation processes or an issue with data collection in the production environment.
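Cardinality and the two set-difference counts described above can be sketched with plain Python sets. The function name, dictionary keys, and sample category values are hypothetical.

```python
def compare_categories(baseline, production):
    """Cardinality plus the values gained and lost relative to the baseline."""
    base, prod = set(baseline), set(production)
    return {
        "cardinality": len(prod),
        "new_values": prod - base,      # in production, not in baseline
        "missing_values": base - prod,  # in baseline, not in production
    }


# Hypothetical values of a categorical feature in each dataset.
baseline_states = ["CA", "NY", "TX", "CA", "WA"]
production_states = ["CA", "NY", "FL", "NV", "NY"]

result = compare_categories(baseline_states, production_states)
```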
Quantiles
p99.9, p99, p95, p50. Quantiles give a detailed view of the statistical properties of the data and its spread. Any significant shift in these quantiles could indicate a change in the data distribution and a need for retraining.
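For intuition about what these quantiles measure, here is a sketch using the nearest-rank definition of a percentile. Other interpolation methods exist and may give slightly different values, so the method, the function name, and the sample data are assumptions for illustration.

```python
import math
from fractions import Fraction


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of rows at or below it."""
    ordered = sorted(values)
    # Fraction avoids floating-point rounding when p is something like 99.9.
    rank = math.ceil(Fraction(str(p)) / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical numeric feature with values 1..1000.
latencies = list(range(1, 1001))

quantiles = {p: percentile(latencies, p) for p in (50, 95, 99, 99.9)}
```

Comparing these values between a baseline window and a production window is one way to spot a shift in the tail that an average would hide.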
Sum
The sum of your numeric data over the evaluation window. Use it to detect anomalies or shifts in the data distribution. Significant changes in the sum might indicate data errors, outliers, or systemic changes in the process that generates the data.
Count
Traffic count of predictions, features, etc. Can be used with filters. Use it to ensure there aren't any unexpected surges or drops in traffic that could affect performance, and to gain insight into usage patterns for better resource management and planning.
Average
Average of your numeric data over the evaluation window. A change in the average may indicate a systematic bias, a change in the data collection process, or the introduction of anomalies, any of which can adversely impact performance and signal that your model may need retraining.
Average List Length / Average Vector Length
This metric calculates the average of all the list lengths from each row and is available only for the list of strings data type.
Note: This metric omits empty lists and NULL values, since missing values are captured by the percent empty metric.
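The averaging rule in the note above can be sketched as follows, with `None` standing in for NULL. The function name and sample rows are hypothetical.

```python
def average_list_length(rows):
    """Mean list length, skipping None (NULL) and empty lists."""
    lengths = [len(row) for row in rows if row]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Hypothetical list-of-strings feature: two populated rows, two empty ones.
rows = [["a", "b"], [], None, ["c", "d", "e", "f"]]

avg_length = average_list_length(rows)
```

Only the two populated rows contribute here, giving an average of 3 rather than 1.5.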
Custom Metrics
Couldn't find your metric above? Arize supports the ability to monitor custom metrics using SQL. Here is an example of a custom metric for the percent of a loan that is outstanding:
Learn how to create custom metrics here.