Model Metric Definitions

Definitions for common model metrics


Accuracy

Accuracy measures how often the model's predictions are correct. It is calculated as the number of correct predictions divided by the total number of predictions.
Accuracy = correct predictions / all predictions
Example: There are 100 credit card transactions; 90 transactions are legitimate and 10 transactions are fraudulent. If your model predicts that 95 transactions are legitimate (all 90 legitimate transactions plus 5 fraudulent ones) and 5 transactions are fraudulent, its accuracy is:
95% = (90 correct legitimate + 5 correct fraudulent) / 100 transactions


Percentiles

Percentiles help you assess data quality while accounting for outlier events, giving you a more representative understanding of your data. The Arize platform supports P50, P95, and P99 for data quality monitors:
  • P50 - Median data performance
  • P95 & P99 - Outlier data performance


Precision

Precision is the fraction of values that actually belong to the positive class out of all the values predicted to belong to that class.
precision = true positives / (true positives + false positives)
Example: There are 100 credit card transactions; 80 transactions are legitimate (positive class) and 20 transactions are fraudulent. If your model predicts that 85 transactions are legitimate, and those 85 include all 80 truly legitimate transactions, its precision is:
94.12% = 80 true positives / (80 true positives + 5 false positives)


Recall

Recall is the fraction of values correctly predicted to be of the positive class out of all the values that truly belong to the positive class (including false negatives).
recall = true positives / (true positives + false negatives)
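As a minimal sketch, precision and recall can both be computed from confusion-matrix counts. The function name `precision_recall` is illustrative, and the counts reuse the precision example above under the assumption that every truly legitimate transaction is among the 85 predicted legitimate.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# 80 legitimate (positive) transactions, 85 predicted legitimate:
# 80 true positives, 5 false positives, and (by assumption) 0 false negatives.
precision, recall = precision_recall(tp=80, fp=5, fn=0)
print(f"precision={precision:.4f}, recall={recall:.4f}")
```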

KL Divergence

The Kullback-Leibler (KL) divergence measures how much one probability distribution differs from a reference probability distribution. It is not symmetric: swapping the two distributions generally changes the value. KL divergence is sometimes referred to as ‘relative entropy’ and is best used when one sample distribution is much smaller and has large variance.
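A minimal sketch of KL divergence for two discrete (binned) distributions follows. The bin frequencies are hypothetical, natural logarithms are assumed, and the reference distribution is assumed to be non-zero wherever the compared distribution is non-zero.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same bins."""
    # Bins where p is zero contribute nothing to the sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


reference = [0.5, 0.3, 0.2]   # hypothetical baseline bin frequencies
production = [0.4, 0.4, 0.2]  # hypothetical production bin frequencies

kl_prod_vs_ref = kl_divergence(production, reference)
kl_ref_vs_prod = kl_divergence(reference, production)
print(kl_prod_vs_ref, kl_ref_vs_prod)  # the two directions differ
```

The two printed values are close but not equal, which illustrates that KL divergence depends on which distribution is treated as the reference.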

JS Distance

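JS distance is a symmetric measure derived from KL divergence, and it is used to measure drift. In addition to being an actual metric (as opposed to KL), it is bounded: its value lies between 0 and sqrt(ln 2) when natural logarithms are used (between 0 and 1 with base-2 logarithms). For two distributions P and Q with mixture M = (P + Q) / 2, the formula for JS distance is:
JS distance(P, Q) = sqrt( (KL(P || M) + KL(Q || M)) / 2 )
Use JS distance to compare distributions with low variance.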
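The sketch below computes JS distance as the square root of the average KL divergence of each distribution from their mixture M = (P + Q) / 2. It assumes discrete distributions over the same bins and natural logarithms; the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats; bins where p is zero contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    js_divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Clamp tiny negative rounding errors before taking the square root.
    return math.sqrt(max(js_divergence, 0.0))


p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(js_distance(p, q), js_distance(q, p))  # same value in both directions

# Two distributions with no overlap reach the upper bound sqrt(ln 2).
print(js_distance([1.0, 0.0], [0.0, 1.0]), math.sqrt(math.log(2)))
```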

KS Statistic

The KS test statistic is a drift measurement that quantifies the maximum distance between two cumulative distribution functions. The KS test is an efficient and general way to measure whether two distributions differ significantly from one another.
PSI and rank ordering tests focus more on how the population may have shifted between development and validation periods, while the KS statistic is used to assess the predictive capability and performance of the model.
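To make "maximum distance between two cumulative distribution functions" concrete, here is a small sketch that builds both empirical CDFs and takes the largest gap. The sample values are hypothetical, and the quadratic-time loop favors readability over speed.

```python
def ks_statistic(sample_a, sample_b):
    """Maximum absolute gap between the empirical CDFs of two samples."""
    max_gap = 0.0
    # The gap can only change at observed values, so check each one.
    for x in sorted(set(sample_a) | set(sample_b)):
        cdf_a = sum(v <= x for v in sample_a) / len(sample_a)
        cdf_b = sum(v <= x for v in sample_b) / len(sample_b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


baseline = [1, 2, 3, 4]      # hypothetical baseline feature values
production = [3, 4, 5, 6]    # hypothetical production feature values
print(ks_statistic(baseline, production))
```

A value of 0 means the two empirical CDFs coincide, and a value of 1 means the samples do not overlap at all.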

Population Stability Index (PSI)

Population Stability Index measures how much a variable's distribution has shifted between two samples over a given period of time. With both samples split into the same bins, PSI is calculated as:
PSI = sum over all bins of (% Actual - % Expected) x ln(% Actual / % Expected)
The larger the PSI, the less similar your distributions are, which allows you to set up thresholding alerts on the drift in your distributions. PSI is a great metric for both numeric and categorical features where distributions are fairly stable.
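A minimal sketch of the PSI calculation, summed over bins, is shown here. The bin percentages are hypothetical, and the small `eps` floor that avoids dividing by or taking the log of zero is an assumption of this sketch, not a documented platform behavior.

```python
import math


def psi(expected_pct, actual_pct, eps=1e-6):
    """Population Stability Index over matching bins (fractions summing to 1)."""
    total = 0.0
    for expected, actual in zip(expected_pct, actual_pct):
        # Assumption: floor empty bins at eps so the logarithm stays defined.
        expected, actual = max(expected, eps), max(actual, eps)
        total += (actual - expected) * math.log(actual / expected)
    return total


expected = [0.25, 0.50, 0.25]  # hypothetical baseline bin shares
actual = [0.20, 0.50, 0.30]    # hypothetical production bin shares
print(round(psi(expected, actual), 4))
```

Every term in the sum is non-negative, so PSI is 0 only when the binned distributions match and grows as they diverge.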


Root Mean Square Error (RMSE)

Root Mean Square Error (also known as root mean square deviation, RMSD) is a measure of the average magnitude of error in quantitative data predictions. It can be thought of as the distance between the vector of predicted values and the vector of observed (or actual) values, normalized by the square root of the number of observations.
Because errors are squared before they are averaged, this measure gives higher weight to large errors and is therefore useful in cases where you want to penalize models for them.


Mean Absolute Scaled Error (MASE)

Mean absolute scaled error (MASE) is an accuracy metric for forecasting. It is the mean absolute error of the forecast values, normalized by the mean absolute error of the naive forecast. The naive forecast refers to a simple forecasting method that uses the demand value of a previous time point as the forecast for the next time point. A lower MASE indicates higher accuracy.
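The following sketch shows one way to compute MASE. As a simplifying assumption, the naive forecast error is measured on the same series as the forecast (formal definitions typically use a separate training series), and the demand values are hypothetical.

```python
def mase(actuals, forecasts):
    """MAE of the forecast divided by MAE of the one-step naive forecast."""
    n = len(actuals)
    forecast_mae = sum(abs(a - f) for a, f in zip(actuals, forecasts)) / n
    # Naive forecast: the previous actual value predicts the next one.
    naive_mae = sum(abs(actuals[t] - actuals[t - 1]) for t in range(1, n)) / (n - 1)
    return forecast_mae / naive_mae


actuals = [10, 12, 14, 16]     # hypothetical demand values
forecasts = [11, 12, 13, 16]   # hypothetical model forecasts
mase_value = mase(actuals, forecasts)
print(mase_value)
```

A value below 1 means the forecast beats the naive method on average, and a value above 1 means it does worse.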


Mean Square Error (MSE)

Mean Square Error is a regression loss measure. The MSE is calculated as the squared difference between the model’s predictions and ground truth, averaged across the dataset. It is used to check how close the predicted values are to the actual values. As with RMSE, a lower value indicates a better fit, and it heavily penalizes large errors or outliers.


Mean Absolute Error (MAE)

Mean Absolute Error is a regression loss measure that takes the absolute difference between a model’s predictions and ground truth, averaged across the dataset. Unlike MSE, MAE is weighted on a linear scale and therefore doesn’t put as much weight on outliers. This provides a more even measure of performance, but it means each unit of error counts the same whether it comes from a large miss or a small one, which is worth considering for your specific model use case.
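To illustrate how RMSE, MSE, and MAE treat large errors differently, the sketch below compares two hypothetical sets of predictions with the same total absolute error: one spreads the error evenly, the other concentrates it in a single outlier.

```python
import math


def mse(actuals, preds):
    """Mean of squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actuals, preds)) / len(actuals)


def rmse(actuals, preds):
    """Square root of the mean squared error."""
    return math.sqrt(mse(actuals, preds))


def mae(actuals, preds):
    """Mean of absolute errors."""
    return sum(abs(a - p) for a, p in zip(actuals, preds)) / len(actuals)


actuals = [10.0, 12.0, 14.0, 16.0]
even_errors = [11.0, 13.0, 15.0, 17.0]  # every prediction is off by 1
one_outlier = [10.0, 12.0, 14.0, 20.0]  # a single prediction is off by 4

for name, preds in [("even", even_errors), ("outlier", one_outlier)]:
    print(name, mae(actuals, preds), mse(actuals, preds), rmse(actuals, preds))
```

MAE is identical for both sets, while MSE and RMSE are larger for the set containing the outlier.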


Mean Absolute Percentage Error (MAPE)

Mean Absolute Percentage Error is one of the most common metrics of model prediction accuracy and the percentage equivalent of MAE. MAPE measures the average magnitude of error produced by a model, expressed relative to the actual values, or how far off predictions are on average. See MAE for considerations when using this metric.


Normalized Discounted Cumulative Gain (NDCG)

NDCG measures a model's ability to rank query results in order of relevance, from highest to lowest. Actual relevance scores are usually determined by user interaction. For example, if users tend to click on results ranked high on the list, then the NDCG value will be high. Conversely, if users tend to click on query results that are ranked low on the list, the ranking model is doing poorly, and the NDCG value will be low. NDCG values range between 0 and 1, with 1 being the best. Arize computes NDCG using the standard log2 discount function.
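A minimal sketch of NDCG with the log2 discount is given below. It assumes the gain of each result is its raw relevance score (some formulations use 2^relevance - 1 instead), and the relevance lists are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given order divided by DCG of the ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


print(ndcg([3, 2, 1, 0]))  # already in ideal order
print(ndcg([0, 1, 2, 3]))  # most relevant results ranked last
```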


F-Score

The F-score is the harmonic mean of precision and recall. It combines the two into a single number for a better understanding of the accuracy of the model. The F-score can be computed as F0.5, F1, or F2 depending on how much weight is given to precision relative to recall: F0.5 favors precision, F1 weights both equally, and F2 favors recall.
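The general F-beta formula covers all three variants. The sketch below uses hypothetical precision and recall values; the function name `f_beta` is illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


precision, recall = 0.5, 1.0  # hypothetical model with low precision, high recall
f_half = f_beta(precision, recall, beta=0.5)  # leans toward precision
f_one = f_beta(precision, recall, beta=1.0)   # balanced
f_two = f_beta(precision, recall, beta=2.0)   # leans toward recall
print(f_half, f_one, f_two)
```

Because this hypothetical model has low precision and high recall, the score rises as beta shifts the weight toward recall.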

Logarithmic Loss

Logarithmic loss tracks incorrect labeling of the data class by the model and penalizes the model when its predicted probabilities deviate from the actual labels. The penalty grows sharply for confident predictions that turn out to be wrong. Low log loss values correspond to high accuracy.
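As a rough illustration for the binary case, the sketch below averages the negative log-likelihood of the true labels. Clipping probabilities with `eps` is an assumption added here to keep the logarithm finite; the labels and probabilities are hypothetical.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Average binary log loss; probs are predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0]
confident_right = log_loss(labels, [0.9, 0.1])
confident_wrong = log_loss(labels, [0.1, 0.9])
print(confident_right, confident_wrong)
```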

Performance Impact Score

Performance impact score is a measure of how much worse your metric of interest is on the slice compared to the average.

PR Curve

The Precision-Recall curve plots precision against recall at particular cut-off values, with the cut-off values being set according to the particular model.


ROC AUC

The Receiver Operating Characteristic (ROC) curve is a probability curve that plots the true positive rate (TPR) against the false positive rate (FPR). Area Under the Curve (AUC) is an aggregate measure of performance across all possible classification thresholds. Together, ROC AUC represents the degree of separability, or how well a model is capable of distinguishing between classes.
The higher the AUC (i.e. closer to 1), the better the model is at predicting 0 class as 0, and 1 class as 1.
FPR = X = fp / (tn + fp) => error (division by zero) when all actuals are positive
TPR = Y = tp / (tp + fn) => error (division by zero) when all actuals are negative
To calculate AUC or PR AUC:
Thresholds: We first generate a confusion matrix for each threshold. There are 20 thresholds by default, so there are 20 confusion matrices as well.
Points: From each confusion matrix, we calculate the false positive rate (x-axis) and true positive rate (y-axis) for AUC. As a result, we have a set of 20 points (x and y coordinates) for AUC or PR AUC.
Area: The next step is to calculate the area under the curve for the 20 points. We first sort the points in order of increasing x, with a secondary sort on y to keep the result consistent between runs. Following the trapezoidal rule, we take the difference between the x values of two consecutive points as delta x and multiply it by the average of their y values to get the area between the two points. Lastly, we sum these areas across all 20 points to get the overall AUC.
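The threshold-and-trapezoid procedure described above can be sketched as follows. This is not the platform's implementation: the even spacing of the 20 thresholds over [0, 1], the `>=` comparison, and the sample labels and scores are all assumptions made for illustration.

```python
def roc_points(labels, scores, num_thresholds=20):
    """Return one (FPR, TPR) point per threshold, sorted by x and then y."""
    points = []
    for i in range(num_thresholds):
        # Assumption: thresholds are evenly spaced over [0, 1].
        threshold = i / (num_thresholds - 1)
        tp = fp = tn = fn = 0
        for label, score in zip(labels, scores):
            predicted_positive = score >= threshold
            if predicted_positive and label == 1:
                tp += 1
            elif predicted_positive:
                fp += 1
            elif label == 1:
                fn += 1
            else:
                tn += 1
        # Requires at least one actual negative and one actual positive.
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return sorted(points)


def trapezoid_auc(points):
    """Sum the trapezoid areas between consecutive sorted points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = trapezoid_auc(roc_points(labels, scores))
print(f"AUC = {auc:.2f}")
```

With a fixed number of thresholds, the result approximates the exact AUC; scores that fall between the same pair of thresholds cannot be told apart.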

Group AUC

Group AUC (gAUC) can be used to evaluate the performance of a ranking model in a group setting. A ranking model assigns a score or rank to each item in a dataset, and the goal is to rank items correctly within each group rather than across the dataset as a whole.
An example of a use case of gAUC for ranking is a recommendation system where the model is trying to recommend movies to users, but the movies are grouped based on their genre. The gAUC would measure the performance of the ranking model for each group separately and then average the AUCs across groups. This allows you to evaluate the performance of the ranking model for different genres and to detect if the ranking model has any bias towards certain genres.
A gAUC of 1 would indicate perfect performance for all groups and a gAUC of 0.5 would indicate a performance no better than random guessing for all groups. A value less than 0.5 would indicate a negative bias for certain groups.
It is important to note that for ranking problems, the AUC metric is calculated by comparing predicted ranks rather than binary classifications.
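A minimal sketch of gAUC follows: compute an AUC within each group by comparing every relevant item's score against every non-relevant item's score, then average across groups. The unweighted average, the tie handling (half credit), and the genre data are assumptions of this sketch; some definitions weight groups by size.

```python
def pairwise_auc(labels, scores):
    """Fraction of (relevant, non-relevant) pairs ranked in the right order."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def group_auc(groups):
    """Unweighted mean of per-group AUCs; groups maps name -> (labels, scores)."""
    aucs = [pairwise_auc(labels, scores) for labels, scores in groups.values()]
    return sum(aucs) / len(aucs)


groups = {
    "comedy": ([1, 0, 0], [0.9, 0.4, 0.2]),
    "drama": ([1, 0, 1, 0], [0.3, 0.6, 0.8, 0.1]),
}
gauc = group_auc(groups)
print(gauc)
```

Each group needs at least one relevant and one non-relevant item for its AUC to be defined.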


Sensitivity

Sensitivity is the fraction of actual positive cases that the model correctly predicts as positive. It is also called the true positive rate, and it is calculated the same way as recall.
sensitivity = true positives / (true positives + false negatives)


Specificity

Specificity is the fraction of values correctly predicted to be of the negative class out of all the values that truly belong to the negative class (including false positives). This measure is similar to recall, except that it describes how well the model identifies negative values. It is also called the true negative rate.
specificity = true negatives / (true negatives + false positives)
Example: There are 100 credit card transactions; 90 transactions are legitimate and 10 transactions are fraudulent (negative class). If your model predicts that 20 transactions are fraudulent, and those 20 include all 10 truly fraudulent transactions, its specificity is:
100% = 10 true negatives / (10 true negatives + 0 false positives)
The 10 legitimate transactions predicted as fraudulent are false negatives, so they lower recall rather than specificity.


Calibration

Calibration compares the output predicted by a system with the output actually observed.
Example: Let's say we are predicting credit card fraud and produce the following prediction/actual pairs (0.1, 0.1), (0.2, 0.1), (0.9, 0.7).
Our calibration calculation is as follows:
calibration = average prediction / average actual
= ((0.1 + 0.2 + 0.9)/3) / ((0.1 + 0.1 + 0.7)/3) = 0.4 / 0.3
= 1.333
In this case we see we are on average predicting higher (more fraud) than the ground truth.
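The calculation above can be reproduced in a few lines. The function name `calibration` is illustrative; the prediction/actual pairs are the ones from the example.

```python
def calibration(pairs):
    """Average prediction divided by average actual for (prediction, actual) pairs."""
    avg_prediction = sum(pred for pred, _ in pairs) / len(pairs)
    avg_actual = sum(actual for _, actual in pairs) / len(pairs)
    return avg_prediction / avg_actual


pairs = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.7)]
ratio = calibration(pairs)
print(round(ratio, 3))  # above 1: predictions run higher than actuals on average
```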


Weighted Average Percentage Error (WAPE)

Weighted Average Percentage Error is also referred to as the MAD/Mean ratio. The WAPE metric is the sum of the absolute errors normalized by the sum of actual values. WAPE penalizes under-forecasting and over-forecasting equally and does not favor either scenario.
WAPE = sum(absError)/sum(actuals)
When the total number of sales can be low or the product analyzed has intermittent sales, WAPE is recommended over MAPE. MAPE is commonly used to measure forecasting errors, but it can be deceiving when sales reach numbers close to zero or when sales are intermittent. WAPE counters this by weighting the error over total sales. WAPE is more robust to outliers than Root Mean Square Error (RMSE) because it uses the absolute error instead of the squared error.
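To show why a near-zero actual value distorts MAPE more than WAPE, the sketch below computes both on hypothetical sales figures that include one very small actual. It assumes all actual values are positive.

```python
def mape(actuals, forecasts):
    """Mean of per-row absolute percentage errors (as a fraction, not x100)."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actuals, forecasts)) / len(actuals)


def wape(actuals, forecasts):
    """Sum of absolute errors divided by the sum of actuals."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / sum(actuals)


actuals = [100, 1, 50]     # hypothetical sales; the middle row is close to zero
forecasts = [110, 3, 45]   # hypothetical forecasts

print(f"MAPE = {mape(actuals, forecasts):.3f}")
print(f"WAPE = {wape(actuals, forecasts):.3f}")
```

The middle row is off by only 2 units, yet it contributes a 200% error to MAPE, while WAPE weights it by its small share of total sales.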

Cardinality - New Values and Missing Values

Our new values and missing values metrics operate like set subtraction. For new values, the metric is the result of the set of production values minus baseline values. Example:
Let's assume you have production and baseline datasets with these unique values:
production = {'a', 'b', 'c'}
baseline = {'c', 'd'}
To calculate the new values - you do a set subtraction of production minus baseline. The actual metric value is the length of the result, although we also show the actual values for debugging.
new_values = production - baseline = {'a', 'b'}
number_of_new_values = len(new_values) = 2
For missing values, the result is baseline minus production.
missing_values = baseline - production = {'d'}
number_of_missing_values = len(missing_values) = 1
This metric is useful for capturing data inconsistencies and is more informative than cardinality alone. For example, a feature like state could have all uppercase values in training, like AL, AK, AZ, ..., whereas in production the values may all be lowercase: al, ak, az, .... Simply monitoring the cardinality wouldn't work, as both datasets will have 51 unique values. However, the new values metric will detect 51 new values between production and baseline, as well as 51 missing values between production and baseline.
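The set subtraction described above maps directly onto Python sets. This sketch uses an illustrative helper name and a three-state subset of the uppercase/lowercase example.

```python
def new_and_missing(production, baseline):
    """Return (new values, missing values) as sets of unique values."""
    production, baseline = set(production), set(baseline)
    return production - baseline, baseline - production


new_values, missing_values = new_and_missing(["a", "b", "c"], ["c", "d"])
print(len(new_values), len(missing_values))

# Same cardinality in both datasets, but every value differs by case.
baseline_states = ["AL", "AK", "AZ"]
production_states = ["al", "ak", "az"]
new_states, missing_states = new_and_missing(production_states, baseline_states)
print(len(new_states), len(missing_states))
```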

MAP@K (Mean Average Precision)

MAP (Mean Average Precision) @K is a metric used to evaluate the performance of a ranking model. MAP weighs errors to account for value differences between the top and bottom of the list, but it is limited to binary relevancy (relevant/non-relevant) and cannot account for order-specific details.
The higher the MAP@K score, the better the ranking algorithm or recommendation system performs.
Precision is the fraction of relevant items among the total number of items returned by the system. Average precision is the average of the precision values at each position where a relevant item is retrieved. MAP@K is the mean of the average precision over all searches, considering only the top K results of each.
[Figure: MAP@K with K=5, calculated for a ranking model across 3 searches]
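A small sketch of MAP@K with K=5 across 3 searches is shown here. The relevance lists are hypothetical (1 = relevant, 0 = not relevant), and average precision is normalized by the number of relevant items found in the top K, which is one of several conventions in use.

```python
def average_precision_at_k(relevance, k):
    """Mean of precision values at each rank (within top k) holding a relevant item."""
    hits = 0
    precisions = []
    for position, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


def map_at_k(searches, k):
    """Mean of average precision across all searches."""
    return sum(average_precision_at_k(r, k) for r in searches) / len(searches)


searches = [
    [1, 0, 1, 0, 0],  # relevant items at ranks 1 and 3
    [0, 1, 0, 0, 1],  # relevant items at ranks 2 and 5
    [0, 0, 0, 0, 0],  # nothing relevant returned
]
map_5 = map_at_k(searches, k=5)
print(round(map_5, 4))
```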

MRR (Mean Reciprocal Rank)

MRR (Mean Reciprocal Rank) is a metric used to evaluate the performance of a ranking model. For each list of recommendations, take the reciprocal of the rank of the first relevant item (1 for rank 1, 1/2 for rank 2, and so on); MRR is the sum of these reciprocal ranks divided by the total number of lists.
Because MRR averages the reciprocal rank of only the first relevant recommendation, it evaluates how well your algorithm surfaces the first relevant item.
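As a minimal sketch, MRR can be computed by finding the first relevant item in each list and averaging the reciprocal of its rank. The relevance lists are hypothetical (1 = relevant, 0 = not relevant), and a list with no relevant item is assumed to contribute 0.

```python
def mean_reciprocal_rank(searches):
    """Average of 1 / rank of the first relevant item in each list."""
    total = 0.0
    for relevance in searches:
        for rank, is_relevant in enumerate(relevance, start=1):
            if is_relevant:
                total += 1 / rank
                break  # only the first relevant item counts
    return total / len(searches)


searches = [
    [1, 0, 1, 0, 0],  # first relevant item at rank 1
    [0, 1, 0, 0, 1],  # first relevant item at rank 2
    [0, 0, 0, 0, 0],  # no relevant item
]
mrr = mean_reciprocal_rank(searches)
print(mrr)
```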