Definitions of common terms in data science and ML monitoring.
Accuracy is the fraction of predictions that the model got right. It is calculated by dividing the number of correct predictions by the total number of predictions.
Accuracy = Correct Predictions / All Predictions
Example: There are 100 credit card transactions; 90 transactions are legitimate and 10 transactions are fraudulent. If your model predicts that 95 transactions are legitimate and 5 transactions are fraudulent, and all 5 of its fraud predictions are correct, its accuracy is:
95% = (90 correct legitimate + 5 correct fraudulent) / 100 transactions
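The credit card example above can be reproduced with a minimal Python sketch. The function name `accuracy` and the label strings are illustrative assumptions, not part of any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 90 legitimate transactions, all predicted legitimate;
# 10 fraudulent transactions, 5 caught and 5 missed.
actual = ["legit"] * 90 + ["fraud"] * 10
predicted = ["legit"] * 90 + ["fraud"] * 5 + ["legit"] * 5
print(accuracy(actual, predicted))  # 0.95
```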
A baseline is the reference data or benchmark against which model performance is compared for monitoring purposes. Baselines can be training data, validation data, prior time periods of production data, or a previous model version, among others.
A baseline distribution refers to a model dataset used as a reference or comparison to the model’s current (production) distribution. Baseline distributions in Arize AI’s platform can be training datasets, validation datasets, or prior time periods of production.
Binary classification refers to machine learning tasks that have exactly two class labels. In general, binary classification involves one “positive” and one “negative” class.
Binning is a way to group a number of continuous values together into smaller cohorts or “bins”. The technique helps reduce the cardinality of data by representing the points in intervals -- for example, age ranges.
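As a rough illustration of the age-range example, the sketch below assigns ages to bins using the standard library. The bin edges and labels are hypothetical choices made for this example only.

```python
from bisect import bisect_right

# Hypothetical bin edges: [0, 18), [18, 35), [35, 50), [50, 65), [65, inf)
EDGES = [18, 35, 50, 65]
LABELS = ["0-17", "18-34", "35-49", "50-64", "65+"]

def age_bin(age):
    """Return the label of the bin that contains the given age."""
    return LABELS[bisect_right(EDGES, age)]

ages = [4, 18, 34, 35, 64, 65, 90]
print([age_bin(a) for a in ages])
# ['0-17', '18-34', '18-34', '35-49', '50-64', '65+', '65+']
```

Seven distinct ages collapse into five bins here, which is the reduction in cardinality the definition describes.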
A confusion matrix provides a summary of all prediction results of a classification problem. Each result is shown with its corresponding number of correct/incorrect predictions (see True Positive, True Negative, False Positive, False Negative), count values and classification criteria. By providing a neat summary of all possible results, the confusion matrix lets you know the ways your classification model could get confused when making the predictions. It helps identify errors and the type of errors made by the model and thus helps improve the accuracy of the classification model.
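For a binary classifier, the four cells of a confusion matrix can be counted directly. The following is a minimal sketch; the helper name `confusion_counts` and the sample data are assumptions made for illustration.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN and TN for a binary classifier."""
    counts = Counter({"TP": 0, "FP": 0, "FN": 0, "TN": 0})
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts

# Hypothetical data: "fraud" is treated as the positive class.
y_true = ["fraud"] * 10 + ["legit"] * 90
y_pred = ["fraud"] * 7 + ["legit"] * 3 + ["fraud"] * 2 + ["legit"] * 88
counts = confusion_counts(y_true, y_pred, positive="fraud")
print(dict(counts))  # {'TP': 7, 'FP': 2, 'FN': 3, 'TN': 88}
```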
A method of testing a new model or model version where only a small subset of production data flows through this model to verify response performance before making a complete cutover. This technique allows for deeper analysis and understanding of model behavior and can minimize risk that regressions severely impact the business or customers.
Classification models are used to predict categories or assign a class label. Any given data is classified into a set of categories or groups to determine its further use or for processing needs.
Data that can be classified into one category or a second category is known as binary data. For example, fraud or not fraud, male or female.
Data that can be classified into more than two categories or groups is known as multi-class data. For example, education level or household income bracket.
Current Distribution refers to the statistical distribution, or shape, of the dataset being generated by a machine learning model in production. The distributions of datasets in machine learning models are represented as functions that show the relationships between the various observations, and are presented visually as curves or graphs.
Data quality refers to the integrity and consistency of the data sets used. In monitoring machine learning performance, data quality measures include attributes such as missingness, out of range, P1 and P99, type mismatch, among others.
Machine learning is a subset of AI, and it consists of the techniques that enable computers to figure things out from the data and deliver AI applications. Deep learning, meanwhile, is a subset of machine learning that enables computers to solve more complex problems inspired by the human brain’s network of neurons.
A deep learning model normally refers to a neural network, typically with more than two layers. Deep learning is usually used for computationally dense tasks like computer vision (images) and natural language processing.
Disparate Impact (also called adverse impact) is a quantitative measure of the adverse treatment of protected classes. The calculation is the proportion of the sensitive group that received the positive outcome divided by the proportion of the base group that received the positive outcome.
Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well. (source: Wikipedia)
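The disparate impact calculation described above can be sketched as a ratio of two positive-outcome rates. The function name, parameter names, and sample counts below are hypothetical.

```python
def disparate_impact(sensitive_positive, sensitive_total, base_positive, base_total):
    """Ratio of the sensitive group's positive-outcome rate to the base group's rate."""
    sensitive_rate = sensitive_positive / sensitive_total
    base_rate = base_positive / base_total
    return sensitive_rate / base_rate

# Hypothetical counts: 30 of 100 in the sensitive group and 50 of 100 in the
# base group received the positive outcome.
ratio = disparate_impact(30, 100, 50, 100)
print(round(ratio, 2))  # 0.6
```

A ratio of 1 means both groups receive the positive outcome at the same rate; values further below 1 indicate that the sensitive group receives it less often.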
Drift is defined as the change in the data over time. It also refers to changes in the properties of the target variable over time due to unpredictable or unforeseen events.
- Data drift can be described as the change in the distribution of data, between the real-time data and the baseline data that was predicted or set beforehand.
- Concept drift is a change in the relationship between the model's inputs and its output.
Drift can be in any form. It can be gradual, recurring, or sudden. It can be a positive or negative drift. The change in data over time can affect model outcomes, making drift an important metric to monitor when it comes to model performance.
In natural language processing (see definition of ‘natural language processing’), embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that words that are closer in the vector space are expected to be similar in meaning.
Euclidean distance is calculated as the square root of the sum of the squared differences between the components of two vectors.
We can use Euclidean distance to compare two sets of embeddings: calculate the centroid of each set, then measure the Euclidean distance between the two centroids to see whether one may be drifting relative to the other.
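An example for Euclidean distance between two centroids, x and y, is given below for 2D, 3D, and N-dimensional vectors:
2D: d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2)
3D: d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + (x3 - y3)^2)
N-D: d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xN - yN)^2)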
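The centroid comparison can be sketched in a few lines of Python. The helper names and the tiny two-dimensional "embeddings" are made up for illustration; real embeddings would have many more dimensions.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def euclidean_distance(x, y):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

baseline = [[0.0, 0.0], [2.0, 0.0]]    # centroid (1, 0)
production = [[3.0, 4.0], [5.0, 4.0]]  # centroid (4, 4)
print(euclidean_distance(centroid(baseline), centroid(production)))  # 5.0
```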
The way the performance of a predictive model is quantified and calculated is known as the evaluation metric. It is used to evaluate the accuracy and the performance of the model used.
A machine learning infrastructure tool used to monitor and improve model performance. Think of it as the ledger or log of model activities and inferences. Evaluation stores are used to:
- Surface performance metrics in aggregate (or by slice) for any model, in any environment: production, validation, or training
- Monitor and identify drift, data quality issues, or anomalous performance degradations using baselines
- Enable teams to connect changes in performance to why they occurred
- Provide a platform to help deliver models continuously with high quality and feedback loops for improvement — compare production to training
- Provide an experimentation platform to A/B test model versions
The evaluation window is the period of time over which a metric is calculated, for instance, the previous 30 days. Any evaluation metric that can be computed over a duration of time can be visualised over an evaluation window.
It is defined as the extent to which the internal mechanics of a machine learning model can be explained in human-understandable terms. Put simply, it is the process of explaining the reasons behind a model's outputs (see definition, ‘SHAP’).
When a model mistakenly predicts the negative class for a value that belongs to the positive class.
Example: A model flags a credit card transaction as ‘not fraud’ when it is actually a fraudulent transaction.
When a model mistakenly predicts the positive class for a value that belongs to the negative class.
Example: A model flags a credit card transaction as ‘fraud’ when it was not actually a fraudulent transaction.
False Positive Rate (FPR) is the rate of "false alarms". In other words, it is the probability that an actual negative will be incorrectly predicted as positive.
Feature is a term of art for an input into a model. The term label is more fluid in use and can mean a couple of things: normally a label is the target of a prediction, but it can sometimes refer to the output of a model. The target of a prediction is also known as ground truth.
Feature importance refers to a class of techniques that take in all the features related to making a model prediction and assign a score to each feature to weigh how much or how little it impacted the outcome. These scores can then be used to better understand the internal logic of a model, make necessary changes to the model to improve its accuracy, and also reduce unnecessary inputs.
A feature performance heat map is a visual representation of the performance of each feature in a given model. It enables users to quickly see slices of performance or features that perform significantly better or worse than others for faster triangulation of issues. Heat maps are especially useful when troubleshooting.
A machine learning infrastructure tool that handles offline and online feature transformations. Think of it as the interface between your models and data. Feature stores are used to:
- Serve as the central source for feature transformations
- Allow for the same feature transformations to be used in both offline training and online serving
- Enable team members to share their transformations for experimentation
- Provide a strong versioning for feature transformation code
Measure of the harmonic mean of precision and recall. F-score combines these two measures into a single number for a better understanding of the accuracy of the model. Common variants are F0.5, F1, and F2, which differ in the weight given to precision relative to recall.
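A minimal sketch of the general F-beta formula shows how the weighting works. The function name `f_beta` and the sample precision and recall values are assumptions for illustration; beta below 1 favors precision, beta above 1 favors recall, and beta equal to 1 gives the plain harmonic mean.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model with low precision and perfect recall.
precision, recall = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(precision, recall, beta), 4))
```

Because this hypothetical model has high recall but low precision, the score rises as beta grows and more weight shifts to recall.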
The Kullback-Leibler Divergence metric measures how one probability distribution differs from a reference probability distribution. KL divergence is sometimes referred to as ‘relative entropy’ and is best used when one distribution is much smaller in sample and has a large variance.
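For discrete (binned) distributions, KL divergence can be sketched as below. This assumes both inputs are lists of probabilities over the same bins and that the reference distribution `q` is non-zero wherever `p` is non-zero; the names are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, p))  # 0.0, identical distributions
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # differs from the line above: KL is not symmetric
```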
LIME, or “Local Interpretable Model-Agnostic Explanations,” is an explainability method that attempts to provide local ML explainability. At a high level, LIME attempts to understand how perturbations in a model’s inputs affect the end-prediction of the model. Since it makes no assumptions about how the model reaches the prediction, it can be used with any model architecture, hence the “model-agnostic” part of LIME. The LIME explainability approach takes a single prediction and perturbs the inputs around that prediction's input values. It then builds a linear model from the feature perturbations, where the coefficients are the feature importances at this local prediction.
Tracks incorrect labelling of the data class by the model and penalises the model more heavily the further its predicted probabilities deviate from the true labels. Low log loss values generally correspond to high accuracy values.
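A minimal sketch of binary log loss follows, assuming labels are 0 or 1 and predictions are probabilities of the positive class. Clipping probabilities with a small `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0]
confident_and_right = log_loss(labels, [0.9, 0.1])
confident_and_wrong = log_loss(labels, [0.1, 0.9])
print(confident_and_right < confident_and_wrong)  # True
```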
Mean Absolute Error is a regression loss measure looking at the absolute value difference between a model’s predictions and ground truth, averaged out across the dataset. Unlike MSE, MAE is weighted on a linear scale and therefore doesn’t put as much weight on outliers. This provides a more even measure of performance, but it means that large errors are not penalized any more heavily than small ones, which is something to consider depending on your specific model use case.
Mean Absolute Percentage Error is one of the most common metrics of model prediction accuracy and the percentage equivalent of MAE. MAPE measures the average magnitude of error produced by a model, or how far off predictions are on average. See MAE for considerations when using this metric.
Mean absolute scaled error (MASE) is an accuracy metric for forecasting. It is the mean absolute error of the forecast values, normalized by the mean absolute error of the naive forecast. The naive forecast refers to a simple forecasting method that uses the demand value of a previous time point as the forecast for the next time point. A lower MASE indicates higher accuracy.
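The sketch below illustrates MASE under one simplifying assumption: the naive forecast error is computed on the same series being evaluated, with each point predicted by the previous actual value. Other conventions compute the naive error on a separate training series.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mase(actual, forecast):
    """Forecast MAE divided by the MAE of a one-step naive forecast."""
    # Naive forecast: each point is predicted by the previous actual value.
    naive_mae = mae(actual[1:], actual[:-1])
    return mae(actual, forecast) / naive_mae

actual = [10, 12, 14, 16]
forecast = [11, 12, 13, 16]
print(mase(actual, forecast))  # 0.25
```

A value below 1 means the forecast beats the naive method on this series.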
Mean Square Error is a regression loss measure. The MSE is measured as the difference between the model’s predictions and ground truth, squared and averaged out across the dataset. It is used to check how close the predicted values are to the actual values. As in RMSE, a lower value indicates a better fit, and it heavily penalizes large errors or outliers.
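To make the difference between MAE and MSE concrete, the sketch below computes both on a small made-up dataset that contains one large error. The function names and values are illustrative.

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 20]  # absolute errors: 2, 2, 3, 20

print(mean_absolute_error(actual, predicted))  # 6.75
print(mean_squared_error(actual, predicted))   # 104.25
```

The single error of 20 accounts for 400 of the 417 total squared error, which is why MSE reacts much more strongly to outliers than MAE.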
The performance of a machine learning model indicates its usability and ability to provide accurate results. Performance is usually measured in terms of metrics that apply to the specific type of machine model concerned. Here are some common metrics used according to the type of machine model:
- Regression based machine learning models - MSPE, MSAE, R Squared and Adjusted R Squared
- Classification - Precision-Recall, ROC-AUC, Accuracy, log-loss
- Unsupervised models - Rand index, Mutual information
A machine learning infrastructure tool that serves as a central model registry and tracks experiments. Think of it as the library or catalog of your models. Model stores are used to:
- Serve as a central repository of all models and model versions
- Allow for reproducibility of every model version
- Track the lineage and history of models
Monitor threshold refers to the value set for a model monitor, beyond which the model’s monitoring status will be triggered accordingly. The threshold value can be set on any specific performance metric such as accuracy, MSE, MAPE, etc.
NDCG measures a model's ability to rank query results in the order of the highest relevance. Actual relevance scores are usually determined by user interaction. For example, if users tend to click on results ranked high on the list, then the NDCG value will be high. Conversely, if users tend to click on query results that are ranked low on the list, it would mean that the ranking model is doing poorly, and the NDCG value will be low. NDCG values range between 0 and 1 with 1 being the highest. Arize computes NDCG using the standard log2 discount function.
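A minimal sketch of NDCG with a log2 discount is shown below. It assumes a linear gain (the raw relevance score), which is one common choice; some implementations use an exponential gain instead. The function names are illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with a log2 discount; rank starts at 1."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 1]))  # 1.0, most relevant results are ranked first
print(ndcg([1, 2, 3]))  # below 1, most relevant result is ranked last
```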
Natural language processing (NLP) models take text as input. The inputs to these models are typically sentences, such as: “This definition is so informative.” These inputs are broken up into tokens: “This” “definition” “is” “so” “informative.” Most commonly, a classification model runs on top of NLP.
Performance Impact Score indicates how a particular slice is performing relative to the average performance, weighted by volume. This is a proprietary metric created by Arize.
Slice Performance Impact = (Slice Performance - Overall Model Performance) * % of Volume
The idea here is that you want to understand which slices are performing best or worst, and weight this difference by volume. Performance is weighted by volume because the more volume a particular slice receives, the more impact it can have on overall performance.
A performance slice is a subset of model values of interest in performance analysis and troubleshooting. Slices can be formed from any model dimension, including specific periods of time, set of features, etc. Performance slice analysis is useful when the goal is to understand or troubleshoot a cohort of interest, such as with bias detection, where the generalized dataset might mask statistical nuances.
Population Stability Index looks at the magnitude by which a variable has changed or shifted in distribution between two samples over the course of a given time. PSI is calculated by summing the following term over all bins (or categories) of the variable:
PSI = sum over bins of (%Production - %Baseline) x ln(%Production / %Baseline)
The larger the PSI, the less similar your distributions are, which allows you to set up thresholding alerts on the drift in your distributions. PSI is a great metric for both numeric and categorical features where distributions are fairly stable.
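The PSI calculation can be sketched as follows, assuming the baseline and production data have already been binned into matching lists of proportions. Replacing empty bins with a small `eps` is an assumption added here to avoid division by zero and the logarithm of zero.

```python
import math

def psi(baseline_pct, production_pct, eps=1e-6):
    """Population Stability Index over matching bins of two distributions."""
    total = 0.0
    for b, p in zip(baseline_pct, production_pct):
        b, p = max(b, eps), max(p, eps)  # guard against empty bins
        total += (p - b) * math.log(p / b)
    return total

baseline = [0.25, 0.25, 0.50]
production = [0.50, 0.25, 0.25]
print(psi(baseline, baseline))    # 0.0, identical distributions
print(psi(baseline, production))  # 0.5 * ln(2), roughly 0.35
```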
Precision is the fraction of values that actually belong to a positive class out of all the values which were predicted to belong to that class.
Precision = True Positives / (True Positives + False Positives)
Example: There are 100 credit card transactions; 80 transactions are legitimate (positive class) and 20 transactions are fraudulent. If your model predicts that 85 transactions are legitimate, including all 80 that truly are, its precision is:
94.12% = 80 true positives / (80 true positives + 5 false positives)
The product of feature importance and drift (PSI -- population stability index).
Prediction Drift Impact = Feature Importance * Drift
The Precision-Recall curve plots the trade-off between precision and recall at different cut-off (threshold) values, with the cut-off values being set according to the particular model.
Quantiles are the points dividing the range of a probability distribution into intervals with equal probabilities.
Recall is the fraction of values predicted to be of a positive class out of all the values that truly belong to the positive class (including false negatives)
recall = predicted true positives / (true positives + false negatives)
The Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. Area Under the Curve (AUC) is an aggregate measure of performance across all possible classification thresholds. Together, ROC-AUC represents the degree of separability, or how much a model is capable of distinguishing between classes.
The higher the AUC (i.e. closer to 1), the better the model is at predicting 0 class as 0, and 1 class as 1.
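One way to see what AUC measures is its pairwise interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute force under that interpretation; the function name and sample scores are illustrative, and this approach is only practical for small datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly, giving 0.75.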
Regression analysis is a fundamental concept in data science and machine learning. It helps quantify the relationship between the inputs into a model and its outputs. Essentially, it is an estimation of how a set of independent variables affects a dependent variable.
Root Mean Square Error (also known as root mean square deviation, RMSD) is a measure of the average magnitude of error in quantitative data predictions. It can be thought of as the normalized distance between the vector of predicted values and the vector of observed (or actual) values.
Because errors are squared before being averaged, this measure gives higher weight to large errors, and is therefore useful in cases where you want to penalize models accordingly.
Score models generate a numeric value as their prediction or output. For example, the likelihood that an input belongs to a category.
Sensitivity is the fraction of actual positive cases that a model correctly predicts as positive. It is also called the true positive rate.
sensitivity = predicted true positives / (true positives + false negatives)
A method of testing a candidate model for production where production data runs through the model without the model actually returning predictions to the service or customers. Essentially, simulating how the model would perform in the production environment.
SHAP stands for “Shapley Additive Explanations,” a concept derived from game theory and used to explain the output of machine learning models (see definition of ‘Explainability’). SHAP values help interpret how much a given feature or input contributes, positively or negatively, to the target outcome or prediction. See ‘Feature Importance’
Symmetric Mean Absolute Percentage Error (sMAPE) is an accuracy metric based on percentage. By dividing by both actual and predicted values and normalizing the relative errors, sMAPE overcomes the asymmetric shortcomings found in MAPE. sMAPE ranges with a lower bound of 0% and an upper bound of 200% (in the Arize platform this is reflected as 0 to 2), which enables models that have forecasts higher than actuals to attain more accurate negative percentage error approximations.
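A minimal sketch of sMAPE on the 0 to 2 scale is given below. It uses the common form that divides each absolute error by the average of the absolute actual and forecast values; treating a point where both values are zero as a perfect forecast is an assumption made here.

```python
def smape(actual, forecast):
    """Symmetric MAPE on a 0 to 2 scale (multiply by 100 for 0% to 200%)."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        # Treat a point where both values are zero as a perfect forecast.
        total += abs(f - a) / denom if denom else 0.0
    return total / len(actual)

print(smape([100, 100], [100, 100]))  # 0.0, perfect forecast
print(smape([100], [0]))              # 2.0, the upper bound
```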
Specificity is the fraction of values predicted to be of a negative class out of all the values that truly belong to the negative class (including false positives). This measure is similar to recall, but describes how well the model does at correctly predicting negative values. It is also called the true negative rate.
specificity = predicted true negatives / (true negatives + false positives)
Example: There are 100 credit card transactions; 90 transactions are legitimate and 10 transactions are fraudulent (negative class). If your model flags only 5 of the 10 fraudulent transactions as fraudulent and predicts the other 5 as legitimate, its specificity is:
50% = 5 true negatives / (5 true negatives + 5 false positives)
Data in a table format, with columns and rows. Inputs of the model are in a table format (e.g. an Excel spreadsheet), where columns might be feature inputs (e.g. city, state, charge amount). NLP and image data do not fit in an Excel sheet, since the inputs are sentences or images.
A tag is used to store extra information or metadata alongside a prediction. Tags differ from features in that they are not actual inputs to the model.
When a model correctly predicts the negative class for a value that belongs to the negative class.
Example: A model flags a credit card transaction as ‘not fraud’ when it is actually a legitimate transaction.
When a model correctly predicts the positive class for a value that belongs to the positive class.
Example: A model flags a credit card transaction as ‘fraud’ when it is actually a fraudulent transaction.
Weighted Average Percentage Error, also referred to as the MAD/Mean ratio. The WAPE metric is the sum of the absolute error normalized by the sum of actual values. WAPE equally penalizes for under-forecasting or over-forecasting, and does not favor either scenario.
WAPE = sum(absError)/sum(Actuals)
When the total number of sales can be low or the product analyzed has intermittent sales, WAPE is recommended over MAPE. MAPE is commonly used to measure forecasting errors, but it can be deceiving when sales reach numbers close to zero, or when sales are intermittent. WAPE is a measure that counters this by weighting the error over total sales. WAPE is more robust to outliers than Root Mean Square Error (RMSE) because it uses the absolute error instead of the squared error.
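The contrast between WAPE and MAPE on intermittent sales can be sketched as below. The function names and the sales figures are made up for illustration, and the `mape` helper is included only to show where it breaks down.

```python
def wape(actual, forecast):
    """Sum of absolute errors divided by the sum of actual values."""
    abs_error = sum(abs(a - f) for a, f in zip(actual, forecast))
    return abs_error / sum(actual)

def mape(actual, forecast):
    """Mean of per-point absolute percentage errors."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

sales = [0, 1, 10]     # intermittent sales, including a zero
forecast = [1, 1, 8]

print(wape(sales, forecast))  # 3 / 11, roughly 0.27
try:
    mape(sales, forecast)
except ZeroDivisionError:
    print("MAPE is undefined when an actual value is zero")
```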
When using the worst performing slice tool inside of Arize, we can understand which slices have the most effect on our overall performance. Here we can see the difference between the performance excluding the slice and our overall model performance, expressed as a percentage of the overall performance. Note that this calculation inherently accounts for the volume of each slice.
Worst Performing Slice = ((Performance Excluding Slice - Overall Performance) / Overall Performance) * 100