Arize AI
Overview of how to use Arize for fraud models
Check out our Credit Card Fraud Colab for an interactive demo, our best practices blog for additional industry context, and our fraud webinar to see how you can leverage ML Observability for your fraud models.


In this walkthrough, we will use the Arize platform to monitor a Fraud Detection model's performance.
This use case will walk you through how to manage performance for your fraud detection models. You may have achieved optimal model performance in the training environment after spending a great deal of time collecting data and training your model, but it's common for that performance to degrade in production. With limited tooling to monitor model performance, identify root-cause issues, or gain insight into how to actively improve your model, troubleshooting performance issues can prove difficult.
In this walkthrough, we will look at a few scenarios common to a fraud model when monitoring for performance.
You will learn to:
  1. Get training, validation, and production data into the Arize platform
  2. Set up performance monitors for False Negative Rate, False Positive Rate, and Accuracy
  3. Create customized dashboards
  4. Discover the root cause of issues
  5. Identify business impact

Set up a Baseline

Set up a baseline for easy comparison of your model's behavior in production to a training set, validation set, or an initial model launch period. Specifically, we want to see when/where our model's behavior has changed drastically. Setting up a baseline will help identify potential model issues, changing user behaviors, or even changes in our model's concepts (what people consider fraud might change over time).
Setting up a baseline will surface changes in the distributions of:
  1. Model predictions — fraud / non-fraud
  2. Model input data — feature values
  3. Ground truth / actual values — was this transaction actually fraud?

Choosing a baseline

In the case of our credit card fraud model, we have a training environment that we want to compare production against, so we set up the baseline by selecting pre-prod training.
With fraud detection, your training set often contains an exaggerated proportion of fraud for the purpose of training your model. In the real world, however, credit card fraud models see fewer than 1% fraudulent transactions. If this is true for your training environment, you may want to select an initial model launch period from production as your baseline (assuming that period of time looks as expected).
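A quick prevalence check can inform this choice. The sketch below compares the fraud rate in a (hypothetical, oversampled) training set against a production window; the column name `actual_label` and the 5% tolerance are illustrative assumptions, not Arize defaults.

```python
import pandas as pd

# Hypothetical datasets: each has a binary "actual_label" column (1 = fraud).
train = pd.DataFrame({"actual_label": [1] * 300 + [0] * 700})  # oversampled training set
prod = pd.DataFrame({"actual_label": [1] * 8 + [0] * 992})     # realistic production window

train_rate = train["actual_label"].mean()
prod_rate = prod["actual_label"].mean()
print(f"training fraud rate:   {train_rate:.1%}")   # 30.0%
print(f"production fraud rate: {prod_rate:.1%}")    # 0.8%

# If the training fraud rate is far from production's, prefer an initial
# production launch window as the baseline instead of the training set.
if abs(train_rate - prod_rate) > 0.05:
    print("Fraud prevalence differs materially; consider a production baseline.")
```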


Arize sets up monitors across all features, predictions, and actual values. In fraud detection, it's important to monitor your model's:
  1. False Negative Rate — Chargeback % (fraud transactions the model identified as non-fraud, leading to a chargeback and an immediate financial loss for the company).
  2. False Positive Rate — Upset Customer % (non-fraud transactions classified as fraud, leading to an awkward moment at the register and an upset customer).
  3. Accuracy — of all predictions, what percent did the model get right? Be careful with this metric, as it can be misleading when fraud is rare: if only 1% of transactions are fraud, a model that predicts every transaction as non-fraud is still 99% accurate.
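The three metrics above can be computed directly from a confusion matrix. This minimal sketch uses synthetic labels (1 = fraud) to reproduce the 99%-accuracy pitfall: a model that never flags fraud.

```python
import numpy as np

# Hypothetical labels: 1 = fraud, 0 = non-fraud.
# 1% fraud, and the model predicts non-fraud for everything.
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros_like(y_true)

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))

fnr = fn / (fn + tp)            # chargeback rate: missed fraud
fpr = fp / (fp + tn)            # upset-customer rate: false alarms
accuracy = (tp + tn) / len(y_true)

print(f"FNR: {fnr:.0%}, FPR: {fpr:.0%}, accuracy: {accuracy:.0%}")
# Accuracy is 99% even though the model caught zero fraud.
```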

Drift Detection

Drift is a change in distribution over time, measured for model inputs, outputs, and actuals of a model. Measure drift to identify if your models have grown stale, you have data quality issues, or if there are adversarial inputs in your model. Detecting drift in your models will help protect your models from performance degradation and allow you to better understand how to begin resolution.
  • Prediction Drift — An influx of fraud predictions could mean your model is under attack. You are classifying far more fraud than you expect to see in production, but (so far) your model is doing a good job of catching it. Let's hope it stays that way.
  • Actual Drift (No Prediction Drift) — An influx of fraud actuals without a change in the distribution of your predictions means fraudsters have found an exploit in your model and are getting away with it! Troubleshoot and fix your model ASAP to avoid further costly chargebacks.
  • Feature Drift — An influx of new and/or shifting feature values could indicate seasonal changes (tax or holiday season) or, in the worst case, correlate with a fraud exploit. Use drift over time, stacked on top of your metric-over-time graph, to validate whether there is any correlation.
Prediction Drift Impact surfaces when drift has affected your model's performance. Drift (PSI) measures how much your distribution has drifted, and Feature Importance helps explain why even a small Drift (PSI) can have a significant Drift Impact.
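For intuition, here is a minimal Population Stability Index (PSI) implementation over binned numeric features, with synthetic "transaction amount" data. The bin count and the epsilon guard are illustrative assumptions; Arize computes PSI for you in the platform.

```python
import numpy as np

def psi(baseline, current, bins=10, eps=1e-4):
    """Population Stability Index between two samples of a numeric feature."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    b_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    c_pct = np.histogram(current, bins=edges)[0] / len(current)
    b_pct = np.clip(b_pct, eps, None)  # avoid log(0) on empty bins
    c_pct = np.clip(c_pct, eps, None)
    return float(np.sum((c_pct - b_pct) * np.log(c_pct / b_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(100, 15, 10_000)   # e.g. training transaction amounts
stable = rng.normal(100, 15, 10_000)     # production, unchanged distribution
shifted = rng.normal(130, 15, 10_000)    # production after a distribution shift

print(f"stable PSI:  {psi(baseline, stable):.3f}")   # near 0
print(f"shifted PSI: {psi(baseline, shifted):.3f}")  # large, clear drift
```

A common rule of thumb reads PSI below 0.1 as stable and above 0.25 as significant drift.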

Data Quality Checks

It’s important to immediately surface data quality issues to identify how your data quality maps to your model’s performance. Utilize data quality monitoring to analyze hard failures in your data quality pipeline, such as missing data or cardinality shifts.
  • Missing / Null values could be an indicator of issues from an upstream data source.
  • Cardinality is checked to ensure there are no spikes / drops in feature values.
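Both checks above are simple to express over a batch of feature data. This sketch uses a hypothetical DataFrame with made-up feature names (`merchant_type`, `amount`); in practice these checks run continuously on your production stream.

```python
import pandas as pd

# Hypothetical batch of production feature data.
batch = pd.DataFrame({
    "merchant_type": ["grocery", "travel", None, "grocery", "online"],
    "amount": [25.0, None, 310.0, 12.5, 89.9],
})

# Missing / null rate per feature: a spike can point at a broken
# upstream data source.
null_rate = batch.isna().mean()
print(null_rate)

# Cardinality of a categorical feature: sudden spikes or drops in the
# number of distinct values can signal a schema change or a bad join.
cardinality = batch["merchant_type"].nunique(dropna=True)
print(f"merchant_type cardinality: {cardinality}")
```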

Performance Analysis

Model performance metrics measure how well your model performs in production. Once a performance monitor is triggered, navigate to the Performance tab to start troubleshooting your model issues and gain an understanding of what caused the degradation.
Compare production to training or other windows of production. Bring in another dataset to compare performance and see which model performs better. This can help answer questions such as "Were we seeing this problem in training?" or "Does my new / previous model version perform better?". It can also be helpful to compare to other windows of production.
Identify low performing segments. By looking at performance breakdown by feature, you can dig even deeper to see which segment within each feature of the model is underperforming.
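The per-segment breakdown amounts to grouping predictions by a feature and scoring each group. A minimal sketch with a hypothetical `merchant_type` feature:

```python
import pandas as pd

# Hypothetical scored production data joined with actuals (1 = fraud).
df = pd.DataFrame({
    "merchant_type": ["online", "online", "grocery", "grocery", "travel", "travel"],
    "prediction":    [0, 0, 1, 0, 1, 0],
    "actual":        [1, 1, 1, 0, 1, 0],
})

# Accuracy per feature segment surfaces where the model underperforms.
seg_acc = (
    df.assign(correct=df["prediction"] == df["actual"])
      .groupby("merchant_type")["correct"]
      .mean()
      .sort_values()
)
print(seg_acc)  # the "online" segment scores 0% accuracy here
```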

Custom Dashboard

In the case of credit card fraud, the most important metrics to track are your False Positive Rate, False Negative Rate, and overall Accuracy. Note that Recall is the complement of the False Negative Rate (Recall = 1 − FNR) when your positive class is Fraud.
In only a few clicks you can add widgets to provide a single glance view of your model's Accuracy, False Positive Rate, and False Negative Rate. To visualize these metrics over time you can also create a custom timeseries widget which overlays three plots to showcase the fluctuation of these metrics over time.

Business Impact

Sometimes we need metrics beyond traditional statistical measures to define model performance. Business Impact is a way to measure your scored model's payoff at different thresholds (i.e., the decision boundary for a scored model).
When dealing with credit card fraud, the profit/loss associated with model predictions is often not weighted equally. For example, the diagram below might estimate the profit/loss of a decision made by your model.
Visualize the potential profit/loss based on these weighted decision values in Arize's Business Impact tab. Understand your business's overall profit/loss based on your model's prediction threshold for fraud classification.
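The underlying computation can be sketched as sweeping the score threshold and summing a payoff per confusion-matrix outcome. The dollar values and score distributions below are invented for illustration only, not Arize defaults.

```python
import numpy as np

# Hypothetical payoff assumptions per outcome:
#   catching fraud (TP)  -> +$500 avoided chargeback
#   missing fraud (FN)   -> -$500 chargeback
#   false alarm (FP)     -> -$50  upset customer / lost sale
#   correct pass (TN)    -> +$10  transaction profit
PAYOFF = {"tp": 500.0, "fn": -500.0, "fp": -50.0, "tn": 10.0}

rng = np.random.default_rng(1)
y_true = (rng.random(5_000) < 0.01).astype(int)  # ~1% fraud
# Synthetic scores: fraud cases tend to score higher than legitimate ones.
scores = np.where(y_true == 1, rng.beta(5, 2, 5_000), rng.beta(2, 5, 5_000))

def impact(threshold):
    """Estimated dollar payoff of classifying fraud at a given threshold."""
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    return tp * PAYOFF["tp"] + fn * PAYOFF["fn"] + fp * PAYOFF["fp"] + tn * PAYOFF["tn"]

for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"threshold {t:.1f}: estimated payoff ${impact(t):,.0f}")
```

Sweeping the threshold this way shows why the business-optimal cutoff rarely coincides with the default 0.5.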


Log feature importances to the Arize platform to explain your model's predictions. Logging these values lets you view the global feature importances of your predictions and perform global and cohort prediction-based analysis to compare feature importances across your model's features.
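One common convention for the global view is the mean absolute attribution per feature across predictions. The sketch below assumes hypothetical per-prediction attribution values (e.g. SHAP-style) with made-up feature names; the aggregation choice is an assumption, not a description of Arize's internals.

```python
import numpy as np

# Hypothetical per-prediction feature attributions that would be logged
# alongside each prediction (rows = predictions, columns = features).
feature_names = ["amount", "merchant_type", "country", "hour_of_day"]
attributions = np.array([
    [ 0.40, -0.10, 0.05,  0.01],
    [-0.35,  0.20, 0.02, -0.03],
    [ 0.50,  0.05, 0.01,  0.02],
])

# Global importance: mean absolute attribution across predictions.
global_importance = np.abs(attributions).mean(axis=0)
for name, imp in sorted(zip(feature_names, global_importance),
                        key=lambda pair: -pair[1]):
    print(f"{name:15s} {imp:.3f}")
```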

