Overview of how to use Arize for fraud models
In this walkthrough, we will use the Arize platform to monitor a Fraud Detection model's performance.
This use case walks you through managing performance for your fraud detection models. You may have achieved strong performance in the training environment after spending a great deal of time collecting data and training your model, but it's common for a model to perform worse in production. With limited tooling to monitor model performance, identify root causes, or gain insight into how to actively improve your model, troubleshooting performance issues can be difficult.
In this walkthrough, we will look at a few scenarios common to a fraud model when monitoring for performance.
You will learn to:
1. Get training, validation, and production data into the Arize platform
2. Set up performance monitors for False Negative Rate, False Positive Rate, and Accuracy
3. Create customized dashboards
4. Discover the root cause of issues
5. Identify business impact
Set up a baseline for easy comparison of your model's behavior in production to a training set, validation set, or an initial model launch period. Specifically, we want to see when/where our model's behavior has changed drastically. Setting up a baseline will help identify potential model issues, changing user behaviors, or even changes in our model's concepts (what people consider fraud might change over time).
Setting up a baseline will surface changes in the distributions of:
1. Model predictions — the fraud scores/labels produced by the model
2. Model input data — feature values
3. Ground truth/Actual values — was this transaction actually fraud?
In the case of our credit card fraud model, we have a training environment that we want to compare production to. Therefore we are setting up the baseline by selecting pre-prod training.
With fraud detection, your training set often contains an exaggerated proportion of fraud for the purpose of training your model. However, in the real world, credit card fraud models see fraud in fewer than 1% of transactions. If this is true for your training environment, you may want to select an initial model launch period from production as your baseline (assuming that period of time looks as expected).
Arize sets up monitors across all features, predictions, and actual values. In fraud detection it's important to monitor your model's:
1. False Negative Rate — Chargeback % (fraud transactions that the model identified as non-fraud, leading to a chargeback/immediate financial loss for the company).
2. False Positive Rate — Upset Customer % (non-fraud transactions that were classified as fraud, leading to an awkward moment at the register and an upset customer).
3. Accuracy — of all my predictions, what percent did the model predict correctly? We need to be careful with this metric, as it can be misleading when fraud is rare: if only 1% of transactions are fraud, a model that misclassifies every fraud transaction as non-fraud can still achieve 99% accuracy.
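The three metrics above can be computed directly from confusion-matrix counts. Here is a minimal sketch in plain Python (the counts are invented to reproduce the "99% accuracy" trap described above; this is not Arize SDK code):

```python
def fraud_metrics(tp, fp, tn, fn):
    """Compute the three headline fraud-monitoring metrics from
    confusion-matrix counts, treating 'fraud' as the positive class."""
    fnr = fn / (fn + tp)  # fraud missed by the model -> chargebacks
    fpr = fp / (fp + tn)  # legitimate transactions flagged -> upset customers
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return fnr, fpr, accuracy

# The accuracy trap: 1% fraud, model predicts "non-fraud" for everything.
# All 10 fraud cases are missed (fn=10); all 990 legit cases pass (tn=990).
fnr, fpr, acc = fraud_metrics(tp=0, fp=0, tn=990, fn=10)
print(fnr, fpr, acc)  # 1.0, 0.0, 0.99 -- 99% accurate, yet 100% of fraud missed
```

Monitoring FNR and FPR alongside Accuracy is what surfaces this failure mode.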
Drift is a change in distribution over time, measured for model inputs, outputs, and actuals of a model. Measure drift to identify if your models have grown stale, you have data quality issues, or if there are adversarial inputs in your model. Detecting drift in your models will help protect your models from performance degradation and allow you to better understand how to begin resolution.
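One common way to quantify drift between a baseline and a production distribution is the Population Stability Index (PSI). The sketch below is an illustrative stand-alone implementation with invented bin counts, not the exact computation Arize performs:

```python
import math

def psi(baseline_counts, production_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.
    PSI = sum((p_i - q_i) * ln(p_i / q_i)); a rough rule of thumb:
    < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift."""
    b_total = sum(baseline_counts)
    p_total = sum(production_counts)
    score = 0.0
    for b, p in zip(baseline_counts, production_counts):
        p_i = max(p / p_total, eps)  # production share of this bin
        q_i = max(b / b_total, eps)  # baseline share of this bin
        score += (p_i - q_i) * math.log(p_i / q_i)
    return score

# Identical distributions -> PSI of 0; a shifted distribution -> large PSI.
print(psi([50, 30, 20], [50, 30, 20]))  # 0.0
print(psi([50, 30, 20], [20, 30, 50]))  # well above 0.25: significant drift
```

The same calculation applies to model inputs (feature drift), outputs (prediction drift), and actuals (concept drift).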
It’s important to immediately surface data quality issues to identify how your data quality maps to your model’s performance. Utilize data quality monitoring to analyze hard failures in your data quality pipeline, such as missing data or cardinality shifts.
- Missing / Null values could be an indicator of issues from an upstream data source.
- Cardinality is checked to ensure there are no spikes / drops in feature values.
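The two checks above can be sketched as a simple per-feature report. The feature names, thresholds, and rows below are illustrative assumptions, not Arize defaults:

```python
def data_quality_report(rows, cardinality_limits):
    """Flag per-feature null rates and cardinality breaches.
    `rows` is a list of dicts (one per prediction event);
    `cardinality_limits` maps feature name -> max expected distinct values."""
    report = {}
    for feat, limit in cardinality_limits.items():
        values = [r.get(feat) for r in rows]
        non_null = [v for v in values if v is not None]
        null_rate = 1 - len(non_null) / len(values)
        cardinality = len(set(non_null))
        report[feat] = {
            "null_rate": null_rate,
            "cardinality": cardinality,
            # 5% null threshold chosen for illustration only
            "alert": null_rate > 0.05 or cardinality > limit,
        }
    return report

rows = [
    {"merchant_type": "grocery", "amount": 12.5},
    {"merchant_type": "travel", "amount": None},  # missing upstream data?
    {"merchant_type": "grocery", "amount": 8.0},
]
report = data_quality_report(rows, {"merchant_type": 10, "amount": 1000})
print(report["amount"]["alert"])  # True: ~33% nulls exceeds the 5% threshold
```

A spike in either signal often points at an upstream pipeline change rather than the model itself.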
Model performance metrics measure how well your model performs in production. Once a performance monitor is triggered, navigate to the Performance tab to start troubleshooting your model issues and gain an understanding of what caused the degradation.
Compare production to training or other windows of production. Bring in another dataset to compare performance and see which model performs better. This can help answer questions such as "Were we seeing this problem in training?" or "Does my new / previous model version perform better?". It can also be helpful to compare to other windows of production.
Identify low performing segments. By looking at performance breakdown by feature, you can dig even deeper to see which segment within each feature of the model is underperforming.
In the case of credit card fraud, the most important metrics to worry about are your False Positive Rate, False Negative Rate, and overall Accuracy. Note that Recall is the complement of the False Negative Rate (Recall = 1 − FNR) when your positive class is Fraud.
In only a few clicks you can add widgets to provide a single glance view of your model's Accuracy, False Positive Rate, and False Negative Rate. To visualize these metrics over time you can also create a custom timeseries widget which overlays three plots to showcase the fluctuation of these metrics over time.
Sometimes, we need metrics other than traditional statistical measures to define model performance. Business Impact is a way to measure your scored model's payoff at different thresholds (i.e., the decision boundary for a scored model).
When dealing with credit card fraud, the profit/loss associated with model predictions is often not weighted equally. For example, the diagram below might estimate the profit/loss of a decision made by your model.
Visualize the potential profit/loss based on these weighted decision values in Arize's Business Impact tab. Understand your business's overall profit/loss based on your model's prediction threshold for fraud classification.
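The payoff sweep can be sketched as follows. The dollar values per outcome are hypothetical placeholders (each business supplies its own), and this is a stand-alone illustration of the idea rather than Arize's implementation:

```python
def expected_payoff(scores_with_labels, threshold, payoffs):
    """Total payoff of a scored fraud model at a given decision threshold.
    `scores_with_labels`: (fraud_score, is_fraud) pairs.
    `payoffs`: per-outcome dollar values keyed by tp/fp/tn/fn."""
    total = 0.0
    for score, is_fraud in scores_with_labels:
        flagged = score >= threshold
        if flagged and is_fraud:
            total += payoffs["tp"]   # blocked fraud: loss avoided
        elif flagged and not is_fraud:
            total += payoffs["fp"]   # false alarm: upset customer
        elif not flagged and is_fraud:
            total += payoffs["fn"]   # missed fraud: chargeback
        else:
            total += payoffs["tn"]   # normal approved transaction
    return total

# Hypothetical payoffs: a missed fraud costs far more than a false alarm.
payoffs = {"tp": 0.0, "fp": -5.0, "fn": -100.0, "tn": 1.0}
data = [(0.9, True), (0.7, False), (0.2, False), (0.1, False), (0.8, True)]
# Sweep thresholds to find where overall payoff peaks.
for t in (0.5, 0.75, 0.95):
    print(t, expected_payoff(data, t, payoffs))
```

On this toy data the payoff peaks at the middle threshold: 0.5 flags a legitimate transaction, while 0.95 lets both fraud cases through, and the asymmetric costs make the latter far worse.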
Log feature importances to the Arize platform to explain your model's predictions. By logging these values, you gain the ability to view the global feature importances of your predictions as well as the ability to perform global and cohort prediction-based analysis to compare feature importances for your model's features.
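Global feature importance is conventionally derived from per-prediction importances (e.g. SHAP-style values) by averaging their absolute values. A small sketch of that aggregation, with invented feature names and values (this is the general convention, not Arize's internal computation):

```python
def global_importance(per_prediction_importances):
    """Aggregate per-prediction feature importances into a global ranking
    by averaging absolute values, then sort from most to least important."""
    totals, counts = {}, {}
    for record in per_prediction_importances:
        for feature, value in record.items():
            totals[feature] = totals.get(feature, 0.0) + abs(value)
            counts[feature] = counts.get(feature, 0) + 1
    means = {f: totals[f] / counts[f] for f in totals}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

records = [
    {"transaction_amount": 0.40, "merchant_risk": -0.10, "hour_of_day": 0.02},
    {"transaction_amount": -0.30, "merchant_risk": 0.20, "hour_of_day": 0.01},
]
ranking = global_importance(records)
print(ranking)  # transaction_amount ranks first: mean |importance| = 0.35
```

Cohort-level analysis is the same aggregation restricted to the subset of predictions in a cohort.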