Overview of how to use Arize for lending models
In this walkthrough, we will use the Arize platform to monitor a Lending Prediction model's performance.
This use case will walk you through how to manage performance for your lending models. While you have spent a great deal of time collecting online data and training models for their best performance in the training environment, it's common that in production you have a limited number of tools available to monitor model performance, identify root-cause issues, or gain insights into how to actively improve your model.
In this walkthrough, we will look at a few scenarios common to lending models when monitoring for performance.
You will learn to:
1. Get training, validation, and production data into the Arize platform
2. Create a baseline
3. Set up performance monitors for False Negative Rate, False Positive Rate, Mean Error, and Accuracy
4. Set up customized dashboards
5. Discover the root cause of issues
6. Identify business impact
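Before sending data to Arize, each environment's records are typically assembled into a table with one row per prediction: an ID, a timestamp, feature values, the model's prediction, and (once known) the actual outcome. A minimal sketch with pandas is below; the column and feature names are hypothetical, and the exact `Schema`/`Client` signatures for the logging call itself should be taken from the Arize SDK documentation rather than from this example.

```python
import uuid
from datetime import datetime, timezone

import pandas as pd

# One row per prediction: ID, timestamp, features, prediction, actual.
# Feature and column names here are illustrative, not required by Arize.
records = [
    {"fico_score": 720, "loan_amount": 12000, "purpose": "home"},
    {"fico_score": 610, "loan_amount": 30000, "purpose": "credit_card"},
]
predictions = ["approved", "denied"]
actuals = ["paid", "default"]

df = pd.DataFrame(records)
df["prediction_id"] = [str(uuid.uuid4()) for _ in records]
df["prediction_ts"] = datetime.now(timezone.utc)
df["prediction_label"] = predictions
df["actual_label"] = actuals
print(df.head())
```

The same table shape works for training, validation, and production data; only the environment you log it under changes.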
Set up a baseline for easy comparison of your model's behavior in production to a training set, validation set, or an initial model launch period. Specifically, we want to see when/where our model's behavior has changed drastically. Setting up a baseline will help identify potential model issues, changing user behaviors, or even changes in our model's concepts (what people consider fraud might change over time).
Setting up a baseline will surface changes in the distributions of:
1. Model predictions — the model's output (e.g. approve or deny)
2. Model input data — feature values
3. Ground truth/actual values — e.g. whether the loan was actually paid off or defaulted on
In the case of our lending model, we have a training environment that we want to compare production to. Therefore we are setting up the baseline by selecting pre-prod training.
With lending decisions, your training set may often contain a disproportionate number of defaults for the purpose of training your model. However, in the real world, loans are much more likely to be paid off than defaulted on. If this is the case for your training environment, you may want to select an initial model launch period from production as your baseline (assuming that period of time looks as expected).
Training Version 1.0
- Default Metric: Accuracy
- Trigger Alert When: Accuracy is below 0.6
- Positive Class:
- Turn On Monitoring: Drift ✅, Data Quality ✅, Performance ✅
Arize sets up monitors across all features, predictions, and actual values. In lending it's important to monitor your model's:
1. False Negative Rate — Chargeback % (`denied` loans that the model identified as `approved`, leading to a chargeback and an immediate financial loss for the company).
2. False Positive Rate — Upset Customer % (`approved` loans that the model classified as `denied`, leading to an awkward moment at the register and an upset customer).
3. Accuracy — of all predictions, what percent did the model get right? Be careful with this metric, as it can be misleading when defaults are rare: if only 5% of a dataset is labeled `deny`, a model that misclassifies every `deny` transaction can still reach 95% accuracy.
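The three rates above all fall out of a confusion matrix. A minimal sketch in plain Python, assuming the positive class is a defaulted loan; the function and label names are illustrative, not Arize APIs:

```python
# Confusion-matrix rates for a binary lending model.
# Assumed convention: the positive class is "default".
def lending_rates(y_true, y_pred, positive="default"):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,  # upset-customer %
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,  # chargeback %
    }

# The accuracy trap from the text: a model that predicts "paid" for
# everything still scores 95% accuracy on a 5%-default dataset,
# while its false negative rate is 100%.
y_true = ["default"] * 5 + ["paid"] * 95
y_pred = ["paid"] * 100
print(lending_rates(y_true, y_pred))
# accuracy = 0.95, false_negative_rate = 1.0
```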
In just a few clicks, Arize automatically configures monitors that are best suited to your data to proactively detect drift, data quality, and performance issues.
Drift is a change in distribution over time, measured for model inputs, outputs, and actuals of a model. Measure drift to identify if your models have grown stale, you have data quality issues, or if there are adversarial inputs in your model. Detecting drift in your models will help protect your models from performance degradation and allow you to better understand how to begin resolution.
Prediction Drift Impact can surface when drift has impacted your model. Drift (PSI) measures how much your distribution has drifted. Lastly, Feature Importance helps explain why even a small Drift (PSI) can have a significant Drift Impact.
Visualize feature and model drift between various model environments and versions to identify loan defaulting patterns and anomalous distribution behavior. Arize provides drift over time widgets overlaid with your metric of choice (in our case, Accuracy) to clearly determine if drift is contributing to our performance degradation.
Here we see two important features (including `purpose`) drifting, which likely means data drift is causing performance degradation. In addition to the baseline and current distributions diverging from each other, we also see input values (`credit_card` in the feature `purpose`) that appear only in the production data and not in the baseline dataset. In this case, where the baseline is your training dataset, you should retrain your model with the new data.
With the insights Arize provides, you can dive into root causes and quickly build intuition, allowing ML teams to iterate, experiment, and ship new models to production faster.
It’s important to immediately surface data quality issues to identify how your data quality maps to your model’s performance. Utilize data quality monitoring to analyze hard failures in your data quality pipeline, such as missing data or cardinality shifts.
- Missing / Null values could be an indicator of issues from an upstream data source.
- Cardinality and Quantiles are checked to ensure there are no spikes / drops in feature values.
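The two checks above can be sketched as simple pandas assertions over a prediction table. The column names and cardinality thresholds here are hypothetical; this only illustrates the kind of hard failure a data-quality monitor flags, not how Arize implements it:

```python
import pandas as pd

def data_quality_checks(df, cardinality_limits):
    """Hard data-quality checks: nulls and cardinality spikes.

    `cardinality_limits` maps a categorical column to the maximum
    number of distinct values expected (hypothetical thresholds).
    """
    issues = []
    # Missing / null values may indicate an upstream data-source failure.
    for col in df.columns:
        null_pct = df[col].isna().mean()
        if null_pct > 0:
            issues.append(f"{col}: {null_pct:.0%} null values")
    # A cardinality spike can signal new, unseen categories in production.
    for col, limit in cardinality_limits.items():
        n = df[col].nunique()
        if n > limit:
            issues.append(f"{col}: cardinality {n} exceeds limit {limit}")
    return issues

df = pd.DataFrame({
    "fico_score": [720, None, 680, 700],
    "purpose": ["home", "car", "credit_card", "debt"],
})
print(data_quality_checks(df, {"purpose": 3}))
```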
Model performance metrics measure how well your model performs in production. Once a performance monitor is triggered, navigate to the Performance tab to start troubleshooting your model issues and gain an understanding of what caused the degradation.
Compare production to training or other windows of production. Bring in another dataset to compare performance and see which model performs better. This can help answer questions such as "Were we seeing this problem in training?" or "Does my new / previous model version perform better?". It can also be helpful to compare to other windows of production.
Identify low performing segments. By looking at performance breakdown by feature, you can dig even deeper to see which segment within each feature of the model is underperforming.
In the case of lending, the most important metrics to watch are your False Positive Rate, False Negative Rate, and overall Accuracy. Note that Recall is the complement of the False Negative Rate (Recall = 1 − FNR) when your positive class is Defaulted.
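The Recall/FNR relationship is easy to verify directly: both are computed from the same positive-class denominator. A tiny check on hypothetical labels, with "default" as the positive class:

```python
# Recall and False Negative Rate are complements: Recall = 1 - FNR.
# Toy labels (hypothetical data), positive class = "default".
y_true = ["default", "default", "default", "paid", "paid"]
y_pred = ["default", "paid", "default", "paid", "paid"]

pos = "default"
tp = sum(t == pos and p == pos for t, p in zip(y_true, y_pred))
fn = sum(t == pos and p != pos for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)   # share of actual defaults the model caught
fnr = fn / (tp + fn)      # share of actual defaults the model missed
assert abs(recall + fnr - 1.0) < 1e-12
print(recall, fnr)
```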
In only a few clicks you can add widgets to provide a single glance view of your model's Accuracy, False Positive Rate, and False Negative Rate. To visualize these metrics over time you can also create a custom timeseries widget which overlays three plots to showcase the fluctuation of these metrics over time.
Log feature importances to the Arize platform to explain your model's predictions. Once logged, you can view the global feature importances of your predictions and perform global and cohort-level prediction-based analyses to compare importances across your model's features.