Overview of how to use Arize for click-through rate models
In this walkthrough, we will use the Arize platform to monitor click-through rate (CTR) model performance for an online advertising platform. You have likely spent a great deal of time collecting online data and training models for best performance, yet once those models are in production it is common to have no tools for monitoring their performance, identifying issues, or getting insights into how to improve them. We will look at a few scenarios common to an advertising use case and, more specifically, at CTR predictions versus actuals for a given ad or ad group. You will learn to:
1. Get training, validation, and production data into the Arize platform
2. Set up a baseline and performance dashboards
3. Create threshold alerts
4. Monitor for Log-Loss
5. Understand where the model is underperforming
6. Discover the root cause of issues
Within Arize, you can set a baseline for the model, which compares the model's behavior in production to a training set, a validation set, or an initial model launch period. This lets you determine when and where the model's behavior has changed drastically.
Training Version 1.0
- Default Metric: Accuracy
- Trigger Alert When: Accuracy is below 0.7
- Positive Class:
- Turn On Monitoring: Drift ✅, Data Quality ✅, Performance ✅
Arize sets up monitors across all features, predictions, and actual values. For click-through rate, it's important to monitor the model's Accuracy.
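The threshold alert configured above can be pictured as a simple check over a window of production data. The following is a minimal sketch in plain Python, not Arize's implementation: the function names and the sample labels are hypothetical, and only the 0.7 threshold comes from the settings above.

```python
def accuracy(predicted_labels, actual_labels):
    """Fraction of predictions whose label matches the actual label."""
    if len(predicted_labels) != len(actual_labels):
        raise ValueError("predictions and actuals must have the same length")
    if not predicted_labels:
        raise ValueError("need at least one prediction")
    correct = sum(p == a for p, a in zip(predicted_labels, actual_labels))
    return correct / len(predicted_labels)


def should_alert(metric_value, threshold=0.7):
    """Return True when the metric falls below the alert threshold."""
    return metric_value < threshold


# Hypothetical window of production data: predicted vs. actual outcomes.
predicted = ["click", "click", "no_click", "click", "no_click"]
actual = ["click", "no_click", "no_click", "no_click", "no_click"]

window_accuracy = accuracy(predicted, actual)
print(f"accuracy={window_accuracy:.2f} alert={should_alert(window_accuracy)}")
```

In this sample window three of five predictions match the actuals, so the accuracy falls below the threshold and the check reports an alert.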
Drift is a change in distribution over time, measured for a model's inputs, outputs, and actuals. Measure drift to identify whether your models have grown stale, whether you have data quality issues, or whether there are adversarial inputs to your model. Detecting drift will help protect your models from performance degradation and give you a better understanding of where to begin resolution.
Prediction Drift Impact can surface when drift has impacted your model. Drift (PSI, the population stability index) is a measurement of how much your distribution has drifted. Lastly, Feature Importance helps you explain why even small Drift (PSI) can have significant Drift Impact.
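To make the PSI measurement concrete, here is a minimal sketch of the standard population stability index formula for a categorical feature, comparing a baseline distribution with a production distribution. It is an illustration under assumptions, not Arize's internal computation: the function name, the `epsilon` smoothing for categories absent from one side, and the sample device values are all hypothetical.

```python
import math
from collections import Counter


def psi(baseline_values, production_values, epsilon=1e-6):
    """Population stability index between two categorical samples.

    PSI = sum over categories of (p - q) * ln(p / q), where q is the
    baseline share and p is the production share of each category.
    Shares of zero are replaced by `epsilon` so the logarithm is defined.
    """
    if not baseline_values or not production_values:
        raise ValueError("both samples must be non-empty")
    base_counts = Counter(baseline_values)
    prod_counts = Counter(production_values)
    categories = sorted(set(base_counts) | set(prod_counts))
    total = 0.0
    for category in categories:
        q = max(base_counts[category] / len(baseline_values), epsilon)
        p = max(prod_counts[category] / len(production_values), epsilon)
        total += (p - q) * math.log(p / q)
    return total


# Hypothetical device feature: production contains a category ("tablet")
# that never appeared in the baseline.
baseline_device = ["desktop"] * 50 + ["mobile"] * 50
production_device = ["desktop"] * 50 + ["mobile"] * 30 + ["tablet"] * 20

drift_score = psi(baseline_device, production_device)
print(f"PSI for device: {drift_score:.3f}")
```

Identical distributions give a PSI of zero, and a category that is new in production contributes a large term, which is why unseen values show up so clearly in drift monitors.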
It’s important to immediately surface data quality issues to identify how your data quality maps to your model’s performance. Utilize data quality monitoring to analyze hard failures in your data quality pipeline, such as missing data or cardinality shifts.
- Missing / Null values could be an indicator of issues from an upstream data source.
- Cardinality is checked to ensure there are no spikes / drops in feature values.
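The two checks listed above can be sketched in a few lines of plain Python. This is an illustrative approximation rather than how Arize computes its data quality metrics: the helper names, the choice to treat `None` and empty strings as missing, and the sample domain values are assumptions.

```python
def is_missing(value):
    """Treat None and empty strings as missing (an assumption of this sketch)."""
    return value is None or value == ""


def missing_rate(values):
    """Share of values that are missing."""
    if not values:
        return 0.0
    return sum(is_missing(v) for v in values) / len(values)


def cardinality(values):
    """Number of distinct non-missing values."""
    return len({v for v in values if not is_missing(v)})


def cardinality_shift(baseline_values, production_values):
    """Relative change in cardinality from baseline to production."""
    base = cardinality(baseline_values)
    if base == 0:
        raise ValueError("baseline has no non-missing values")
    return (cardinality(production_values) - base) / base


# Hypothetical domain feature values.
baseline_domain = ["a.com", "b.com", "a.com", "c.com"]
production_domain = ["a.com", None, "", "d.com", "e.com", "f.com", "b.com", "c.com"]

print("missing rate:", missing_rate(production_domain))
print("cardinality shift:", cardinality_shift(baseline_domain, production_domain))
```

A rising missing rate points at an upstream source problem, while a large positive cardinality shift suggests that new feature values have started to arrive.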
Model performance metrics measure how well your model performs in production. Once a performance monitor is triggered, navigate to the Performance tab to start troubleshooting your model issues and gain an understanding of what caused the degradation.
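Log loss, listed in the overview as a metric to monitor, is one such performance metric for a model that outputs click probabilities. The sketch below uses the standard binary log loss formula. The function name, the clipping constant, and the sample values are hypothetical and are not taken from the Arize platform.

```python
import math


def log_loss(actuals, predicted_probs, epsilon=1e-15):
    """Mean binary cross-entropy between actual labels (0 or 1) and
    predicted click probabilities. Probabilities are clipped to
    [epsilon, 1 - epsilon] so the logarithm stays finite."""
    if len(actuals) != len(predicted_probs):
        raise ValueError("actuals and predictions must have the same length")
    if not actuals:
        raise ValueError("need at least one prediction")
    total = 0.0
    for y, p in zip(actuals, predicted_probs):
        p = min(max(p, epsilon), 1 - epsilon)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(actuals)


# Hypothetical example: the same actuals scored by a well-calibrated model
# and by a model that is confidently predicting clicks that do not happen.
actual_clicks = [1, 0, 0, 0]
calibrated = [0.8, 0.1, 0.2, 0.1]
overconfident = [0.8, 0.9, 0.9, 0.9]

print("calibrated:", round(log_loss(actual_clicks, calibrated), 3))
print("overconfident:", round(log_loss(actual_clicks, overconfident), 3))
```

Log loss penalizes confident wrong predictions heavily, so a model that over-predicts clicks shows a much higher value than a calibrated one even before its accuracy visibly drops.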
Compare production to training or to other windows of production. Bring in another dataset to compare performance and see which model performs better. This can help answer questions such as "Were we seeing this problem in training?" or "Does my new or previous model version perform better?"
Identify low-performing segments. By breaking performance down by feature, you can dig deeper to see which segment within each feature of the model is underperforming.
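A performance breakdown by feature amounts to grouping predictions by each feature value and computing the metric per group. The following minimal sketch shows the idea with accuracy; the row layout, the field names, and the sample data are hypothetical, and this is not the platform's own implementation.

```python
from collections import defaultdict


def accuracy_by_segment(rows, feature):
    """Accuracy for each distinct value of `feature`.

    Each row is a dict holding the feature value plus "predicted" and
    "actual" labels.
    """
    counts = defaultdict(lambda: [0, 0])  # segment -> [correct, total]
    for row in rows:
        segment = row[feature]
        counts[segment][0] += int(row["predicted"] == row["actual"])
        counts[segment][1] += 1
    return {segment: correct / total for segment, (correct, total) in counts.items()}


# Hypothetical production rows.
rows = [
    {"device": "desktop", "predicted": 1, "actual": 1},
    {"device": "desktop", "predicted": 0, "actual": 0},
    {"device": "desktop", "predicted": 0, "actual": 0},
    {"device": "mobile", "predicted": 1, "actual": 1},
    {"device": "mobile", "predicted": 1, "actual": 0},
    {"device": "tablet", "predicted": 1, "actual": 0},
    {"device": "tablet", "predicted": 1, "actual": 0},
]

by_device = accuracy_by_segment(rows, "device")
worst_segment = min(by_device, key=by_device.get)
print(by_device, "worst:", worst_segment)
```

Sorting segments by the metric surfaces the cohorts worth investigating first.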
Root Cause Analysis Walkthrough
We can see that our type I error (false positive error) has significantly increased. Our model is predicting many more clicks than actually occur: it expects a large number of users to click on a given ad when in fact they do not. To see which cohorts are performing worst, we can peek at the dual histograms. From there we can see large deviations in the device and domain features.
It seems that our performance degradation is due to unseen populations in the device and domain categories. This is an indication that we should dig deeper into these cohorts and better understand how we want to handle these never-before-seen populations in the model.
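For readers who want to quantify the type I error, the false positive rate is the share of actual non-clicks that the model predicted as clicks. This is a minimal sketch with hypothetical function names and sample labels, where 1 stands for a click.

```python
def confusion_counts(predicted, actual, positive=1):
    """Count true/false positives and negatives for a binary outcome."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, a in zip(predicted, actual):
        if p == positive and a == positive:
            counts["tp"] += 1
        elif p == positive:
            counts["fp"] += 1  # predicted a click that did not happen
        elif a == positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def false_positive_rate(counts):
    """FP / (FP + TN); returns 0.0 when there are no actual negatives."""
    negatives = counts["fp"] + counts["tn"]
    return counts["fp"] / negatives if negatives else 0.0


# Hypothetical window in which the model over-predicts clicks.
predicted = [1, 1, 1, 1, 0, 0, 1, 0]
actual = [1, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(predicted, actual)
print(counts, "FPR:", round(false_positive_rate(counts), 3))
```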
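One way to confirm that a feature contains unseen populations is to list the production values that never occurred in training, together with the share of production traffic they represent. The sketch below assumes categorical features held in plain lists; the function names and the sample values are hypothetical.

```python
from collections import Counter


def unseen_values(training_values, production_values):
    """Production values that never appeared in training, with their counts."""
    seen = set(training_values)
    return dict(Counter(v for v in production_values if v not in seen))


def unseen_share(training_values, production_values):
    """Fraction of production rows whose value was never seen in training."""
    if not production_values:
        return 0.0
    unseen = unseen_values(training_values, production_values)
    return sum(unseen.values()) / len(production_values)


# Hypothetical device and domain features.
training_device = ["desktop", "mobile", "desktop"]
production_device = ["desktop", "tablet", "smart_tv", "tablet", "mobile"]

training_domain = ["news.com", "shop.com"]
production_domain = ["news.com", "new_site.com", "new_site.com", "shop.com"]

print(unseen_values(training_device, production_device),
      unseen_share(training_device, production_device))
print(unseen_values(training_domain, production_domain),
      unseen_share(training_domain, production_domain))
```

A high unseen share for a feature is a strong hint that the model is being asked to score populations it was never trained on.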
Now that we understand what is affecting our model, we can:
- Retrain the model on these brand-new cohorts inside the device and domain features
- Handle the empty values in our data pipelines that are affecting our data quality
In only a few clicks, you can add widgets that provide a single-glance view of your model's important metrics and KPIs. To visualize these metrics over time, you can also create a custom time series widget that overlays three plots to show how the metrics fluctuate.
Below we'll set up a templatized dashboard and then adjust the template to match our use case.
Now let's make a simple customization to our template. You can refine the Prediction Score vs Actual Score by Day graph by adding a similar plot with these filters:
- Pred Shopping: Set Aggregation Function to Average, with Average of set to Prediction Score, and add the filter (feature category = [shopping]). Also add the filter (feature domain != [new_site.com])
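The filtered plot described above is equivalent to averaging the prediction score per day over the rows that pass both filters. The sketch below reproduces that aggregation in plain Python so the expected numbers can be checked by hand; the row layout, field names, dates, and scores are hypothetical, and only the two filter conditions come from the widget settings.

```python
from collections import defaultdict
from statistics import mean


def average_by_day(rows, value_key, predicate=lambda row: True):
    """Average of `value_key` per day over the rows accepted by `predicate`."""
    grouped = defaultdict(list)
    for row in rows:
        if predicate(row):
            grouped[row["day"]].append(row[value_key])
    return {day: mean(values) for day, values in sorted(grouped.items())}


def pred_shopping_filter(row):
    """category = shopping AND domain != new_site.com."""
    return row["category"] == "shopping" and row["domain"] != "new_site.com"


# Hypothetical production rows.
rows = [
    {"day": "2024-01-01", "category": "shopping", "domain": "shop.com", "prediction_score": 0.2},
    {"day": "2024-01-01", "category": "shopping", "domain": "new_site.com", "prediction_score": 0.9},
    {"day": "2024-01-01", "category": "news", "domain": "news.com", "prediction_score": 0.5},
    {"day": "2024-01-01", "category": "shopping", "domain": "deals.com", "prediction_score": 0.4},
    {"day": "2024-01-02", "category": "shopping", "domain": "shop.com", "prediction_score": 0.6},
    {"day": "2024-01-02", "category": "shopping", "domain": "new_site.com", "prediction_score": 0.8},
]

pred_shopping = average_by_day(rows, "prediction_score", pred_shopping_filter)
unfiltered = average_by_day(rows, "prediction_score")
print(pred_shopping)
print(unfiltered)
```

Comparing the filtered and unfiltered series shows how much the excluded domain pulls the average prediction score upward.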
Log feature importances to the Arize platform to explain your model's predictions. By logging these values, you can view the global feature importances of your predictions and perform global and cohort prediction-based analysis to compare feature importances across your model's features.
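As an illustration of global versus cohort analysis, a common way to summarize per-prediction importance values (for example SHAP values) is to average their absolute values per feature, either over all predictions or over a filtered cohort. Whether the platform aggregates in exactly this way is an assumption of this sketch, and the function name, row layout, and sample values are hypothetical.

```python
def mean_abs_importance(rows, feature_names, predicate=lambda row: True):
    """Mean absolute importance per feature over the rows accepted by `predicate`.

    Each row carries an "importances" dict mapping feature name to a
    signed per-prediction importance value.
    """
    selected = [row for row in rows if predicate(row)]
    if not selected:
        raise ValueError("no rows match the cohort filter")
    return {
        name: sum(abs(row["importances"][name]) for row in selected) / len(selected)
        for name in feature_names
    }


# Hypothetical per-prediction importance values.
rows = [
    {"device": "mobile", "importances": {"device": 0.30, "domain": -0.10, "category": 0.05}},
    {"device": "desktop", "importances": {"device": -0.10, "domain": 0.20, "category": 0.05}},
    {"device": "mobile", "importances": {"device": 0.50, "domain": -0.30, "category": -0.05}},
]
features = ["device", "domain", "category"]

global_importance = mean_abs_importance(rows, features)
mobile_importance = mean_abs_importance(rows, features, lambda row: row["device"] == "mobile")
print("global:", global_importance)
print("mobile cohort:", mobile_importance)
```

Comparing the cohort view with the global view shows which features matter more for a particular population than for the model as a whole.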