Arize AI
Recommendation System
Overview of how to use Arize for recommendation system models

Check out our Recommendation System Use Case Colab for an interactive demo!


Monitor and investigate the performance of a Recommendation System Engine recently pushed into production.
In this use case, you will take on the persona of a machine learning engineer for a premium music service. After spending a great deal of time collecting customer data and training and testing various models, your team has built an ML-powered recommendation engine that gives listeners personalized playlist recommendations based on their most-listened-to songs on their sound cloud. After you push the model into production, something goes wrong. You realize that you have no tools available to monitor model performance, identify the root cause of issues, or gain insight into how to improve the model when things inevitably go wrong. Now you need to learn how to use an ML observability tool.
You will learn to:
  1. Get training, validation, and production data into the Arize platform
  2. Set up performance monitors and dashboards with a baseline
  3. Understand where the model is underperforming
  4. Discover the root cause of issues
  5. Identify business impact

Set up a Baseline

Set up a baseline to compare your model's behavior in production against a training set, a validation set, or an initial model launch period. Specifically, we want to see when and where the model's behavior has changed drastically. A baseline helps identify potential model issues, changing user behaviors, or even changes in the concepts the model has learned (for example, what listeners want to hear might change over time).
Setting up a baseline will surface changes in the distributions of:
  1. Model predictions: played / skipped
  2. Model input data: feature values
  3. Ground truth/Actual values: was this song actually played

Choosing a baseline

In the case of our recommendation system, we have a validation environment that we want to compare with our production environment. Therefore we are setting up the baseline by selecting pre-prod validation.
We may also want to set the baseline to be a window on the production data in order to compare song recommendations weekly, annually and/or seasonally.


First Choose an Evaluation Metric

The inferences for this use case are the probabilities the user will play the recommended song. Once the recommendations are made for each user, the ground truth collected is an indicator as to whether or not the user listened to the song (or skipped the song) — represented as 0 (skipped) or 1 (played).
For example:
Recommended Song | Actual Score
Livin' On A Prayer BY Bon Jovi | User skipped song
… | User listened to song
Paradise City BY Guns N' Roses | User skipped song
For a recommendation system, we will use precision as the default metric for monitoring model performance, since precision is about retrieving the best items for the user (assuming that there are more useful items available).
Precision is the fraction of selected items that are relevant. Suppose our recommender system selects 3 items to recommend to a user, 2 of which are relevant; precision is then 2/3, or about 67%.
See equation:
Precision = \frac{TP}{TP+FP}
TP = song recommended + song played
FP = song recommended + song not played
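
The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than Arize's implementation; the function name and the 0/1 encoding of "recommended" and "played" are assumptions made for the example.

```python
def precision(recommended, played):
    """Precision = TP / (TP + FP) over parallel lists of 0/1 values.

    recommended[i] == 1 means the song was recommended,
    played[i] == 1 means the user actually played it.
    """
    pairs = list(zip(recommended, played))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # recommended and played
    fp = sum(1 for r, p in pairs if r == 1 and p == 0)  # recommended and not played
    if tp + fp == 0:
        return None  # nothing was recommended, so precision is undefined
    return tp / (tp + fp)


# Three songs recommended, two of them played (the example from the text)
print(round(precision([1, 1, 1], [1, 0, 1]), 2))  # 0.67
```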

Configure Monitors

In just a few clicks, Arize automatically configures monitors that are best suited to your data to proactively detect drift, data quality, and performance issues.
  1. Default Metric: Precision, Trigger Alert When: Precision is < 0.6, Positive Class: Played
  2. Turn On Monitoring: Drift ✅, Data Quality ✅, Performance ✅
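
To make the alert rule concrete, here is a rough sketch of what a "precision below 0.6" check amounts to for one window of predictions. The `MONITOR` dictionary and both function names are hypothetical and exist only for this illustration; they are not part of the Arize API, where the monitor is configured in the UI.

```python
# Hypothetical settings mirroring the values above; not an Arize API object.
MONITOR = {"metric": "precision", "alert_below": 0.6, "positive_class": "played"}


def precision_for(predictions, actuals, positive_class):
    """Share of positive-class predictions whose actual label is also positive."""
    outcomes = [a for p, a in zip(predictions, actuals) if p == positive_class]
    if not outcomes:
        return None  # no positive predictions in this window
    hits = sum(1 for a in outcomes if a == positive_class)
    return hits / len(outcomes)


def should_alert(predictions, actuals, monitor=MONITOR):
    """True when precision for the window falls below the configured threshold."""
    value = precision_for(predictions, actuals, monitor["positive_class"])
    return value is not None and value < monitor["alert_below"]


window_predictions = ["played", "played", "played", "played", "played"]
window_actuals = ["played", "skipped", "skipped", "played", "skipped"]
print(should_alert(window_predictions, window_actuals))  # True, precision is 2/5
```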

Drift Detection

Identify Model, Feature, and Actual Drift

Visualize feature, model, and actual drift between various model environments and versions to identify changing listener behavior, data quality issues, and anomalous distribution behavior.
For our recommendation system, anomalous distribution changes could be signs of issues depending on where they are happening. Arize provides drift over time widgets overlaid with your metric of choice (in our case, precision) to clearly determine if drift is contributing to our performance degradation.
During initial model setup, Arize automatically created a set of drift monitors and dashboards for each feature available in the dataset. Drift is measured as the Population Stability Index (PSI) over a given period, which is essentially a symmetric KL divergence. We will use these graphs to monitor overall trends.
From the model overview page, select the Affiliate Provider feature, and you should notice the change in its distribution. Use the PSI graph to select a period of interest. We see that the share of direct links has decreased in production compared with what was expected, and that the model has seen a new input. Notice that when the new affiliate provider facebook appears, the default threshold set by Arize is crossed. This is because the feature drift is caused by inputs that were not present in the training baseline. At this point, it might be a good time to retrain your model on the new input.
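
As a rough illustration of why a brand-new category pushes PSI up, the sketch below computes PSI for a categorical feature from raw counts. The counts, the "google" category, and the small `eps` floor for categories missing from one side are assumptions made for the example; how Arize bins values and handles empty categories may differ.

```python
import math


def psi(baseline_counts, production_counts, eps=1e-6):
    """Population Stability Index between two categorical distributions.

    PSI = sum over categories of (p - b) * ln(p / b), where b and p are the
    baseline and production proportions. This equals KL(p||b) + KL(b||p),
    which is why PSI is described as a symmetric KL divergence.
    """
    categories = set(baseline_counts) | set(production_counts)
    baseline_total = sum(baseline_counts.values())
    production_total = sum(production_counts.values())
    total = 0.0
    for category in categories:
        # Floor each proportion at eps so unseen categories do not divide by zero
        b = max(baseline_counts.get(category, 0) / baseline_total, eps)
        p = max(production_counts.get(category, 0) / production_total, eps)
        total += (p - b) * math.log(p / b)
    return total


# Made-up counts: fewer direct links in production, plus a new "facebook" value
baseline = {"direct": 700, "google": 300}
production = {"direct": 500, "google": 300, "facebook": 200}

print(psi(baseline, baseline))    # 0.0, identical distributions
print(psi(baseline, production))  # well above zero, dominated by "facebook"
```

Because the new category has a near-zero baseline proportion, its term has a large log ratio and dominates the sum, which matches the threshold crossing described above.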

Data Quality Checks

In the production data, values like facebook are being recorded against the Affiliate Provider feature that were not part of the training data. You already noticed some drift on the Affiliate Provider feature, and you can investigate further by navigating to the Data Quality tab. Make sure to view the last 30 days by selecting the correct range in the top right corner of the screen. Arize keeps track of feature cardinality as well as fields with no data, and can pinpoint the exact time that this issue started.
You can also dive into data quality issues via the model Overview tab.
To be notified automatically when bad values are introduced in production, create a custom monitor for your features. From the model Monitors tab, click New Monitor and choose a monitor type (drift, performance, or data quality).
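
The checks described in this section (cardinality, values never seen in training, and fields with no data) can be approximated offline with a short sketch like the one below. The function name and sample values are assumptions for illustration and do not reflect how Arize computes these statistics internally.

```python
def data_quality_summary(training_values, production_values):
    """Summarize cardinality, unseen values, and missing data for one feature."""
    seen_in_training = {v for v in training_values if v is not None}
    present = [v for v in production_values if v is not None]
    missing = len(production_values) - len(present)
    return {
        "cardinality": len(set(present)),
        "unseen_values": sorted(set(present) - seen_in_training),
        "missing_fraction": missing / len(production_values) if production_values else 0.0,
    }


# Made-up Affiliate Provider values
training = ["direct", "google", "direct"]
production = ["direct", "facebook", None, "facebook"]

summary = data_quality_summary(training, production)
print(summary["unseen_values"])     # ['facebook']
print(summary["missing_fraction"])  # 0.25
```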

Performance Analysis

Model performance metrics measure how well your model performs in production. Once a performance monitor is triggered, navigate to the Performance tab to start troubleshooting your model issues and gain an understanding of what caused the degradation.
Compare production to training or to other windows of production. Bring in another dataset to compare performance and see which model performs better. This can help answer questions such as "Were we seeing this problem in training?" or "Does my new or previous model version perform better?"
Identify low-performing segments. By looking at the performance breakdown by feature, you can dig even deeper to see which segment within each feature of the model is underperforming.
Here we are:
1. Checking the performance over time overlaid with volume. Precision is about 76%.
2. Analyzing the confusion matrix, calibration curve, and performance impact score.
3. Observing that Austin performs very poorly and has a very strong impact on the precision score.
4. Excluding Austin with a filter.
5. Demonstrating that, with Austin excluded, overall precision in the production environment increases to 83%.
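
The steps above boil down to computing precision per segment and then recomputing it with one segment filtered out. The sketch below shows that idea on a handful of made-up records; the record layout, the "Denver" segment, and the function names are assumptions, and the numbers are not the ones from the dashboard.

```python
def precision(records):
    """Fraction of recommended songs (predicted == 1) that were played."""
    recommended = [r for r in records if r["predicted"] == 1]
    if not recommended:
        return None
    return sum(r["actual"] for r in recommended) / len(recommended)


def precision_by(records, feature):
    """Precision for each distinct value of one feature."""
    segments = {}
    for record in records:
        segments.setdefault(record[feature], []).append(record)
    return {value: precision(rows) for value, rows in segments.items()}


# Made-up production records
records = [
    {"city": "Austin", "predicted": 1, "actual": 0},
    {"city": "Austin", "predicted": 1, "actual": 0},
    {"city": "Austin", "predicted": 1, "actual": 1},
    {"city": "Denver", "predicted": 1, "actual": 1},
    {"city": "Denver", "predicted": 1, "actual": 1},
    {"city": "Denver", "predicted": 1, "actual": 0},
    {"city": "Denver", "predicted": 1, "actual": 1},
]

print(round(precision(records), 2))  # 0.57 overall
print(precision_by(records, "city"))  # Austin is the weakest segment

without_austin = [r for r in records if r["city"] != "Austin"]
print(precision(without_austin))  # 0.75 once Austin is excluded
```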

Root cause analysis for low performing cohorts in template view

The performance dashboards give us a 2D view of the data. With the Heatmap view, we can take the analysis a step further with a 3D view. Select Templates in the left sidebar, select Feature Performance Heatmap, and fill out the template.
The Feature Performance Heatmap provides model performance information across all features at various feature/value combinations, also known as slices. Feature Performance Heatmaps also support conditional filters.
This view automatically identifies the first step toward improving overall model performance: improving the slices with the poorest-performing predictions, which were in Austin with Heavy Metal, Hard Rock.
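
A heatmap of this kind is, in essence, precision computed for every combination of two feature values and then ranked. The sketch below is one way to reproduce that offline under stated assumptions: the record layout, the `rows` helper, the "Denver" and "Pop" values, and all counts are made up for illustration.

```python
from collections import defaultdict


def slice_precision(records, feature_a, feature_b):
    """Precision for each (feature_a value, feature_b value) slice."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [played, recommended]
    for r in records:
        if r["predicted"] == 1:
            key = (r[feature_a], r[feature_b])
            counts[key][0] += r["actual"]
            counts[key][1] += 1
    return {key: played / total for key, (played, total) in counts.items()}


def worst_slices(records, feature_a, feature_b, top=3):
    """Slices sorted from lowest to highest precision."""
    cells = slice_precision(records, feature_a, feature_b)
    return sorted(cells.items(), key=lambda item: item[1])[:top]


def rows(city, genre, outcomes):
    """Helper that builds made-up records for one slice."""
    return [{"city": city, "genre": genre, "predicted": 1, "actual": a} for a in outcomes]


records = (
    rows("Austin", "Heavy Metal, Hard Rock", [1, 0, 0, 0])
    + rows("Austin", "Pop", [1, 1, 0])
    + rows("Denver", "Heavy Metal, Hard Rock", [1, 1, 1, 0])
    + rows("Denver", "Pop", [1, 1])
)

worst = worst_slices(records, "city", "genre", top=1)
print(worst[0][0])  # ('Austin', 'Heavy Metal, Hard Rock')
```

In practice a minimum slice size is worth adding before ranking, since a slice with only a few predictions can look worst purely by chance.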

Custom Dashboard

Now that we have looked over data quality and feature drift, we will investigate model performance with Arize's performance dashboards, which give a single-glance view of the model's important metrics. While you can fully customize dashboards for your team, we will use the Scored Model Performance Dashboard template from the template tab in the left-hand sidebar.
Dashboards are made up of widgets designed for different types of analysis across your training, validation, and production environments:
  • Distribution Widget for analyzing data distribution changes over Feature, Prediction, and Actuals.
  • TimeSeries Widget for analyzing time-based data.
  • Statistic Widget for getting an aggregate statistic. Data Metrics and Evaluation Metrics charts are also available for this widget.
You can also slice and filter dashboards by any model, model version, and model dimension. For example, we can filter on features and the values of those features (like major cities) and investigate further at the cohort level.


Log feature importances to the Arize platform to explain your model's predictions. By logging these values, you gain the ability to view the global feature importances of your predictions as well as the ability to perform global and cohort prediction-based analysis to compare feature importances for your model's features.
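
To illustrate the difference between global and cohort-level importances, the sketch below aggregates per-prediction importance values by taking the mean absolute value per feature, first over all predictions and then over one cohort. The aggregation rule, the record layout, the feature names, and the numbers are all assumptions for this example; it does not show how values are logged to or aggregated by Arize.

```python
def mean_abs_importance(importance_rows):
    """Mean absolute importance per feature over a list of per-prediction dicts.

    Assumes every row contains a value for every feature.
    """
    totals = {}
    for row in importance_rows:
        for feature, value in row.items():
            totals[feature] = totals.get(feature, 0.0) + abs(value)
    return {feature: total / len(importance_rows) for feature, total in totals.items()}


# Made-up per-prediction importance values
predictions = [
    {"city": "Austin", "importances": {"genre": 0.5, "affiliate_provider": -0.1}},
    {"city": "Austin", "importances": {"genre": -0.3, "affiliate_provider": 0.1}},
    {"city": "Denver", "importances": {"genre": 0.1, "affiliate_provider": 0.4}},
]

global_view = mean_abs_importance([p["importances"] for p in predictions])
austin_view = mean_abs_importance(
    [p["importances"] for p in predictions if p["city"] == "Austin"]
)

print(global_view)  # importance over all predictions
print(austin_view)  # importance within the Austin cohort only
```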

