Search Ranking
Use the Arize platform to troubleshoot a search ranking model's performance
Copyright © 2023 Arize AI, Inc
Check out our blog post, How to Monitor Ranking Models, and follow along with our Colab examples here.
Search engines use ranking models to display query results ordered from most to least relevant. These rankings aim to maximize user actions, and those actions are then used to evaluate model performance.
The complexity of a ranking model makes failures challenging to pinpoint, because the model's dimensions expand with every recommendation. Notable challenges include upstream data quality issues, poor-performing segments, and the cold start problem.
This use case applies ML observability to improve an underperforming hotel booking ranking model. A poor-performing search ranking model is one that does not surface the most relevant items at the top of the list.
You will investigate model performance using NDCG at different @k values to improve this model and accurately predict relevant recommendations in the right order.
You will learn to:
Evaluate NDCG for different k values
Compare datasets to investigate where the model is underperforming
Use the feature performance heatmap to identify the root cause of our model issues
Gain a detailed view of how to log your ranking model schema here.
This use case utilizes a rank-aware evaluation metric, which is used to accurately weigh rank order and relevant recommendations within a list.
Normalized discounted cumulative gain (NDCG) is a rank-aware evaluation metric that measures a model's ability to rank query results in the order of the highest relevance (graded relevance).
Once you ingest ranking data, the Arize platform computes your rank-aware evaluation metric.
The k value sets the cutoff for the metric: it is computed over positions 1 through k of the sorted list.
For example, if k = 10, NDCG evaluates the top 10 items in the sorted list, whereas k = 20 evaluates the top 20 items.
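As a rough illustration of how the @k cutoff works, the sketch below computes NDCG@k for a single query in plain Python. It assumes the common log2 position discount with linear gain; the exact gain function the Arize platform applies is not stated here, and the names dcg_at_k and ndcg_at_k are illustrative, not part of the Arize SDK.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (ranks start at 1)."""
    return sum(
        rel / math.log2(pos + 1)
        for pos, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending relevance) ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        # No relevant items for this query: define the score as 0.
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg


# Hypothetical relevance labels, listed in the order the model ranked the items.
relevances = [0, 1, 0, 1, 1]
print(ndcg_at_k(relevances, k=3))
```

Only the first k entries of the ranked list contribute to the numerator, so a relevant item at position k + 1 or lower does not raise NDCG@k.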
Use the 'arize-demo-hotel-ranking' model, available in all free accounts, to follow along.
Configure our performance metric in the 'Config' tab.
Default Metric: NDCG
Default @K value: 10
Positive Class: Relevant
NDCG is low for k = 10, which indicates the model is failing to place the recommendations users interact with in the top 10 positions.
To investigate further, we'll increase our k value to 20. This helps us zoom out to have a more complete understanding of our ranking performance.
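When we increase our k value to 20, our model performance increases. This suggests that relevant items are landing in positions 11 through 20 rather than in the top 10. In other words, the model performs better lower in the list, which is not favorable for our use case.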
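To make this pattern concrete, here is a small sketch with made-up data showing how NDCG@20 can exceed NDCG@10 when most relevant items sit below position 10. The relevance list is hypothetical and is not taken from the demo model, and the NDCG helper assumes a log2 discount with linear gain.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (ranks start at 1)."""
    return sum(
        rel / math.log2(pos + 1)
        for pos, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@k normalized by the DCG@k of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg else 0.0


# Hypothetical ranked list of 20 results: relevant items (1) appear at
# positions 9, 12, 15, and 18, so only one of them is inside the top 10.
relevances = [0] * 20
for position in (9, 12, 15, 18):
    relevances[position - 1] = 1

ndcg_10 = ndcg_at_k(relevances, k=10)
ndcg_20 = ndcg_at_k(relevances, k=20)
print(f"NDCG@10 = {ndcg_10:.3f}, NDCG@20 = {ndcg_20:.3f}")
```

Both scores share the same ideal ordering here, so the gap comes entirely from the relevant items that only enter the calculation once the cutoff moves past position 10.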
To investigate further, we'll add a comparison dataset using high-performing training data. Adding a comparison dataset during a period of high performance can help easily identify problem areas and enable you to better understand your performance breakdown.
When we add our training dataset, we can confirm a significantly higher NDCG value than production. From here, scroll down to the 'Performance Breakdown' chart to analyze how our feature distributions differ between the two datasets.
When comparing two histograms, look for the following:
color differences (the redder, the worse the performance)
missing values
As we scroll through our performance breakdown, we notice a significant gap in training data compared to production in the feature search_activity. From there, click on the card to uncover a more detailed view of search_activity.
This view uncovers missing training data for skiing, sledding, snowboarding, and ice skating. To confirm our data quality issues, click on the 'View Feature Details' link on the top right, which will navigate us to a page where we visualize our distribution comparison, cardinality, and % empty over time.
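The same kind of gap can be checked offline before data reaches the platform. The sketch below is a minimal example in plain Python that lists feature values seen in production but never in training, along with their share of production traffic. The function name unseen_category_share and the sample values are hypothetical; only the feature name search_activity and the winter activities come from the walkthrough above.

```python
from collections import Counter


def unseen_category_share(training_values, production_values):
    """Map each production-only category to its fraction of production rows."""
    seen_in_training = set(training_values)
    total = len(production_values)
    if total == 0:
        return {}
    unseen_counts = Counter(
        value for value in production_values if value not in seen_in_training
    )
    return {category: count / total for category, count in unseen_counts.items()}


# Hypothetical search_activity values for each dataset.
training_activity = ["hiking", "swimming", "hiking", "kayaking"]
production_activity = ["skiing", "hiking", "sledding", "skiing"]

for category, share in unseen_category_share(
    training_activity, production_activity
).items():
    print(f"{category}: {share:.0%} of production rows, absent from training")
```

A large share of production traffic in categories the model never saw during training is a strong hint that retraining on fresh data is needed.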
The above analysis indicates users have been searching for winter activities while our model is optimized for summer/fall activities and has not been trained on new winter data. To improve our model performance and surface relevant recommendations in the right order, we need to retrain our model to account for a surge in winter destination/activity searches!