Ranking
How to log your model schema for ranking models
There are four key ranking model use cases to consider:
Search Ranking
Collaborative Filtering Recommender Systems
Content Filtering Recommender Systems
Classification-Based Ranking Models
Different metrics are used for ranking model evaluation depending on your model's use case and on score and label availability; the use case determines which performance metrics are available. Click here for all valid model types and metric combinations.
Ranking Cases | Example Use Case | Expected Fields | Performance Metrics
---|---|---|---
Case 1 | Model predicts a score used to rank | rank, relevance_score | Rank-aware metrics (e.g. NDCG)
Case 2 | Model predicts a binary action a user can take, which is used to rank | rank, relevance_labels | Rank-aware metrics (e.g. NDCG)
Case 3 | Model predicts multiple actions a user can take, which are used to rank | rank, relevance_labels (list of strings) | Rank-aware metrics (e.g. NDCG)
Case 4 | Model can also be evaluated using AUC + LogLoss | Ranking Case 2 or 3 fields, plus prediction_score | AUC, LogLoss
Ranking models have a few unique model schema fields that help Arize effectively monitor, trace, and visualize your ranking model data.
Prediction Group ID: A subgroup of prediction data. Max 100 ranked items within each group
Rank: Unique value within each prediction group (1-100)
Relevance Score/Label: The ground truth score or label associated with each prediction
In the ranking model context, a relevance score is a ground truth numerical score where the higher the relevance_score, the more important the item is. For example, if an item was clicked on it may have a relevance score of 0.5, whereas if it was purchased that relevance score would be 1.

rank and relevance_score are required to compute rank-aware evaluation metrics on your model.
Ranking Model Fields | Data Type | Example
---|---|---
rank | int from 1-100 | 1
relevance_score | numeric (float or int) | 0.5
prediction_group_id | string limited to 128 characters | "148"
An example DataFrame for this case contains columns such as state, price, search_id, rank, relevance_score, and prediction_ts.
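Below is a minimal sketch of logging this case in bulk with the Arize pandas SDK. The DataFrame values are illustrative, and the Schema parameter names (rank_column_name, relevance_score_column_name, prediction_group_id_column_name) and the Client credentials are assumptions; check the Pandas Batch Logging reference for the exact signature in your SDK version.

```python
import pandas as pd
from arize.pandas.logger import Client
from arize.utils.types import Environments, ModelTypes, Schema

# Illustrative search-ranking predictions: one row per ranked item,
# grouped by search_id (the prediction group).
df = pd.DataFrame(
    {
        "prediction_id": ["skiing_1", "skiing_2"],
        "prediction_ts": [1673777052, 1673777052],  # Unix epoch seconds
        "search_id": ["skiing", "skiing"],
        "rank": [1, 2],
        "relevance_score": [1.0, 0.5],  # e.g. purchase = 1, click = 0.5
        "state": ["CO", "CA"],
        "price": [289.0, 412.0],
    }
)

# Map DataFrame columns onto the ranking schema fields described above.
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_group_id_column_name="search_id",
    rank_column_name="rank",
    relevance_score_column_name="relevance_score",
    feature_column_names=["state", "price"],
)

client = Client(space_key="YOUR_SPACE_KEY", api_key="YOUR_API_KEY")
response = client.log(
    dataframe=df,
    model_id="hotel-search-ranking",
    model_version="v1",
    model_type=ModelTypes.RANKING,
    environment=Environments.PRODUCTION,
    schema=schema,
)
```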
For more details on the Python Batch API Reference, visit here: Pandas Batch Logging

In this case, relevance_score does not need to be passed in. Since relevance_score is required to compute rank-aware evaluation metrics, Arize uses an attribution model to create a relevance_score based on your positive class and relevance_labels. Learn more about our attribution model here.
Ranking Model Fields | Data Type | Example
---|---|---
rank | int from 1-100 | 1
relevance_labels | string | "click"
prediction_group_id | string limited to 128 characters | "148"
An example DataFrame for this case contains columns such as state, price, search_id, rank, actual_relevancy, and prediction_ts.
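A sketch of the schema change for this case, reusing the assumed parameter names from the first sketch above: the relevance score mapping is replaced by a relevance labels mapping (again, confirm the exact parameter names against the Pandas Batch Logging reference).

```python
from arize.utils.types import Schema

# Same mapping as before, but with a single ground-truth label per item
# (e.g. "click" or "buy") instead of a numeric relevance score.
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_group_id_column_name="search_id",
    rank_column_name="rank",
    relevance_labels_column_name="actual_relevancy",
    feature_column_names=["state", "price"],
)
```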
For more details on the Python Batch API Reference, visit here: Pandas Batch Logging

Since ground truth can contain multiple events, you can pass in multiple ground truth labels in a list. In this case, relevance_score does not need to be passed in. Since relevance_score is required to compute rank-aware evaluation metrics, Arize uses an attribution model to create a relevance_score based on your positive class and relevance_labels. Learn more about our attribution model here.
Ranking Model Fields | Data Type | Example
---|---|---
rank | int from 1-100 | 1
relevance_labels | list of strings | ["buy", "click"]
prediction_group_id | string limited to 128 characters | "148"
An example DataFrame for this case contains columns such as state, price, search_id, rank, attributions, and prediction_ts.
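Because each item can carry several ground-truth events, the label column holds a list of strings per row. A small illustrative sketch (the attributions column name is taken from the example columns above; it maps into the schema through the same assumed relevance-labels parameter):

```python
import pandas as pd

# One row per ranked item; each item can carry multiple ground-truth events.
df = pd.DataFrame(
    {
        "search_id": ["skiing", "skiing", "skiing"],
        "rank": [1, 2, 3],
        "attributions": [["buy", "click"], ["click"], ["scroll"]],
    }
)
```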
For more details on the Python Batch API Reference, visit here: Pandas Batch Logging

AUC and LogLoss are computed based on prediction_score and relevance_labels (or default relevance_labels in the case of multi-label).
Ranking Model Fields | Data Type | Example
---|---|---
rank | int from 1-100 | 1
prediction_score | float | 0.87
prediction_group_id | string limited to 128 characters | "148"
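For this case the schema additionally maps the model's predicted score alongside the relevance labels. A sketch under the same assumed parameter names (verify prediction_score_column_name against the Pandas Batch Logging reference):

```python
from arize.utils.types import Schema

# Relevance labels plus a prediction score enable AUC / LogLoss on top of
# the rank-aware metrics.
schema = Schema(
    prediction_id_column_name="prediction_id",
    prediction_group_id_column_name="search_id",
    rank_column_name="rank",
    prediction_score_column_name="prediction_score",
    relevance_labels_column_name="actual_relevancy",
)
```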
Rank-aware evaluation metrics: NDCG@k (MAP@k & MRR coming soon)
Evaluation metrics: AUC, PR-AUC, LogLoss
Normalized discounted cumulative gain (NDCG) is a rank-aware evaluation metric that measures a model's ability to rank query results in the order of the highest relevance (graded relevance). You can read more about how NDCG is computed here.
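As an informal illustration of the metric (not Arize's exact implementation, which is described in the linked reference), here is a minimal NDCG@k computation using the common rel / log2(position + 1) gain:

```python
import numpy as np

def ndcg_at_k(relevance_scores, k):
    """NDCG@k for one prediction group.

    `relevance_scores` are the ground-truth scores ordered by the model's
    predicted rank (rank 1 first).
    """
    rel = np.asarray(relevance_scores, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1)
    dcg = np.sum(rel / discounts)
    # Ideal DCG: the same items reordered by descending relevance.
    ideal = np.sort(np.asarray(relevance_scores, dtype=float))[::-1][:k]
    idcg = np.sum(ideal / discounts[: ideal.size])
    return float(dcg / idcg) if idcg > 0 else 0.0

# The model ranked a clicked item (0.5) above a purchased item (1.0),
# so NDCG@3 falls short of the perfect score of 1.0.
print(ndcg_at_k([0.5, 1.0, 0.0], k=3))  # ~0.86
```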
A relevance score is required to calculate rank-aware evaluation metrics. If your relevance_score is unavailable, the Arize platform will calculate a relevance_score using a simple attribution model with a prediction label and a relevance label. Arize computes a binary relevance value (0/1) based on the default positive class.
Positive class "buy" and relevance label is "buy" --> relevance will be attributed to 1.
Positive class "buy" and relevance label is else --> relevance will be attributed to 0.
Positive class "buy" and relevance labels are ["buy", "click", "scroll"] --> relevance will be attributed to sum([1, 0, 0]) = 1
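A hypothetical sketch of that attribution logic (illustrative only; Arize's actual attribution model is described in the linked docs):

```python
def attribute_relevance(relevance_labels, positive_class="buy"):
    """Binary attribution: each label matching the positive class counts as 1."""
    if isinstance(relevance_labels, str):
        relevance_labels = [relevance_labels]
    return sum(1 if label == positive_class else 0 for label in relevance_labels)

print(attribute_relevance("buy"))                       # 1
print(attribute_relevance("click"))                     # 0
print(attribute_relevance(["buy", "click", "scroll"]))  # sum([1, 0, 0]) = 1
```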
Ranking model: Assigns a rank to each item in a prediction group (also known as a batch or query), across many possible groups.
Arize supports pointwise, pairwise, and listwise ranking models
Prediction Group: A group of predictions within which items are ranked.
Example: A user of a hotel booking site types in a search term (“skiing”) and is presented with a list of results representing a single query
Rank: The predicted rank of an item in a prediction group (Integer between 1-100).
Example: Each item in the search prediction group has a rank determined by the model (e.g. Aspen is assigned rank=1, Tahoe is assigned rank=2, etc., based on the item and query features passed to the model)
Relevance Score (i.e. Actual Scores): The ground truth relevance score (numeric). Higher scores denote higher relevance.
Example: Each item in the search prediction group has a score determined by the action a user took on the item (e.g. clicking on an item indicates relevance score = 0.5, purchasing an item indicates relevance score = 1)
Rank-Aware Evaluation Metric: An evaluation metric that gauges both the rank order and the relevance of predictions.
Rank-aware evaluation metrics include NDCG, MRR, and MAP. Note that MRR and MAP also require relevance_labels in order to be computed.