Running Evals on Traces
How to use an LLM judge to label and score your application
This guide walks through evaluating traces captured in Phoenix and logging the results back to Phoenix so they appear alongside your traces in the UI.
The process is similar to evaluating a standalone dataset, but instead of creating your own dataset or using an existing external one, you'll export a trace dataset from Phoenix and log the evaluation results against those traces.
Note: if you're self-hosting Phoenix, swap your collector endpoint variable in the snippet below, and remove the Phoenix Client Headers variable.
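For context, a minimal configuration sketch for Phoenix Cloud might look like the following; the endpoint and API key are placeholders, and PHOENIX_COLLECTOR_ENDPOINT / PHOENIX_CLIENT_HEADERS are the environment variables Phoenix reads.

```python
import os
from getpass import getpass

# Phoenix Cloud: point the collector at Phoenix and pass an API key header.
# Both values below are placeholders.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={getpass('Enter your Phoenix API key: ')}"

# Self-hosted Phoenix: swap the endpoint for your own instance and remove the
# headers variable, e.g.
# os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://localhost:6006"
```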
Now that we have Phoenix configured, we can register that instance with OpenTelemetry, which allows us to collect traces from our application.
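A minimal registration sketch using the phoenix.otel helper; the project name is arbitrary, and auto_instrument is assumed to pick up any installed OpenInference instrumentors:

```python
from phoenix.otel import register

# Register a tracer provider pointed at the Phoenix instance configured above.
tracer_provider = register(
    project_name="evals-on-traces",  # hypothetical project name
    auto_instrument=True,            # enable any installed OpenInference instrumentors
)
```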
For the sake of making this guide fully runnable, we'll briefly generate some traces and track them in Phoenix. Typically, you would have already captured traces in Phoenix and would skip to "Download trace dataset from Phoenix".
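As a stand-in for a real application, the sketch below asks OpenAI for a few jokes (each call is captured as a trace by the instrumentation registered above) and then downloads the resulting spans as a pandas DataFrame. The model name and prompt are illustrative, and OPENAI_API_KEY is assumed to be set.

```python
import phoenix as px
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Generate a handful of traces: each chat completion becomes a span in Phoenix.
for _ in range(5):
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Tell me a short joke about programming."}],
    )

# Download the trace dataset from Phoenix as a DataFrame of spans.
spans_df = px.Client().get_spans_dataframe(project_name="evals-on-traces")
spans_df.head()
```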
Now that we have our trace dataset, we can generate evaluations for each trace. Evaluations can be generated in many different ways. Ultimately, we want to end up with a set of labels and/or scores for our traces.
You can generate evaluations using:
Plain code
LLM as a judge
Other evaluation packages
As long as you format your evaluation results properly, you can upload them to Phoenix and visualize them in the UI.
Let's start with a simple example of generating evaluations using plain code. OpenAI has a habit of repeating jokes, so we'll generate evaluations to label whether a joke is a repeat of a previous joke.
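A plain-code sketch of that evaluation follows. It assumes the spans DataFrame exposes the span id as a "context.span_id" column (or index) and the model's response text under "attributes.output.value"; if your column names differ, adjust accordingly.

```python
import pandas as pd

# Depending on the Phoenix version, the span id may live in the index rather than
# in a column; normalize it into a "context.span_id" column either way.
if "context.span_id" not in spans_df.columns:
    spans_df = spans_df.reset_index()

# Sort chronologically so "repeat" means "the same joke already appeared in an earlier trace".
spans_df = spans_df.sort_values("start_time")

# A joke is a repeat if its output text duplicates an earlier span's output.
outputs = spans_df["attributes.output.value"].fillna("")
is_repeat = outputs.duplicated(keep="first")

# One row per span, with a label and a score Phoenix can display.
evals_df = pd.DataFrame(
    {
        "span_id": spans_df["context.span_id"].values,
        "label": is_repeat.map({True: "repeat", False: "unique"}).values,
        "score": (~is_repeat).astype(int).values,  # 1 = unique joke, 0 = repeat
    }
)
evals_df.head()
```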
We now have a DataFrame with a column for whether each joke is a repeat of a previous joke. Let's upload this to Phoenix.
Our evals_df has a column for the span_id and a column for the evaluation result. The span_id is what allows us to connect the evaluation to the correct trace in Phoenix. Phoenix will also automatically look for columns named "label" and "score" to display in the UI.
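A sketch of the upload using Phoenix's SpanEvaluations container; the eval name is just the display name shown in the UI and can be anything.

```python
import phoenix as px
from phoenix.trace import SpanEvaluations

# Index the evaluations by span_id so Phoenix can attach each row to its span,
# then log them under a human-readable evaluation name.
px.Client().log_evaluations(
    SpanEvaluations(
        eval_name="Joke Repeat",  # hypothetical name shown in the Phoenix UI
        dataframe=evals_df.set_index("span_id"),
    )
)
```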
You should now see evaluations in the Phoenix UI!
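If you'd rather have an LLM judge assign the labels instead of plain code, one possible sketch uses Phoenix's llm_classify helper; the judge prompt, rails, and model choice here are illustrative assumptions, and the output column is renamed so the template can reference it.

```python
from phoenix.evals import OpenAIModel, llm_classify

# A hypothetical judge prompt: decide whether each joke is funny.
JUDGE_TEMPLATE = """You are judging a joke. Here is the joke:
{output}

Respond with a single word: funny or not_funny."""

# Feed the judge one row per span, indexed by span id so the results can be
# logged back to Phoenix with SpanEvaluations, just like the plain-code results.
judge_input_df = (
    spans_df.rename(columns={"attributes.output.value": "output"})
    .set_index("context.span_id")[["output"]]
)

judge_results = llm_classify(
    dataframe=judge_input_df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=JUDGE_TEMPLATE,
    rails=["funny", "not_funny"],
    provide_explanation=True,  # adds an "explanation" column alongside "label"
)
```

The resulting DataFrame has a "label" column (and an "explanation" column) indexed by span id, so it can be uploaded with SpanEvaluations in the same way as the plain-code results.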
From here you can continue collecting and evaluating traces, or move on to one of these other guides:
If you're interested in more complex evaluation and evaluators, start with the LLM as a Judge evaluators guide.
If you're ready to start testing your application in a more rigorous manner, check out the datasets and experiments guides.