This quickstart guide will walk you through the basics of evaluating data from your LLM application.
The first thing you'll need is a dataset to evaluate. This could be your own collected or generated set of examples, or data you've exported from Phoenix traces. If you've already collected some trace data, that makes a great starting point.
For the sake of this guide, however, we'll download some pre-existing data to evaluate. Feel free to substitute your own data; just be sure it includes the following columns:
- `reference`
- `query`
- `response`
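As a rough illustration, here is a minimal sketch of the expected shape of the dataframe. The rows and values are hypothetical placeholders; in practice `df` would come from your downloaded dataset or from spans exported out of Phoenix:

```python
import pandas as pd

# Hypothetical placeholder rows illustrating the expected columns;
# in practice, load your downloaded dataset or exported Phoenix spans here.
df = pd.DataFrame(
    {
        "reference": ["Phoenix is an open-source observability tool for LLM applications."],
        "query": ["What is Phoenix?"],
        "response": ["Phoenix is an open-source tool for tracing and evaluating LLM applications."],
    }
)
```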
Set up evaluators (in this case for hallucinations and Q&A correctness), run the evaluations, and log the results to visualize them in Phoenix. We'll use OpenAI as our evaluation model for this example, but Phoenix also supports a number of other models. First, we need to add our OpenAI API key to our environment.
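A minimal sketch of those steps is shown below. It assumes `df` is the dataframe described above and that the Phoenix evals package (`phoenix.evals`) is installed; the `gpt-4o` model name and the column renames are illustrative, so check each evaluator's page for the exact columns and models it supports.

```python
import os
from getpass import getpass

from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)

# Add the OpenAI API key to the environment if it isn't already set.
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

# The built-in evaluators look up specific column names (e.g. input, output, reference),
# so rename columns if your dataset uses different names.
df = df.rename(columns={"query": "input", "response": "output"})

# Use an OpenAI model as the judge for both evaluators.
eval_model = OpenAIModel(model="gpt-4o")
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)

# Run both evaluations; run_evals returns one result dataframe per evaluator.
# (In a notebook you may also need nest_asyncio.apply() so run_evals can run its async calls.)
hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanations=True,
)
```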
Explanation of the parameters used in `run_evals` above:
- `dataframe` - a pandas dataframe that includes the data you want to evaluate. This could be spans exported from Phoenix, or data you've brought in from elsewhere. The dataframe must include the columns expected by the evaluators you are using. To see the columns expected by each built-in evaluator, check the corresponding page in the Using Phoenix Evaluators section.
- `evaluators` - a list of built-in Phoenix evaluators to use.
- `provide_explanations` - a boolean flag that instructs the evaluators to generate explanations for their choices.
Combine your evaluation results and explanations with your original dataset:
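One way to do this, assuming the result dataframes returned by `run_evals` above include `label` and `explanation` columns and share the original dataframe's index, is to copy those columns back onto `df`:

```python
# Attach each evaluator's label and explanation to the original rows.
results_df = df.copy()
results_df["hallucination_eval"] = hallucination_eval_df["label"]
results_df["hallucination_explanation"] = hallucination_eval_df["explanation"]
results_df["qa_correctness_eval"] = qa_correctness_eval_df["label"]
results_df["qa_correctness_explanation"] = qa_correctness_eval_df["explanation"]

results_df.head()
```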
Note: You'll only be able to log evaluations to the Phoenix UI if you used a trace or span dataset exported from Phoenix as your dataset in this quickstart. If you used your own external dataset, you won't be able to log these results to Phoenix.
Provided you started from a trace dataset, you can log your evaluation results to Phoenix using the client's `log_evaluations` method.
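A sketch of that call is below; it assumes your Phoenix app is running, that `px.Client()` can reach it, and that the eval dataframes are still indexed by the span IDs from your export (the eval names are arbitrary labels that will show up in the UI):

```python
import phoenix as px
from phoenix.trace import SpanEvaluations

# Log both sets of evaluation results so they appear alongside your traces in the Phoenix UI.
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df),
)
```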