Using Evaluators
LLM Evaluators
We provide LLM evaluators out of the box. These evaluators are vendor agnostic and can be instantiated with a Phoenix model wrapper:
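For example, a pre-built evaluator such as HelpfulnessEvaluator can be backed by any supported model wrapper. A minimal sketch, assuming HelpfulnessEvaluator is available in phoenix.experiments.evaluators and OpenAIModel in phoenix.evals:

```python
from phoenix.evals import OpenAIModel
from phoenix.experiments.evaluators import HelpfulnessEvaluator

# Any Phoenix model wrapper can back the evaluator; swap in your preferred vendor.
helpfulness_evaluator = HelpfulnessEvaluator(model=OpenAIModel())
```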
Code Evaluators
Code evaluators are functions that evaluate the output of your LLM task without using another LLM as a judge. An example might be checking whether or not a given output contains a link, which can be implemented as a regex match.
phoenix.experiments.evaluators contains some pre-built code evaluators that can be passed to the evaluators parameter in experiments.
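For instance, the pre-built ContainsAnyKeyword evaluator can implement the link check described above. A sketch, assuming ContainsAnyKeyword is among the pre-built evaluators; the keyword list is illustrative:

```python
from phoenix.experiments.evaluators import ContainsAnyKeyword

# Flags outputs that contain a URL-like substring
contains_link = ContainsAnyKeyword(keywords=["https://", "http://"])
```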
The above contains_link evaluator can then be passed as an evaluator to any experiment you'd like to run.
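For example, assuming a dataset and task are already defined:

```python
from phoenix.experiments import run_experiment

experiment = run_experiment(dataset, task, evaluators=[contains_link])
```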
For a full list of code evaluators, please consult the repo or API documentation.
Custom Evaluators
The simplest way to create an evaluator is just to write a Python function. By default, a function of one argument will be passed the output of an experiment run. These custom evaluators can return either a boolean or a numeric value, which will be recorded as the evaluation score.
Imagine our experiment is testing a task that is intended to output a numeric value from 1-100. We can write a simple evaluator to check if the output is within the allowed range:
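A minimal sketch of such an evaluator:

```python
def in_bounds(x):
    # The output passes when it falls within the allowed 1-100 range
    return 1 <= x <= 100
```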
By simply passing the in_bounds function to run_experiment, we will automatically generate an evaluation for each experiment run indicating whether or not the output is in the allowed range.
More complex evaluations can use additional information. These values can be accessed by defining a function with specific parameter names which are bound to special values:
| Parameter name | Description | Example |
| --- | --- | --- |
| `input` | experiment run input | `def eval(input): ...` |
| `output` | experiment run output | `def eval(output): ...` |
| `expected` | example output | `def eval(expected): ...` |
| `reference` | alias for `expected` | `def eval(reference): ...` |
| `metadata` | experiment metadata | `def eval(metadata): ...` |
These parameters can be used in any combination and any order to write custom complex evaluators!
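For example, an exact-match check needs only the output and the expected value; the function name here is illustrative:

```python
def exact_match(output, expected) -> bool:
    # The parameter names determine which values are passed in; their order does not matter
    return output == expected
```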
Below is an example of using the editdistance library to calculate how close the output is to the expected value:
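A sketch of such an evaluator; serializing with json.dumps is an assumption made here so that structured outputs can be compared as strings (editdistance must be installed separately):

```python
import json

import editdistance


def edit_distance(output, expected) -> int:
    # Levenshtein distance between the serialized output and expected value
    return editdistance.eval(
        json.dumps(output, sort_keys=True), json.dumps(expected, sort_keys=True)
    )
```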
For even more customization, use the create_evaluator decorator to control how your evaluations show up in the Experiments UI.
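A sketch of a decorated evaluator; the name and kind arguments are assumptions about the decorator's display options:

```python
from phoenix.experiments.evaluators import create_evaluator


# `name` sets the display name of the evaluation in the UI;
# `kind` labels it as a code (non-LLM) evaluator (both assumed options).
@create_evaluator(name="shorter?", kind="CODE")
def wordiness_evaluator(expected, output):
    # Passes when the output is shorter (in words) than the expected text
    reference_length = len(expected.split())
    output_length = len(output.split())
    return output_length < reference_length
```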
The decorated wordiness_evaluator can be passed directly into run_experiment!