Using Evaluators

LLM Evaluators

We provide LLM evaluators out of the box. These evaluators are vendor agnostic and can be instantiated with a Phoenix model wrapper:

from phoenix.experiments.evaluators import HelpfulnessEvaluator
from phoenix.evals.models import OpenAIModel

helpfulness_evaluator = HelpfulnessEvaluator(model=OpenAIModel())

Code Evaluators

Code evaluators are functions that evaluate the output of your LLM task without using another LLM as a judge. An example might be checking whether a given output contains a link, which can be implemented as a regex match.

phoenix.experiments.evaluators contains some pre-built code evaluators that can be passed to the evaluators parameter in experiments.

from phoenix.experiments import run_experiment, MatchesRegex

# This defines a code evaluator for links
contains_link = MatchesRegex(
    pattern=r"[a-z]+://[^\s]+",
    name="contains_link",
)

The above contains_link evaluator can then be passed as an evaluator to any experiment you'd like to run.
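Conceptually, a regex code evaluator reduces to an ordinary function that returns a boolean score. A minimal plain-Python sketch of what such an evaluator computes (the pattern here is illustrative, not Phoenix's internal implementation):

```python
import re

# Illustrative pattern for "contains something that looks like a link"
URL_PATTERN = re.compile(r"[a-z]+://[^\s]+")

def contains_link_check(output: str) -> bool:
    """Return True if the output contains a URL-like substring."""
    return bool(URL_PATTERN.search(output))
```

A boolean returned this way is recorded as the evaluation score for the run.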

For a full list of code evaluators, please consult the repo or API documentation.

Custom Evaluators

The simplest way to create an evaluator is just to write a Python function. By default, a function of one argument will be passed the output of an experiment run. These custom evaluators can return either a boolean or a numeric value, which will be recorded as the evaluation score.

Imagine our experiment is testing a task that is intended to output a numeric value from 1-100. We can write a simple evaluator to check if the output is within the allowed range:

def in_bounds(x):
    return 1 <= x <= 100

By simply passing the in_bounds function to run_experiment, we will automatically generate an evaluation for each experiment run, recording whether or not the output is in the allowed range.

More complex evaluations can use additional information. These values can be accessed by defining a function with specific parameter names which are bound to special values:

| Parameter name | Description | Example |
| --- | --- | --- |
| `input` | experiment run input | `def eval(input): ...` |
| `output` | experiment run output | `def eval(output): ...` |
| `expected` | example output | `def eval(expected): ...` |
| `reference` | alias for `expected` | `def eval(reference): ...` |
| `metadata` | experiment metadata | `def eval(metadata): ...` |

These parameters can be used in any combination and any order to write custom complex evaluators!
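As a sketch, here is a hypothetical evaluator (not a built-in) that combines two bound parameters, `input` and `output`, to check whether the task's answer echoes any word from the question; Phoenix binds the arguments by name when the evaluator runs:

```python
def mentions_input_subject(input, output) -> bool:
    """Hypothetical check: does the output share any word with the input?

    `input` and `output` are bound by parameter name to the experiment
    run's input and output.
    """
    input_words = {w.lower() for w in str(input).split()}
    output_words = {w.lower() for w in str(output).split()}
    return bool(input_words & output_words)
```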

Below is an example of using the editdistance library to calculate how close the output is to the expected value:

pip install editdistance

import json

import editdistance

def edit_distance(output, expected) -> int:
    return editdistance.eval(
        json.dumps(output, sort_keys=True), json.dumps(expected, sort_keys=True)
    )
For even more customization, use the create_evaluator decorator to further customize how your evaluations show up in the Experiments UI.

from phoenix.experiments.evaluators import create_evaluator

# the decorator can be used to set display properties
# `name` corresponds to the metric name shown in the UI
# `kind` indicates if the eval was made with a "CODE" or "LLM" evaluator
@create_evaluator(name="shorter?", kind="CODE")
def wordiness_evaluator(expected, output):
    reference_length = len(expected.split())
    output_length = len(output.split())
    return output_length < reference_length

The decorated wordiness_evaluator can be passed directly into run_experiment!
