Create an Experiment Evaluator
How to write the functions to evaluate your task outputs in experiments
You have created your task, and now you'd like to write an evaluator to grade the task outputs in your experiments. If you don't know what these concepts are, read our overview of datasets and experiments.
There are three types of evaluators you can build: LLM, code, and annotation evaluators. We support all three (if you're looking for guidance on how to decide between them, we've got you covered).
Here's the simplest version of an evaluation function: define a plain function that reads the output of a task and checks it.
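As a minimal sketch, assuming your task output is a plain string (the link check below is just an illustrative rule, not part of the SDK), an evaluator can be an ordinary Python function that takes the task output and returns a result:

def contains_link(output):
    # Pass (True) if the task output includes a link, fail (False) otherwise.
    return "https://" in output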
Evaluation Inputs
The evaluator function can take the following optional arguments:
dataset_row
The entire row of the dataset, with every column available as a dictionary key.
def eval(dataset_row): ...

input
The experiment run input, which is mapped to attributes.input.value.
def eval(input): ...

output
The experiment run output.
def eval(output): ...

dataset_output
The expected output, if available, mapped to attributes.output.value.
def eval(dataset_output): ...

metadata
The dataset_row metadata, which is mapped to attributes.metadata.
def eval(metadata): ...
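For example, here is a sketch of an evaluator that uses two of these arguments together (the similarity scoring itself is just an assumed illustration, not a recommended metric):

import difflib

def similarity(output, dataset_output):
    # Compare the experiment run output to the expected output from the
    # dataset and return a similarity ratio in [0.0, 1.0] as the score.
    return difflib.SequenceMatcher(None, output, dataset_output).ratio()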
Evaluation Outputs
We support several types of evaluation outputs. A label must be a string, a score must be a float between 0.0 and 1.0, and an explanation must be a string.
boolean
Example: True
Result: label = 'True', score = 1.0

float
Example: 1.0
Result: score = 1.0

string
Example: "reasonable"
Result: label = 'reasonable'

tuple (score, explanation)
Example: (1.0, "my explanation notes")
Result: score = 1.0, explanation = 'my explanation notes'

tuple (label, score, explanation)
Example: ("True", 1.0, "my explanation")
Result: label = 'True', score = 1.0, explanation = 'my explanation'

EvaluationResult
Example:
EvaluationResult(
    score=1.0,
    label='reasonable',
    explanation='explanation',
    metadata={}
)
Result: score = 1.0, label = 'reasonable', explanation = 'explanation', metadata = {}
To use the EvaluationResult class, use the following import statement:
from arize.experimental.datasets.experiments.types import EvaluationResult
One of label or score must be supplied (you can't have an evaluation with no result).
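For example, here is a sketch of an evaluator that returns a full EvaluationResult (the 200-character limit is an assumed, illustrative rule):

from arize.experimental.datasets.experiments.types import EvaluationResult

def concise_enough(output):
    # Score 1.0 when the output stays under an assumed 200-character limit.
    passed = len(output) <= 200
    return EvaluationResult(
        score=1.0 if passed else 0.0,
        label="concise" if passed else "verbose",
        explanation=f"Output length was {len(output)} characters.",
    )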
Here's another example, which compares the output to a value in the dataset_row:
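A sketch, assuming the dataset has a column named "expected_output" (substitute whatever column your dataset actually uses):

def exact_match(output, dataset_row):
    # Compare the run output to the expected value stored in the dataset row.
    expected = dataset_row.get("expected_output", "")
    return output == expected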
To run the experiment, pass the evaluator to run_experiment as follows:
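A hedged sketch of wiring this together, assuming client is an already-configured ArizeDatasetsClient, my_task is the task function you defined earlier, and the space and dataset identifiers are placeholders (exact parameter names can vary by SDK version):

from arize.experimental.datasets import ArizeDatasetsClient

# client = ArizeDatasetsClient(...)  # configure with your own credentials

client.run_experiment(
    space_id="YOUR_SPACE_ID",
    dataset_name="my-dataset",
    task=my_task,                 # the task function you already defined
    evaluators=[exact_match],     # one or more evaluator functions
    experiment_name="exact-match-experiment",
)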
Learn more
Write an LLM Evaluator for experiments
Customize your code-based evaluator for experiments