Create an Experiment Evaluator
How to write the functions to evaluate your task outputs in experiments
You have set up an experiment and written a task, and now you'd like to write an evaluator to grade the task outputs in your experiments. If you don't know what these concepts are, read our documentation on datasets and experiments first.
There are three types of evaluators you can build: LLM, code, and annotation evaluators. We support all three. If you're looking for guidance on how to decide between them, see our documentation on evaluation types.
Here's the simplest version of an evaluation function:
You can define a simple function to read the output of a task and check it.
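A minimal sketch (the function name and the substring check are illustrative):

```python
def contains_link(output):
    # Check whether the task output includes a link.
    return "https://" in output
```

Because this returns a boolean, the result is converted into a label and score as described in the output-type table below.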
The evaluator function can take the following optional arguments:
| Argument | Description | Example |
| --- | --- | --- |
| `dataset_row` | the entire row of the data, with every column as a dictionary key | `def eval(dataset_row): ...` |
| `input` | experiment run input, which is mapped to `attributes.input.value` | `def eval(input): ...` |
| `output` | experiment run output | `def eval(output): ...` |
| `dataset_output` | the expected output if available, mapped to `attributes.output.value` | `def eval(dataset_output): ...` |
| `metadata` | `dataset_row` metadata, which is mapped to `attributes.metadata` | `def eval(metadata): ...` |
We support several types of evaluation outputs, which are parsed into a label, score, and/or explanation. A label must be a string, a score must be a float between 0.0 and 1.0, and an explanation must be a string.
| Return type | Example | Parsed result |
| --- | --- | --- |
| boolean | `True` | `label = 'True'`, `score = 1.0` |
| float | `1.0` | `score = 1.0` |
| string | `"reasonable"` | `label = 'reasonable'` |
| tuple | `(1.0, "my explanation notes")` | `score = 1.0`, `explanation = 'my explanation notes'` |
| tuple | `("True", 1.0, "my explanation")` | `label = 'True'`, `score = 1.0`, `explanation = 'my explanation'` |
| `EvaluationResult` | `EvaluationResult(score=1, label='reasonable', explanation='explanation', metadata={})` | `score = 1.0`, `label = 'reasonable'`, `explanation = 'explanation'`, `metadata = {}` |
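For example, an evaluator can return a (score, explanation) tuple directly. A sketch (the scoring rule is illustrative):

```python
def length_score(output):
    # Score outputs by length, capped at 1.0, and explain the result.
    score = min(len(output) / 100.0, 1.0)
    return score, f"output length was {len(output)} characters"
```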
To use `EvaluationResult`, use the following import statement:

```python
from arize.experimental.datasets.experiments.types import EvaluationResult
```

One of `label` or `score` must be supplied (you can't have an evaluation with no result).
Here's another example, which compares the output to a value in the `dataset_row`:
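A sketch, assuming the dataset has an `expected_label` column (the column name is illustrative):

```python
def matches_expected_label(output, dataset_row):
    # dataset_row is a dict with one key per dataset column.
    expected = dataset_row.get("expected_label")  # illustrative column name
    return EvaluationResult(
        score=1.0 if output == expected else 0.0,
        label=str(output == expected),
        explanation=f"expected {expected!r}, got {output!r}",
    )
```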
To run the experiment, pass the evaluator in when you run the experiment, as follows:
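A sketch of wiring the evaluator into an experiment run, assuming the Arize datasets client's `run_experiment` method and its `space_id`, `dataset_id`, `task`, `evaluators`, and `experiment_name` parameters (check the SDK reference for the exact signature):

```python
# Illustrative only: constructor and parameter names may differ in your SDK version.
from arize.experimental.datasets import ArizeDatasetsClient

client = ArizeDatasetsClient(developer_key="YOUR_DEVELOPER_KEY")  # assumed constructor

client.run_experiment(
    space_id="YOUR_SPACE_ID",
    dataset_id="YOUR_DATASET_ID",
    task=my_task,                         # the task function you defined earlier
    evaluators=[matches_expected_label],  # evaluator(s) from this guide
    experiment_name="evaluator-example",
)
```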