Experiments
Test and validate your LLM applications
Experiments help developers systematically test changes in their LLM applications using a curated dataset. Each experiment run is stored independently to measure the impact of changes over time.
A dataset is a collection of examples for evaluating your application. It is commonly represented as a pandas DataFrame or a list of dictionaries, where each row or dictionary can contain input messages, expected outputs, metadata, or any other tabular data you want to observe and test.
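For instance, a small dataset might pair inputs with expected outputs. The column names below ("input", "expected") are illustrative; use whatever fields your application needs:

```python
import pandas as pd

# Each row pairs an input with the output we expect from the application.
dataset = pd.DataFrame(
    [
        {"input": "What is the capital of France?", "expected": "Paris"},
        {"input": "What is 2 + 2?", "expected": "4"},
    ]
)
```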
A task is any function that you want to test against a dataset. Typically, the task wraps the LLM functionality you are evaluating.
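A minimal sketch of a task, assuming the OpenAI Python client as the model under test; the model name and message format are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def task(example: dict) -> str:
    # Run the functionality under test; model and prompt are placeholders.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": example["input"]}],
    )
    return response.choices[0].message.content
```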
An evaluator is a function that takes the output of a task and provides an assessment.
It serves as the measure of success for your experiment. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations.
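For example, a simple code-based evaluator might check for an exact match between the task output and the expected answer; the function below is illustrative:

```python
def exact_match(output: str, expected: str) -> float:
    # Score 1.0 when the task output matches the expected answer exactly
    # (ignoring case and surrounding whitespace), 0.0 otherwise.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0
```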
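Putting these pieces together, an experiment run applies the task to every example in the dataset and scores each output with the evaluators. The run_experiment helper below is a hypothetical sketch, not a library API; it shows how each run can be tagged with its own id so results are stored independently and compared over time:

```python
import uuid

def run_experiment(dataset, task, evaluators):
    # Tag every result with a fresh run id so each experiment run is
    # stored independently and can be compared across changes.
    run_id = str(uuid.uuid4())
    results = []
    for example in dataset.to_dict("records"):
        output = task(example)
        scores = {ev.__name__: ev(output, example["expected"]) for ev in evaluators}
        results.append(
            {"run_id": run_id, "input": example["input"], "output": output, "scores": scores}
        )
    return results

results = run_experiment(dataset, task, [exact_match])
```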
Quickstart: Create your first experiment
Learn about Evals: Understand where to deploy different kinds of evals