Experiments
Set up recurring evaluations against your curated datasets
An experiment allows you to systematically test and validate changes in your LLM applications using a curated dataset.
By defining a dataset (a collection of examples), creating tasks to generate outputs, and setting up evaluators to assess those outputs, you can run an experiment to see how well your updated pipeline performs.
Whether you're testing improvements with a golden dataset or troubleshooting issues with a problem dataset, experiments provide a structured way to measure the impact of your changes and ensure your application is on the right track.
A dataset is a collection of examples for evaluating your application. It is commonly represented as a pandas DataFrame or a list of dictionaries. Those examples can contain input messages, expected outputs, metadata, or any other tabular data you would like to observe and test.
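For instance, a small dataset might look like the following sketch, where the column names (`input`, `expected_output`) are illustrative rather than required:

```python
import pandas as pd

# Each example pairs an input with an expected output; additional keys
# (metadata, tags, etc.) simply become extra columns.
examples = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is 2 + 2?", "expected_output": "4"},
]

dataset = pd.DataFrame(examples)
```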
A task is any function that you want to test on a dataset. Usually, this task replicates LLM functionality.
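As a minimal sketch, a task can be a plain Python function that takes one example and returns an output. The `call_llm` helper below is a hypothetical stand-in for your actual LLM call or pipeline:

```python
def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM client call; swap in your own pipeline here.
    return "Paris" if "France" in prompt else "4"


def generate_answer(example: dict) -> str:
    """Task: produce an output for a single dataset example."""
    prompt = f"Answer concisely: {example['input']}"
    return call_llm(prompt)
```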
An evaluator is a function that takes the output of a task and provides an assessment.
It serves as the measure of success for your experiment. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations.
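Continuing the hypothetical sketches above (the `examples` list and the `generate_answer` task), a simple code-based evaluator can compare the task output to the expected output, and running the experiment amounts to applying the task and evaluators across every example:

```python
def exact_match(output: str, expected: str) -> float:
    """Code-based evaluator: 1.0 if the output matches the expected answer."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


# Run the experiment: apply the task to each example, then score the output.
results = []
for example in examples:
    output = generate_answer(example)
    score = exact_match(output, example["expected_output"])
    results.append({"input": example["input"], "output": output, "score": score})

average_score = sum(r["score"] for r in results) / len(results)
print(f"Accuracy: {average_score:.2f}")
```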
To get started with code, check out the Quickstart guide for experiments.
Learn more about evaluation concepts by reading Evaluation Concepts and our definitive guide on LLM app evaluation.
Look at end-to-end examples of experiments.