Overview: Datasets & Experiments
Set up recurring evaluations against your curated datasets
An experiment allows you to systematically test and validate changes in your LLM applications using a curated dataset.
By defining a dataset (a collection of examples), creating tasks to generate outputs, and setting up evaluators to assess those outputs, you can run an experiment to see how well your updated pipeline performs.
Whether you're testing improvements with a golden dataset or troubleshooting issues with a problem dataset, experiments provide a structured way to measure the impact of your changes and ensure your application is on the right track.
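Conceptually, an experiment is just this loop: run every dataset example through a task, then score each output with one or more evaluators. The sketch below is a plain-Python illustration with placeholder data and stub functions rather than a specific framework API; in practice the experiment runner also records the results so you can compare runs.

```python
# Illustrative only: a bare-bones experiment loop with placeholder data.
dataset = [
    {"input": "What is 2 + 2?", "expected_output": "4"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
]

def task(example: dict) -> str:
    # In a real experiment this would call your LLM pipeline; here we
    # return canned answers so the sketch runs as-is.
    canned = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "Paris",
    }
    return canned.get(example["input"], "")

def evaluator(output: str, example: dict) -> bool:
    # Simple exact-match check against the expected output.
    return output.strip() == example["expected_output"]

scores = [evaluator(task(example), example) for example in dataset]
print(f"Passed {sum(scores)}/{len(scores)} examples")
```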
Datasets are collections of examples that provide inputs and outputs for evaluating your application. These examples are used in experiments to track improvements to your prompt, LLM, or other parts of your LLM application.
The simplest way to think of a dataset is as a list of dictionaries (or, equivalently, the rows of a pandas DataFrame). Each dictionary can contain input messages, expected outputs, metadata, or any other tabular data you would like to observe and test.
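For example, a small dataset might look like the following; the field names are illustrative, and you can use whatever columns fit your application.

```python
import pandas as pd

# Each example is a dictionary of inputs, expected outputs, and optional metadata.
examples = [
    {
        "input": "How do I reset my password?",
        "expected_output": "Go to Settings > Account > Reset Password.",
        "metadata": {"topic": "account"},
    },
    {
        "input": "What plans do you offer?",
        "expected_output": "We offer Free, Pro, and Enterprise plans.",
        "metadata": {"topic": "pricing"},
    },
]

# The same examples viewed as a pandas DataFrame.
df = pd.DataFrame(examples)
print(df)
```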
You can create datasets in three ways:
Manually curated examples: You probably have a good idea of what good inputs and outputs look like for your application, and you can cover a few common edge cases by hand.
Historical logs: Once your app is live, you can pull key data points from production, such as spans with negative user feedback or long run times.
Synthetic data: Using LLMs, you can artificially generate examples to get many data points quickly; seed the generation with a few good handcrafted examples to keep the synthetic data realistic (see the sketch after this list).
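As a sketch of the synthetic route, the snippet below asks an LLM to extend a couple of handcrafted seed examples. It assumes the OpenAI Python client with an OPENAI_API_KEY in the environment, and the model name and prompt wording are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Handcrafted seed examples to anchor the style of the generated data.
seed_examples = [
    {"input": "How do I reset my password?",
     "expected_output": "Go to Settings > Account > Reset Password."},
    {"input": "What plans do you offer?",
     "expected_output": "We offer Free, Pro, and Enterprise plans."},
]

prompt = (
    "Here are example support questions with ideal answers:\n"
    f"{json.dumps(seed_examples, indent=2)}\n\n"
    "Generate 5 more question/answer pairs in the same style as a JSON list "
    "of objects with 'input' and 'expected_output' keys. Return only the JSON."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

# Review the generated pairs before adding them to your dataset.
print(response.choices[0].message.content)
```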
A task is any function that you want to test on a dataset. Usually, this task replicates LLM functionality. An example:
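The sketch below runs a single dataset example through a revised system prompt using the OpenAI Python client; the client, model name, prompt, and function signature are illustrative assumptions, not a required interface.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical revised prompt whose impact you want to measure.
NEW_SYSTEM_PROMPT = "You are a concise support assistant. Answer in one sentence."

def task(example: dict) -> str:
    """Run one dataset example through the updated prompt and return the output."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": NEW_SYSTEM_PROMPT},
            {"role": "user", "content": example["input"]},
        ],
    )
    return response.choices[0].message.content
```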
If you've made a prompt change, you can use tasks to run through your new prompt with the same examples in your dataset.
An evaluator is a function that takes the output of a task and provides an assessment.
It serves as the measure of success for your experiment. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations.
Here's an example that checks whether the output falls within the bounds of 1 to 100.
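The function signature is a sketch; depending on your setup, an evaluator may also receive the dataset example or return a numeric score rather than a boolean.

```python
def in_bounds(output: str) -> bool:
    """Return True if the task output parses as a number between 1 and 100."""
    try:
        value = float(output)
    except ValueError:
        return False
    return 1 <= value <= 100

# Example usage
assert in_bounds("42")
assert not in_bounds("150")
assert not in_bounds("not a number")
```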
To get started with code, check out the Quickstart guide for experiments.
Learn more about evaluation concepts by reading Evaluation Basics and our definitive guide on LLM app evaluation.
Look at end-to-end examples of Agent Evaluation, RAG Evaluation, and Voice Evaluation.