An experiment allows you to systematically test and validate changes in your LLM applications using a curated dataset.
By defining a dataset (a collection of examples), creating tasks to generate outputs, and setting up evaluators to assess those outputs, you can run an experiment to see how well your updated pipeline performs.
Whether you're testing improvements with a golden dataset or troubleshooting issues with a problem dataset, experiments provide a structured way to measure the impact of your changes and ensure your application is on the right track.
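Conceptually, an experiment is a loop over the dataset: run the task on each example, then score the output with each evaluator. The sketch below illustrates that flow in plain Python; it is a simplified illustration, not the API of any particular experiment runner.

```python
from typing import Callable

def run_experiment(
    dataset: list[dict],
    task: Callable[[dict], str],
    evaluators: dict[str, Callable[[str], bool]],
) -> list[dict]:
    """Run every example through the task, then score the output with each evaluator."""
    results = []
    for example in dataset:
        output = task(example)
        scores = {name: evaluate(output) for name, evaluate in evaluators.items()}
        results.append({"example": example, "output": output, "scores": scores})
    return results
```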
Datasets are collections of examples that provide inputs and outputs for evaluating your application. These examples are used in experiments to track improvements to your prompt, LLM, or other parts of your LLM application.
The simplest way to think of a dataset is as a pandas DataFrame built from a list of dictionaries. Those dictionaries can contain input messages, expected outputs, metadata, or any other tabular data you would like to observe and test.
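For illustration, here is a minimal sketch of such a dataset built as a pandas DataFrame; the column names (`input`, `expected_output`, `metadata`) are placeholders, and the exact fields your experiment tooling expects may differ.

```python
import pandas as pd

# Each row is one example: an input, an (optional) expected output, and metadata.
examples = [
    {
        "input": "What is the capital of France?",
        "expected_output": "Paris",
        "metadata": {"category": "geography"},
    },
    {
        "input": "What is 2 + 2?",
        "expected_output": "4",
        "metadata": {"category": "arithmetic"},
    },
]

dataset = pd.DataFrame(examples)
```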
You can create datasets in three ways:

- Manually curated examples: You probably have a good idea of what good inputs and outputs look like for your application, and can cover a few common edge cases.
- Production logs: Once your app is live, you can track key data points, such as spans with negative user feedback or long run times.
- Synthetic data: Using LLMs, you can artificially generate examples to get many data points quickly. Use a few good handcrafted examples to help generate good synthetic data (see the sketch after this list).
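For the synthetic route, a minimal sketch might look like the following; it assumes the OpenAI Python client and that the model returns well-formed JSON, which you would want to validate in practice.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A few handcrafted seed examples to steer the generation.
seed_examples = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is 2 + 2?", "expected_output": "4"},
]

def generate_synthetic_examples(n: int = 5) -> list[dict]:
    """Ask an LLM to produce n new examples shaped like the seeds."""
    prompt = (
        "Here are example input/expected_output pairs as JSON:\n"
        f"{json.dumps(seed_examples, indent=2)}\n\n"
        f"Generate {n} new, similar pairs. Respond with only a JSON array of "
        'objects that have "input" and "expected_output" keys.'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)  # validate in real use
```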
A task is any function that you want to test on a dataset. Usually, this task replicates LLM functionality. An example:
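The sketch below assumes the OpenAI Python client and a dataset example with an `input` field; swap in your own model call, prompt, and field names.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def my_task(example: dict) -> str:
    """Run one dataset example through the prompt/model combination under test."""
    prompt = f"Answer the question concisely:\n{example['input']}"  # 'input' field is illustrative
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```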
If you've made a prompt change, you can use a task to run the same dataset examples through your new prompt.
An evaluator is a function that takes the output of a task and provides an assessment.
It serves as the measure of success for your experiment. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations.
Here's an example that checks whether the output is within the bounds of 1 to 100.
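The signature an experiment runner expects can vary (some frameworks pass the whole example alongside the output), so treat this as a sketch:

```python
def in_bounds(output) -> bool:
    """Code-based evaluator: pass if the task output parses as a number between 1 and 100."""
    try:
        value = float(output)
    except (TypeError, ValueError):
        return False
    return 1 <= value <= 100
```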
To get started with code, check out the quickstart for experiments.
Learn more about evaluation concepts by reading the related guides.
Look at end-to-end examples of experiments.
Set up recurring evaluations against your curated datasets.