Concepts: Datasets & Experiments
An experiment allows you to systematically test and validate changes in your LLM applications using a curated dataset. By defining a dataset (a collection of examples), creating tasks to generate outputs, and setting up evaluators to assess those outputs, you can run an experiment to see how well your updated pipeline performs. Whether you're testing improvements with a golden dataset or troubleshooting issues with a problem dataset, experiments provide a structured way to measure the impact of your changes and ensure your application is on the right track.
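At a high level, the pieces fit together as in the sketch below. This is a conceptual illustration only, not the actual client API; the `run_experiment` function and the shapes of the examples, task, and evaluators are assumptions for illustration.

```python
# Conceptual sketch of an experiment run; not the real library API.
# `dataset`, `task`, and `evaluators` are illustrative assumptions.

def run_experiment(dataset, task, evaluators):
    """Run every dataset example through the task, then score each output."""
    results = []
    for example in dataset:
        output = task(example)  # e.g. call your updated prompt / LLM pipeline
        scores = {
            name: evaluator(output, example)  # each evaluator returns an assessment
            for name, evaluator in evaluators.items()
        }
        results.append({"example": example, "output": output, "scores": scores})
    return results
```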
Datasets are collections of examples that provide inputs and outputs for evaluating your application. These examples are used in experiments to track improvements to your prompt, LLM, or other parts of your LLM application.
You can create datasets in three ways:
Manually Curated Examples: This is how we recommend you start. From building your application, you probably already have an idea of the types of inputs your application should handle and what "good" responses look like, and you will want to cover a handful of the common cases and edge cases you can imagine. Even 20 high-quality, manually curated examples can go a long way (see the sketch after this list).
Historical Logs: Once your app is live, production traffic becomes a source of examples. Capture key data points, such as spans with negative user feedback or long run times, and use those signals and other heuristics to identify valuable data for refining and testing your application.
Synthetic Data: Once you have a few examples, you can generate additional examples artificially to scale up your dataset quickly. It's generally advised to have a few good handcrafted examples first, since synthetic data tends to resemble its source examples.
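For instance, a small hand-curated dataset can be as simple as a list of input/expected-output pairs. The field names and schema below are illustrative assumptions, not a required format:

```python
# A hand-curated dataset: a few representative inputs paired with responses
# you consider "good". Keys like "input" and "expected_output" are assumptions.
dataset = [
    {
        "input": "How do I reset my password?",
        "expected_output": "Go to account settings, choose Reset Password, and follow the emailed link.",
    },
    {
        "input": "I was charged twice this month.",
        "expected_output": "Apologize, confirm the duplicate charge, and explain the refund process.",
    },
    # ...aim for ~20 examples that cover your common cases and edge cases
]
```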
Use datasets to:
Store evaluation test cases for your eval script instead of managing large JSONL or CSV files
Capture generations to assess quality manually or using LLM-graded evals
Store user-reviewed generations to find new test cases
A task is any function or process that produces a JSON-serializable output. Typically, a task replicates the LLM functionality you're aiming to test. For instance, if you've made a prompt change, your task will run each example through the new prompt to generate an output. The task is used in the experiment to process the dataset, producing outputs that will be evaluated in the next step.
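As an example, a task for testing a prompt change might look like the sketch below. The prompt template and `call_llm` stand in for your own prompt and model client; they are assumptions for illustration, not part of any specific API.

```python
NEW_PROMPT_TEMPLATE = (
    "You are a concise support assistant. "
    "Answer the user's question in two sentences or fewer.\n\n"
    "Question: {question}"
)

def call_llm(prompt: str) -> str:
    """Placeholder for your model client (e.g. an OpenAI or Anthropic call)."""
    raise NotImplementedError("Swap in your own LLM call here.")

def task(example: dict) -> dict:
    """Run one dataset example through the updated prompt; return a JSON-serializable output."""
    prompt = NEW_PROMPT_TEMPLATE.format(question=example["input"])
    answer = call_llm(prompt)
    return {"answer": answer, "prompt": prompt}
```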
An evaluator is any function that takes the output of a task and provides an assessment. It serves as the measure of success for your experiment, helping you determine whether your changes have achieved the desired results. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations. The evaluator is central to testing and validating the outcomes of your experiment.
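Evaluators can be simple code checks or LLM judges. The signatures below mirror the sketches above and are assumptions about shape, not a prescribed interface:

```python
def contains_expected_answer(output: dict, example: dict) -> float:
    """Code-based evaluator: 1.0 if the expected text appears in the model's answer."""
    return 1.0 if example["expected_output"].lower() in output["answer"].lower() else 0.0

def answer_is_concise(output: dict, example: dict) -> float:
    """Heuristic evaluator: reward answers of 50 words or fewer."""
    return 1.0 if len(output["answer"].split()) <= 50 else 0.0

# Multiple evaluators can score the same output. An LLM-graded evaluator would
# follow the same shape but prompt a judge model instead of using a heuristic.
evaluators = {
    "contains_expected": contains_expected_answer,
    "concise": answer_is_concise,
}
```

With the dataset, task, and evaluators defined, running the experiment loop sketched earlier scores every example and lets you compare results across changes.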