Concepts: Experiments
An experiment allows you to systematically test and validate changes in your LLM applications using a curated dataset. By defining a dataset (a collection of examples), creating tasks to generate outputs, and setting up evaluators to assess those outputs, you can run an experiment to see how well your updated pipeline performs. Whether you're testing improvements with a golden dataset or troubleshooting issues with a problem dataset, experiments provide a structured way to measure the impact of your changes and ensure your application is on the right track.
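Concretely, these pieces compose into a short script. The following is a minimal sketch assuming the Phoenix Python client (arize-phoenix) and a dataset named "golden-dataset" that has already been uploaded; `my_llm_app`, the column names, and the exact-match evaluator are illustrative placeholders, and the exact API surface may vary by client version.

```python
import phoenix as px
from phoenix.experiments import run_experiment

def my_llm_app(question: str) -> str:
    # Placeholder for the pipeline under test (e.g. your updated prompt).
    return "stubbed answer"

def task(input: dict) -> str:
    # Phoenix binds each example's input dict to a parameter named `input`.
    return my_llm_app(input["question"])

def exact_match(output: str, expected: dict) -> float:
    # Code-based evaluator: 1.0 when the output matches the reference answer.
    return float(output == expected["answer"])

# Assumes a running Phoenix instance with a dataset named "golden-dataset".
dataset = px.Client().get_dataset(name="golden-dataset")
experiment = run_experiment(dataset, task, evaluators=[exact_match])
```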
The development of AI applications is often bottlenecked by the need for high-quality evaluations, since AI engineers must navigate difficult trade-offs between performance, latency, and cost. Experiments are crucial in this process because they enable developers to make informed decisions as they iterate on their prompts and models.
Datasets

Datasets are collections of examples that provide the inputs and, optionally, expected reference outputs for evaluating your application. These examples are used in experiments to track improvements to your prompt, LLM, or other parts of your LLM application. For more information on creating datasets, check out the Quickstart.
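As a concrete example, a dataset can be built from a pandas DataFrame and uploaded with the Phoenix client. The names below ("golden-dataset", the question/answer columns) are illustrative assumptions, and the call assumes a reachable Phoenix server.

```python
import pandas as pd
import phoenix as px

# Two illustrative examples: an input question and an expected reference answer.
df = pd.DataFrame(
    {
        "question": ["What is Paris the capital of?", "What is 2 + 2?"],
        "answer": ["France", "4"],
    }
)

dataset = px.Client().upload_dataset(
    dataset_name="golden-dataset",  # illustrative name
    dataframe=df,
    input_keys=["question"],        # columns that form each example's input
    output_keys=["answer"],         # columns that form the expected output
)
```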
Tasks

A task is any function or process that produces a JSON-serializable output. Typically, a task replicates the LLM functionality you're aiming to test. For instance, if you've made a prompt change, your task will run the examples through the new prompt to generate an output. The task is used in the experiment to process the dataset, producing outputs that will be evaluated in the next steps.
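For example, a task testing a prompt change might look like the sketch below; the OpenAI client usage, model name, and prompt template are illustrative assumptions rather than required choices.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = "Answer concisely: {question}"  # the prompt change under test

def task(input: dict) -> str:
    # Phoenix passes the example's input to a parameter named `input`.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(**input)}],
    )
    # Return a JSON-serializable output for the evaluators.
    return response.choices[0].message.content
```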
Evaluators

An evaluator is any function that takes the output of a task and provides an assessment. It serves as the measure of success for your experiment, helping you determine whether your changes have achieved the desired results. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations. The evaluator is central to testing and validating the outcomes of your experiment.
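Because evaluators are just functions, code-based checks can be one-liners, and an LLM-based judge is simply a function that calls a model internally. The two sketches below are illustrative code-based evaluators; the names and the length threshold are assumptions.

```python
def contains_answer(output: str, expected: dict) -> float:
    # Score 1.0 when the reference answer appears anywhere in the output.
    return float(expected["answer"].lower() in (output or "").lower())

def is_concise(output: str) -> bool:
    # An independent check: flag answers longer than 200 characters.
    return len(output or "") <= 200
```

Both can be passed together, e.g. `evaluators=[contains_answer, is_concise]`, and each contributes its own score to the experiment results.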