Evaluate experiment
Evaluate your experiment results in the Arize UI without code
In addition to continuous evaluations on projects, Arize makes it easy to evaluate experiments. This lets you compare performance across different prompt templates, models, or configurations, all on the same dataset, to identify what works best.
You can save your playground outputs as an experiment, and then kick off an evaluation task to run comparisons with an LLM as a Judge.
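For intuition, an LLM-as-a-Judge evaluation boils down to a templated prompt that asks a judge model to label each experiment output. The sketch below is a rough code analogue built on the open-source phoenix.evals library; the judge template, model name, column names, and exact parameter names are illustrative assumptions, and in the UI you configure the equivalent without writing any of this.

```python
# Rough sketch of what an LLM-as-a-Judge evaluation does: a judge model
# labels each experiment output as correct or incorrect against a rubric.
# Assumes the open-source phoenix.evals package; names are illustrative
# and may differ across library versions.
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

JUDGE_TEMPLATE = """You are judging a model's answer to the question below.
Question: {input}
Answer: {output}
Respond with a single word, "correct" or "incorrect"."""

# Experiment outputs to judge (in the UI, these come from your saved experiment).
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "output": ["Paris is the capital of France."],
    }
)

results = llm_classify(
    dataframe=df,                         # experiment outputs to evaluate
    model=OpenAIModel(model="gpt-4o-mini"),
    template=JUDGE_TEMPLATE,
    rails=["correct", "incorrect"],       # constrain the judge's label space
    provide_explanation=True,
)
print(results[["label", "explanation"]])
```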
For advanced use cases, you can run an experiment using our SDK and define your own code evaluators.
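As a loose sketch of that SDK path, a code evaluator is usually just a Python function that scores one row's output. The client and method names below (ArizeDatasetsClient, run_experiment) and their parameters are assumptions based on the SDK's experimental datasets client, and the column names "input" and "expected_output" depend on your dataset, so check the SDK reference for the exact interface in your version.

```python
# Minimal sketch of running an experiment with a custom code evaluator via
# the Arize SDK. The client constructor and run_experiment parameters are
# assumptions; verify them against the SDK reference for your version.
from arize.experimental.datasets import ArizeDatasetsClient


def generate_answer(prompt: str) -> str:
    # Placeholder for your own model or prompt-template call.
    return f"stub answer for: {prompt}"


def task(dataset_row) -> str:
    # The task produces an output for one dataset row.
    return generate_answer(dataset_row["input"])


def exact_match(output, dataset_row) -> float:
    # Custom code evaluator: 1.0 when the output matches the expected
    # answer stored on the dataset row, 0.0 otherwise.
    expected = str(dataset_row.get("expected_output", "")).strip()
    return 1.0 if output.strip() == expected else 0.0


client = ArizeDatasetsClient(developer_key="YOUR_DEVELOPER_KEY")  # auth parameter is an assumption
client.run_experiment(
    space_id="YOUR_SPACE_ID",
    dataset_id="YOUR_DATASET_ID",
    task=task,
    evaluators=[exact_match],
    experiment_name="prompt-v2-exact-match",
)
```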
To run your first evaluation on experiments:
1. Create a New Task: Go to the Evals & Tasks page and click "New Task" (top-right corner).
2. Choose Dataset (not Project): Select "Dataset" as the evaluation target. Only datasets with at least one experiment are eligible.
3. Select Your Experiments: Choose the experiments you want to evaluate from the dropdown menu. You can compare outputs across different prompt or model configurations.
Task status and aggregate metrics will appear on the Experiments tab of the dataset page. You can also track progress in the task logs.
To view all evaluators associated with a dataset, click the "Evaluators" link at the top of the dataset page. You can re-run an evaluator by selecting "Run on Experiment" from the dropdown next to the evaluator.