Experiment Evals
Evaluate your experiment results in the Arize UI without code
Running experiments helps you systematically test and refine your AI applications across different prompt templates, models, or configurations — all on the same dataset — to identify what works best.
But instead of inspecting every experiment run by hand, Arize now makes it easy to evaluate experiments. You can select the experiments you want to compare, and then kick off an evaluation task to run comparisons with an LLM as a Judge.
For advanced use cases, you can also write evaluations in code and define your own evaluators.
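If you do take the code path, an evaluator is conceptually a function that receives an experiment run's input and output and returns a label or score. The following is a minimal sketch of an LLM-as-a-judge evaluator using the OpenAI Python client; the `judge_output` name and the prompt template are illustrative assumptions, not part of the Arize SDK.

```python
# Hypothetical sketch of an LLM-as-a-judge evaluator.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are evaluating the output of an AI application.

Question: {input}
Answer: {output}

Respond with a single word: "correct" or "incorrect"."""


def judge_output(input: str, output: str) -> str:
    """Ask an LLM judge to label a single experiment run's output."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any judge model of your choice
        messages=[
            {"role": "user", "content": JUDGE_TEMPLATE.format(input=input, output=output)}
        ],
    )
    return response.choices[0].message.content.strip().lower()
```

The evaluation task Arize runs in the UI follows the same pattern: each experiment run is passed to a judge prompt, and the resulting labels are aggregated into the metrics shown on the dataset page.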
To run your first evaluation on experiments:
1. Create a new task: Go to the Evals & Tasks page and click "New Task" (top-right corner).
2. Choose a dataset (not a project): Select "Dataset" as the evaluation target. Only datasets with at least one experiment are eligible.
3. Select your experiments: Choose the experiments you want to evaluate from the dropdown menu. You can compare outputs across different prompt or model configurations.
Task status and aggregate metrics will appear on the Experiments tab of the dataset page. You can also track the task status in the task logs.
To view all evaluators associated with a dataset, click the "Evaluators" link at the top of the dataset page. You can re-run an evaluator by selecting "Run on Experiment" from the dropdown next to the evaluator.