Evaluations on Experiments
In addition to supporting continuous evaluations on projects, Arize now makes it easy to evaluate experiments. This enables users to compare performance across different prompt templates, models, or configurations — all on the same dataset — to identify what works best.
With Evaluations on Experiments, everything happens through a fully guided UI workflow, so no coding is required. You can:
Kick off an evaluation task to run comparisons with an LLM as a Judge
For advanced use cases, launch an experiment via the SDK and define your own code evaluators (see the sketch below).
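For the SDK route, both the experiment run and a custom code evaluator are defined in Python. The sketch below is illustrative only: the client class, method, and parameter names (ArizeDatasetsClient, run_experiment, developer_key, space_id, dataset_id) and the dataset column names are assumptions, so check them against the current Arize Python SDK reference before running.

```python
# Minimal sketch of launching an experiment with a custom code evaluator.
# All names below are illustrative assumptions; confirm imports, signatures,
# and dataset column names against the Arize Python SDK docs.
from arize.experimental.datasets import ArizeDatasetsClient


def task(dataset_row) -> str:
    """The task under test: call your prompt/model and return its output."""
    prompt = dataset_row["input"]  # assumes the dataset has an "input" column
    # Replace with a real LLM call for the prompt template / model you are comparing.
    return f"echo: {prompt}"


def contains_expected_answer(output, dataset_row) -> float:
    """Custom code evaluator: 1.0 if the expected answer appears in the output."""
    expected = str(dataset_row.get("expected", ""))  # assumes an "expected" column
    return 1.0 if expected and expected.lower() in str(output).lower() else 0.0


client = ArizeDatasetsClient(developer_key="YOUR_DEVELOPER_KEY")

client.run_experiment(
    space_id="YOUR_SPACE_ID",
    dataset_id="YOUR_DATASET_ID",
    task=task,
    evaluators=[contains_expected_answer],
    experiment_name="prompt-template-v2",
)
```

Returning a plain float keeps the evaluator simple; if the SDK's evaluator interface supports richer results (labels, explanations), those follow the same pattern.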
To run your first evaluation on experiments:
1. Create a New Task: Go to the Evals & Tasks page and click "New Task" (top-right corner).
2. Choose Dataset (not Project): Select "Dataset" as the evaluation target. Only datasets with at least one experiment are eligible.
3. Select Your Experiments: Choose the experiments you want to evaluate from the dropdown menu. You can compare outputs across different prompt or model configurations.
4. Track Progress and Metrics: Once started, task status and aggregate metrics appear on the Experiments tab of the dataset page. You can also track the task status in the task logs.
5. Drill Into Results: Click into an experiment to view evaluation labels, LLM-generated explanations, and metadata such as model name and prompt template.
6. Manage and Re-Run Evaluators: To view all evaluators associated with a dataset, click the "Evaluators" link at the top of the dataset page. You can re-run an evaluator by selecting "Run on Experiment" from the dropdown next to it.