Evaluate experiments
Evaluate your experiment results in the Arize UI without code
Arize has built-in evals for your traces and your experiments. Compare performance across different prompt templates, models, or configurations — all on the same dataset — to identify what works best.
This guide shows you how to evaluate your LLM outputs against a dataset without code. You can save your playground outputs as an experiment, and then kick off an evaluation task to run comparisons with an LLM as a Judge. For advanced use cases, you can evaluate experiments with code using our SDK.
Evaluate experiments with code

To run your first evaluation on experiments:
1. Go to the Evals & Tasks page and click "New Task" (top-right corner).
2. Select "Dataset" as the evaluation target. Only datasets with at least one experiment are eligible.
3. Choose the experiments you want to evaluate from the dropdown menu.
Once the task runs, its status and aggregate metrics appear on the Experiments tab of the dataset page. You can also follow progress in the task logs.
To view all evaluators associated with a dataset, click the "Evaluators" link at the top of the dataset page. You can re-run an evaluator by selecting "Run on Experiment" from the dropdown next to the evaluator.
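If you prefer the SDK route mentioned above, the sketch below shows roughly what that looks like: define a task and an evaluator function, then run them against a dataset. The client, method, and parameter names here (ArizeDatasetsClient, run_experiment, space_id, dataset_id) are assumptions drawn from the Arize datasets SDK rather than a definitive reference; see the Evaluate experiments with code guide for the authoritative API.

```python
# Minimal sketch, assuming the Arize datasets SDK exposes ArizeDatasetsClient
# and a run_experiment method -- verify names and parameters against the
# "Evaluate experiments with code" guide before using.
from arize.experimental.datasets import ArizeDatasetsClient


def run_task(example):
    # Placeholder task: call your LLM here (or reuse saved playground output).
    return f"model output for {example}"


def output_is_non_empty(output) -> float:
    # Simple evaluator: score 1.0 when the task produced any output, else 0.0.
    return 1.0 if output else 0.0


# Credentials and IDs below are hypothetical placeholders.
client = ArizeDatasetsClient(api_key="YOUR_API_KEY")

client.run_experiment(
    space_id="YOUR_SPACE_ID",
    dataset_id="YOUR_DATASET_ID",
    experiment_name="playground-output-eval",
    task=run_task,
    evaluators=[output_is_non_empty],  # results surface on the Experiments tab
)
```

As with the UI flow, the evaluator scores attach to the experiment, so you can compare runs side by side on the dataset's Experiments tab.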