Evaluate experiments
Evaluate your experiment results in the Arize UI without code
Arize has built-in evals for your traces and your experiments. Compare performance across different prompt templates, models, or configurations — all on the same dataset — to identify what works best.
This guide shows you how to evaluate your LLM outputs against a dataset without code. You can save your playground outputs as an experiment, and then kick off an evaluation task to run comparisons with an LLM as a Judge. For advanced use cases, you can evaluate experiments with code using our SDK.
Evaluate experiments with code

To run your first evaluation on experiments:
1. Go to the Evals & Tasks page and click "New Task" (top-right corner).
2. Select "Dataset" as the evaluation target. Only datasets with at least one experiment are eligible.
3. Choose the experiments you want to evaluate from the dropdown menu.
Once the task runs, its status and aggregate metrics appear on the Experiments tab of the dataset page. You can also follow progress in the task logs.
To view all evaluators associated with a dataset, click the "Evaluators" link at the top of the dataset page. You can re-run an evaluator by selecting "Run on Experiment" from the dropdown next to the evaluator.
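If you prefer the SDK route mentioned above, the sketch below shows roughly what that looks like: define a task and an evaluator function, then run them against a dataset. The client, method, and parameter names here (ArizeDatasetsClient, run_experiment, space_id, dataset_id) are assumptions drawn from the Arize datasets SDK rather than a definitive reference; see the Evaluate experiments with code guide for the authoritative API.

```python
# Minimal sketch, assuming the Arize datasets SDK exposes ArizeDatasetsClient
# and a run_experiment method -- verify names and parameters against the
# "Evaluate experiments with code" guide before using.
from arize.experimental.datasets import ArizeDatasetsClient


def run_task(example):
    # Placeholder task: call your LLM here (or reuse saved playground output).
    return f"model output for {example}"


def output_is_non_empty(output) -> float:
    # Simple evaluator: score 1.0 when the task produced any output, else 0.0.
    return 1.0 if output else 0.0


# Credentials and IDs below are hypothetical placeholders.
client = ArizeDatasetsClient(api_key="YOUR_API_KEY")

client.run_experiment(
    space_id="YOUR_SPACE_ID",
    dataset_id="YOUR_DATASET_ID",
    experiment_name="playground-output-eval",
    task=run_task,
    evaluators=[output_is_non_empty],  # results surface on the Experiments tab
)
```

As with the UI flow, the evaluator scores attach to the experiment, so you can compare runs side by side on the dataset's Experiments tab.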