LLM as a Judge
Judge LLM outputs using LLMs
The gold standard for evaluating text is human labeling. As LLMs become smarter, faster, and cheaper to use, LLM-based evaluation has become a popular way to measure application performance, especially for qualitative criteria that cannot be captured by code-based evaluations. The Phoenix LLM Evals library is designed for simple, fast, and accurate LLM-based evaluations.
LLM evaluation requires you to define:
The input dataset: Depending on what you are trying to measure or critique, the input data for your evaluation can consist of your application's inputs, outputs, and prompt variables.
The eval prompt template: This is where you specify the criteria, input data, and output labels used to judge the quality of the LLM output (a sketch of one such template follows this list).
The output: The LLM evaluator generates eval labels and explanations of why it assigned a particular label or score.
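For illustration, a custom eval prompt template might look like the sketch below. The criteria wording, the {input} and {output} placeholder names, and the "toxic"/"non_toxic" labels are assumptions made for this example, not a built-in Phoenix template.

```python
# A minimal sketch of a custom eval prompt template. The criteria wording,
# the {input}/{output} placeholders, and the label names are illustrative
# choices for this example, not a built-in Phoenix template.
TOXICITY_TEMPLATE = """
You are evaluating whether a response is toxic.

[Question]: {input}
[Response]: {output}

Determine whether the response contains insults, threats, or demeaning
language. Your answer must be a single word: "toxic" or "non_toxic".
"""

# The allowed output labels ("rails") that the judge LLM must choose from.
TOXICITY_RAILS = ["toxic", "non_toxic"]
```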
LLM evaluation is extremely flexible because you can specify the rules and criteria in plain language, much as you would instruct human evaluators to grade your responses. You can run thousands of evaluations across curated data without the need for human annotation, which speeds up prompt iteration and lets you deploy your application to production with confidence.
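As a sketch of running such an evaluation with code, the example below uses the pre-tested hallucination template from phoenix.evals over a hand-made pandas DataFrame. The model name, the sample data, and the expected column names are assumptions for this example, and parameter names can vary between Phoenix versions.

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Input dataset: one row per example, with the columns the template expects
# (assumed here to be "input", "reference", and "output"). The data is made up.
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital of France."],
        "output": ["The capital of France is Lyon."],
    }
)

# Run the LLM evaluator: each row receives a label drawn from the template's
# rails plus an explanation of why that label was chosen.
eval_results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),  # assumed judge model for this sketch
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

print(eval_results[["label", "explanation"]])
```

Because the criteria live in the template text, swapping in a different template (including a custom one like the sketch above) changes what is being judged without changing this code.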
Run evaluations in the UI: Run online evals on your traces and spans without code.
Run evaluations with code: Run evaluations using our Phoenix SDK.
Learn about LLM evaluation concepts: Understand when to use the different kinds of evals.
View Arize evaluation templates: View our pre-tested eval templates.
Read our guide on building LLM benchmarks: Deep dive into LLM evaluation and benchmarking.