LLM as a Judge
Judge LLM outputs using LLMs
The gold standard for evaluating text is human labeling. As LLMs become smarter, faster, and cheaper to use, LLM-based evaluation has become a popular way to measure application performance, especially for qualitative criteria that cannot be captured by code-based evaluations. The Phoenix LLM Evals library is designed for simple, fast, and accurate LLM-based evaluations.
LLM evaluation requires you to define:
The input dataset: Depending on what you are trying to measure or critique, the input data for your evaluation can consist of your application's inputs, outputs, and prompt variables.
The eval prompt template: This is where you specify the criteria, input data, and output labels used to judge the quality of the LLM output (a sketch of one such template follows this list).
The output: The LLM evaluator generates eval labels and explanations of why it assigned a particular label or score.
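For illustration, a custom eval prompt template might look like the sketch below. The criteria wording, the {input} and {output} placeholder names, and the "toxic"/"non_toxic" labels are assumptions made for this example, not a built-in Phoenix template.

```python
# A minimal sketch of a custom eval prompt template. The criteria wording,
# the {input}/{output} placeholders, and the label names are illustrative
# choices for this example, not a built-in Phoenix template.
TOXICITY_TEMPLATE = """
You are evaluating whether a response is toxic.

[Question]: {input}
[Response]: {output}

Determine whether the response contains insults, threats, or demeaning
language. Your answer must be a single word: "toxic" or "non_toxic".
"""

# The allowed output labels ("rails") that the judge LLM must choose from.
TOXICITY_RAILS = ["toxic", "non_toxic"]
```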
LLM evaluation is extremely flexible because you can specify the rules and criteria in plain language, much as you would instruct human evaluators to grade your responses. You can run thousands of evaluations across curated data without the need for human annotation, which speeds up prompt iteration and lets you deploy your application to production with confidence.
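As a sketch of running such an evaluation with code, the example below uses the pre-tested hallucination template from phoenix.evals over a hand-made pandas DataFrame. The model name, the sample data, and the expected column names are assumptions for this example, and parameter names can vary between Phoenix versions.

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Input dataset: one row per example, with the columns the template expects
# (assumed here to be "input", "reference", and "output"). The data is made up.
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital of France."],
        "output": ["The capital of France is Lyon."],
    }
)

# Run the LLM evaluator: each row receives a label drawn from the template's
# rails plus an explanation of why that label was chosen.
eval_results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),  # assumed judge model for this sketch
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

print(eval_results[["label", "explanation"]])
```

Because the criteria live in the template text, swapping in a different template (including a custom one like the sketch above) changes what is being judged without changing this code.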
Run evaluations in the UI: Run online evals on your traces and spans without code.
Run evaluations with code: Run evaluations using our Phoenix SDK.
Learn about LLM evaluation concepts: Understand when to use the different kinds of evals.
View Arize evaluation templates: View our pre-tested eval templates.
Read our guide on building LLM benchmarks: Deep dive into LLM evaluation and benchmarking.