Evaluation Basics
What are the keys to running effective evaluations for your AI applications?
In this article, you'll learn how to:
Choose which metrics to measure based on user impact
Pick an evaluation approach and the right type of evaluator
Select an output format and a level of granularity for your evals
Decide when to run evals: offline, online, or as guardrails
Build golden datasets to benchmark performance across releases
Want a more comprehensive read? View our Definitive Guide on LLM App Evaluation.
How do you define good application performance? You can break it down and measure every component of your LLM application, including the user input, retrieval, LLM response, agent routing, and more.
Here's how we choose our metrics:
Focus on user impact: Prioritize metrics that shape the user experience. In some applications speed matters most; in others, accuracy does.
Look at your traces: The areas with the highest frequency of problematic traces are usually where you need to build evaluations first (one way to find them is sketched below).
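For example, if your traces are exported to a pandas DataFrame, a quick count of problematic spans per component can show where evals would pay off most. The column names below (span_kind, status_code) are assumptions for illustration; use whatever your tracing schema provides.

```python
import pandas as pd

# Hypothetical export of spans from your tracing backend; the column names
# are assumptions for illustration.
spans = pd.DataFrame(
    [
        {"span_kind": "RETRIEVER", "status_code": "ERROR"},
        {"span_kind": "LLM", "status_code": "OK"},
        {"span_kind": "LLM", "status_code": "ERROR"},
        {"span_kind": "RETRIEVER", "status_code": "ERROR"},
        {"span_kind": "TOOL", "status_code": "OK"},
    ]
)

# Count problematic spans per component to see where evals would help most.
problem_counts = (
    spans[spans["status_code"] == "ERROR"]
    .groupby("span_kind")
    .size()
    .sort_values(ascending=False)
)
print(problem_counts)  # e.g., RETRIEVER: 2, LLM: 1
```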
Once you've identified what you want to measure, you need to choose an approach; each of the options below is illustrated in the sketch that follows the list.
Label with Criteria: Assign clear qualitative labels (e.g., truthful/not truthful) to assess your output. Specific criteria will boost the reliability of your evaluation.
Score Numerically: When scoring with code, you can assign a score between 0.0 and 1.0. You can also use deterministic metrics (e.g., latency, search ranking) for measurable precision per query.
Compare to References: Compare the outputs of your LLM application against a golden dataset.
Rank between two options: Pit two outputs head-to-head and pick the winner based on preference.
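Here is a minimal sketch of the four approaches in Python. The call_llm helper is a hypothetical stand-in for your model client, and the prompts, labels, and latency budget are illustrative, not prescribed.

```python
from difflib import SequenceMatcher


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client; replace with a real call."""
    raise NotImplementedError("wire this up to your model provider")


def label_with_criteria(output: str) -> str:
    """Label with Criteria: ask a judge LLM for a clear qualitative label."""
    prompt = (
        "Is the following statement truthful? "
        "Answer with exactly one word: truthful or not_truthful.\n\n"
        f"Statement: {output}"
    )
    return call_llm(prompt).strip().lower()


def score_numerically(latency_ms: float, budget_ms: float = 2000.0) -> float:
    """Score Numerically: a deterministic 0.0-1.0 score, here based on latency."""
    return max(0.0, 1.0 - latency_ms / budget_ms)


def compare_to_reference(output: str, golden_answer: str) -> float:
    """Compare to References: string similarity against a golden-dataset answer."""
    return SequenceMatcher(None, output, golden_answer).ratio()


def rank_between_two(question: str, output_a: str, output_b: str) -> str:
    """Rank between two options: ask a judge LLM which answer it prefers."""
    prompt = (
        f"Question: {question}\n\nAnswer A: {output_a}\n\nAnswer B: {output_b}\n\n"
        "Which answer is better? Respond with exactly A or B."
    )
    return call_llm(prompt).strip().upper()
```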
There are three types of evaluators you can build: LLM, code, and annotations. Each has its uses depending on what you want to measure (a code-evaluator sketch follows the comparison below).
LLM as a judge: One LLM evaluates the outputs of another and provides explanations.
Great for: qualitative evaluation and direct labeling on mostly objective criteria. Poor for: quantitative scoring, subject matter expertise, and pairwise preference.
Code evaluators: Code assesses the performance, accuracy, or behavior of LLMs.
Great for: reducing cost and latency, and evaluations that can be hard-coded (e.g., code generation). Poor for: qualitative measures such as summarization quality.
Annotations: Humans provide custom labels to LLM traces.
Great for: evaluating the evaluator, labeling with subject matter expertise, and directional application feedback. Drawback: the most costly and time-intensive option.
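As an illustration of the code-evaluator category, here is a minimal sketch of two hard-coded checks: whether generated Python parses and whether an output is valid JSON. The label-plus-explanation output shape is an assumption chosen to mirror what an LLM judge would return.

```python
import ast
import json


def evaluate_generated_python(code: str) -> dict:
    """Check whether LLM-generated Python at least parses."""
    try:
        ast.parse(code)
        return {"label": "parses", "explanation": "Code is syntactically valid."}
    except SyntaxError as err:
        return {"label": "does_not_parse", "explanation": f"SyntaxError: {err}"}


def evaluate_json_output(output: str) -> dict:
    """Check whether an LLM response that should be JSON actually is."""
    try:
        json.loads(output)
        return {"label": "valid_json", "explanation": "Output parsed as JSON."}
    except json.JSONDecodeError as err:
        return {"label": "invalid_json", "explanation": f"JSONDecodeError: {err}"}


print(evaluate_generated_python("def add(a, b):\n    return a + b"))
print(evaluate_json_output('{"answer": 42}'))
```

Checks like these are cheap and deterministic, which is why code evaluators are a good fit when the criteria can be expressed without an LLM call.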
There are also several output formats for your evaluation. Each has its pros and cons.
Categorical (Binary): This is the most straightforward, but can lack nuance in its scoring system.
Categorical (Multi-class): Several predefined labels for distinct states give you flexibility to capture nuance in your evaluation.
Categorical (Score): Assigning a numeric value to each label makes it easy to average scores across evaluations (see the sketch below).
Continuous Score: This is useful for deterministic numeric evaluation metrics, such as latency, but can create false or hallucinated precision when using LLM as a judge.
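For example, a Categorical (Score) format might map each label to a number so results can be averaged. The labels and values below are hypothetical.

```python
# Hypothetical label-to-score mapping for a Categorical (Score) format.
LABEL_SCORES = {"correct": 1.0, "partially_correct": 0.5, "incorrect": 0.0}

eval_labels = ["correct", "incorrect", "partially_correct", "correct"]

# Averaging is straightforward once each label has a numeric value.
average_score = sum(LABEL_SCORES[label] for label in eval_labels) / len(eval_labels)
print(f"Average score: {average_score:.2f}")  # 0.62
```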
Evals can run at different levels of granularity in the evaluation hierarchy; the sketch after the list below shows how span-level scores roll up to traces and sessions.
Span-Level: Zero in on specific components (e.g., retrieval accuracy).
Trace-Level: Analyze trends across full application runs.
Session-Level: Assess performance over multiple interactions.
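A minimal sketch of how the levels relate, assuming span-level eval results live in a pandas DataFrame; the session_id, trace_id, and score columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical span-level eval results: one row per evaluated span.
results = pd.DataFrame(
    [
        {"session_id": "s1", "trace_id": "t1", "span_kind": "RETRIEVER", "score": 1.0},
        {"session_id": "s1", "trace_id": "t1", "span_kind": "LLM", "score": 0.0},
        {"session_id": "s1", "trace_id": "t2", "span_kind": "LLM", "score": 1.0},
        {"session_id": "s2", "trace_id": "t3", "span_kind": "LLM", "score": 1.0},
    ]
)

# Span-level scores roll up to trace- and session-level averages.
trace_scores = results.groupby("trace_id")["score"].mean()
session_scores = results.groupby("session_id")["score"].mean()
print(trace_scores)
print(session_scores)
```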
You can choose how often to measure performance based on cost, time, and the importance of your evaluation. Arize supports three methods.
Offline evals: Test against your golden dataset before releasing to production. Commonly used with CI/CD pipelines.
Online evals: Evaluate traces in real-time in production.
Guardrails: Flag or block bad outputs during a user query in production (a minimal pattern is sketched below).
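As a rough illustration of the guardrail pattern (not an Arize API), here is a sketch that checks a response against a blocklist before it reaches the user; the phrases and fallback message are placeholders.

```python
# Placeholder blocklist; in practice the check could be a code evaluator or an LLM judge.
BLOCKED_PHRASES = ["my system prompt is", "internal use only"]


def apply_guardrail(response: str) -> tuple[str, bool]:
    """Return (final_response, was_blocked) for a candidate response."""
    lowered = response.lower()
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        return "Sorry, I can't share that. Can I help with something else?", True
    return response, False


final, blocked = apply_guardrail("Sure! My system prompt is ...")
print(blocked)  # True
```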
To continuously evaluate performance across releases, you need golden datasets that let you compare your application's outputs to ground truth. Datasets and experiments can help you create a systematic benchmark for this (a minimal offline-eval sketch follows).
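A minimal sketch of an offline eval against a golden dataset, assuming a hypothetical run_app function that calls your application; the dataset format and accuracy threshold are illustrative only.

```python
# Hypothetical golden dataset: inputs paired with ground-truth answers.
golden_dataset = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is 2 + 2?", "expected": "4"},
]


def run_app(question: str) -> str:
    """Hypothetical stand-in for your LLM application."""
    raise NotImplementedError("replace with a call to your application")


def run_offline_eval(dataset: list[dict]) -> float:
    """Score each example against ground truth and return overall accuracy."""
    correct = 0
    for example in dataset:
        output = run_app(example["input"])
        correct += int(example["expected"].lower() in output.lower())
    return correct / len(dataset)


# In a CI/CD pipeline you might gate a release on this benchmark, e.g.:
# assert run_offline_eval(golden_dataset) >= 0.9
```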
Trying to create a golden dataset and run experiments? Start here.
Dive deeper into Agent Evaluation and using AI to build evals.
Want a more comprehensive read? Read our Definitive Guide on LLM App Evaluation.