Evaluation Basics
What are the keys to running effective evaluations for your AI applications?
Application performance is not a singular metric. It needs to be comprehensively measured across every part of your application. Each part of an LLM app or agent can be evaluated, including the user input, the retrieval step, the LLM response, the agent router, and more.
Start by measuring what directly impacts the user experience of your application. Sometimes it is okay to be wrong if the user can easily ignore the suggested output, but it is not okay to be slow.
Once you've identified what you want to measure, you need to choose an approach. The most common approaches are:
Direct labeling using criteria — The easiest way to get started. You assign a qualitative label, such as truthful vs. not truthful (see the sketch after this list). Each criterion should assess one aspect at a time; the more specific the criteria, the more reliable the output will be.
Numeric scoring — When performance can be measured deterministically, evaluate quality with a numeric metric per query, such as search ranking or latency.
Reference-based comparison — Build a golden dataset of outputs you know are good, then compare them against the outputs generated by your LLM to determine a score.
Pairwise preference — You put two outputs against each other and ask which one the evaluator prefers.
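As an example of the direct labeling approach, here is a minimal sketch of an LLM-as-a-judge evaluator. It calls the OpenAI Python client directly rather than Arize's eval library, and the prompt template, judge model, and labels are illustrative, but the core pattern is the same: a constrained prompt that returns exactly one label per record.

```python
# Minimal direct-labeling sketch. Assumes the OpenAI Python client and an
# OPENAI_API_KEY in the environment; the template and labels are illustrative.
from openai import OpenAI

client = OpenAI()

EVAL_TEMPLATE = """You are evaluating whether an answer is truthful.

[Question]: {question}
[Answer]: {answer}

Respond with exactly one word: "truthful" or "untruthful"."""

def label_truthfulness(question: str, answer: str) -> str:
    """Ask the judge model for a single qualitative label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model; swap in your own
        messages=[{"role": "user", "content": EVAL_TEMPLATE.format(
            question=question, answer=answer)}],
        temperature=0.0,  # keep the judge deterministic
    )
    return response.choices[0].message.content.strip().lower()

print(label_truthfulness(
    "What year did Apollo 11 land on the Moon?",
    "Apollo 11 landed on the Moon in 1969.",
))  # -> "truthful"
```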
Evaluations can return results in several formats:
Categorical (Binary): The evaluation results in a binary output, such as true/false or yes/no, which can easily be represented as 1/0. This simplicity makes it straightforward to use in decision-making, but it cannot capture nuanced judgements.
Categorical (Multi-class): The evaluation results in one of several predefined categories or classes, which could be text labels or distinct numbers representing different states or types.
Continuous Score: The evaluation results in a numeric value within a set range (e.g. 1-10), offering a scale of measurement. We don't recommend this approach, since LLM judges tend to produce unreliable, poorly calibrated scores on continuous ranges.
Categorical Score: A value of either 1 or 0. Categorical scores are useful because you can average them across a dataset without the reliability issues of a continuous range (see the sketch below).
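A short sketch of why categorical scores aggregate cleanly: map each label to 1 or 0 and average across the dataset. The label mapping and records below are hypothetical.

```python
# Aggregate categorical (binary) eval results into a single dataset-level score.
LABEL_TO_SCORE = {"truthful": 1, "untruthful": 0}

eval_labels = ["truthful", "truthful", "untruthful", "truthful"]

scores = [LABEL_TO_SCORE[label] for label in eval_labels]
accuracy = sum(scores) / len(scores)
print(f"truthful rate: {accuracy:.2f}")  # -> 0.75
```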
There are three types of evaluators you can build — LLM-as-a-judge, code-based, and human annotations. Each has its uses depending on what you want to measure.
| Evaluator | How it works | Use case |
| --- | --- | --- |
| LLM-as-a-judge | One LLM evaluates the outputs of another and provides explanations | Great for qualitative evaluation and direct labeling on mostly objective criteria. Poor for quantitative scoring, subject matter expertise, and pairwise preference |
| Code-based | Code assesses the performance, accuracy, or behavior of LLMs | Great for reducing cost, latency, and evaluation that can be hard-coded, e.g. code generation (see the sketch below). Poor for qualitative measures such as summarization quality |
| Human annotations | Humans provide custom labels to LLM traces | Great for evaluating the evaluator, labeling with subject matter expertise, and directional application feedback. The most costly and time-intensive option |
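To make the code-based row concrete, here is a hedged sketch of a code evaluator for a code-generation task: it checks whether the generated Python parses and whether it defines a required function. The record and function names are hypothetical; real code evals might also run unit tests or check exact matches.

```python
import ast

def evaluate_generated_code(code: str, required_function: str) -> dict:
    """Code-based evaluator: deterministic checks, no LLM involved."""
    try:
        tree = ast.parse(code)  # does the output even parse?
    except SyntaxError:
        return {"parses": 0, "has_required_function": 0}

    function_names = {
        node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
    }
    return {
        "parses": 1,
        "has_required_function": int(required_function in function_names),
    }

# Hypothetical LLM output being scored.
generated = "def add(a, b):\n    return a + b\n"
print(evaluate_generated_code(generated, required_function="add"))
# -> {'parses': 1, 'has_required_function': 1}
```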
Evals can occur at different levels of the evaluation hierarchy (see the sketch after this list).
Span-level evaluations assess the performance of specific components within an application's response.
Trace-level evaluations assess a single end-to-end run of the application, across all of its spans.
Session-level evaluations expand the scope to multiple interactions, such as a multi-turn conversation.
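A rough sketch of how the three levels relate, using plain dictionaries rather than Arize's actual trace data model: span-level evals attach to individual spans, a trace-level eval scores one full run, and a session-level eval aggregates across traces. All names and scores here are illustrative.

```python
# Illustrative structure only; not the real Arize trace schema.
session = {
    "traces": [
        {   # one full run of the app = one trace
            "spans": [
                {"name": "retrieval", "eval": {"relevance": 1}},    # span-level
                {"name": "llm_response", "eval": {"truthful": 1}},  # span-level
            ],
            "eval": {"answered_correctly": 1},  # trace-level
        },
        {
            "spans": [
                {"name": "retrieval", "eval": {"relevance": 0}},
                {"name": "llm_response", "eval": {"truthful": 0}},
            ],
            "eval": {"answered_correctly": 0},
        },
    ],
}

# Session-level eval: was the user's goal met at any point in the interaction?
trace_scores = [t["eval"]["answered_correctly"] for t in session["traces"]]
session["eval"] = {"goal_achieved": int(any(trace_scores))}
print(session["eval"])  # -> {'goal_achieved': 1}
```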
You can choose how often to measure performance based on cost, time, and importance. We have three methods available in Arize.
Offline evals: Run evaluations of your application against a golden dataset before releasing to production. Commonly used in CI/CD pipelines (see the sketch after this list).
Online evals: Evaluate traces in production and track performance in real time.
Guardrails: Run real-time checks to block or flag bad outputs in production.
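A hedged sketch of the offline pattern: run an evaluator over a small golden dataset and fail the CI job if the aggregate score drops below a threshold. `run_app`, `evaluate`, the dataset, and the 0.9 threshold are all stand-ins, not part of the Arize SDK.

```python
import sys

# Hypothetical golden dataset: inputs with known-good expected outputs.
GOLDEN_DATASET = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

def run_app(user_input: str) -> str:
    """Stand-in for your LLM application; replace with a real call."""
    canned = {
        "What is 2 + 2?": "2 + 2 equals 4.",
        "Capital of France?": "The capital of France is Paris.",
    }
    return canned.get(user_input, "")

def evaluate(output: str, expected: str) -> int:
    """Stand-in evaluator returning a 1/0 categorical score."""
    return int(expected.lower() in output.lower())

def main(threshold: float = 0.9) -> None:
    scores = [evaluate(run_app(row["input"]), row["expected"])
              for row in GOLDEN_DATASET]
    score = sum(scores) / len(scores)
    print(f"offline eval score: {score:.2f}")
    if score < threshold:
        sys.exit(1)  # non-zero exit fails the CI/CD pipeline

if __name__ == "__main__":
    main()
```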
Learn more about evaluation concepts by reading our definitive guide on LLM app evaluation.
Dive deeper into Agent Evaluation and using AI to build evals.