Evaluation
Grading the performance of your LLM application
To ensure application performance, you need a way to judge the quality of your LLM outputs.
Without evals, AI engineers can't tell whether changing a prompt, an LLM parameter, an agentic loop, or a retrieval step will improve performance or break a use case.
With evals, every time you adjust your prompts, agents, or retrieval, you learn whether your application's performance actually improved.
There are three types of evaluators you can build: LLM-as-judge, code-based, and human annotations. Each has its uses depending on what you want to measure.
LLM as a judge: one LLM evaluates the outputs of another and provides explanations.
Great for: qualitative evaluation and direct labeling on mostly objective criteria.
Poor for: quantitative scoring, subject-matter expertise, and pairwise preference.

Code-based: code assesses the performance, accuracy, or behavior of LLM outputs.
Great for: reducing cost and latency, and for evaluation that can be hard-coded (e.g., code generation).
Poor for: qualitative measures such as summarization quality.

Human annotation: humans provide custom labels on LLM traces.
Great for: evaluating the evaluator, labeling with subject-matter expertise, and directional application feedback.
Tradeoff: the most costly and time-intensive option.
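The code-based option can be as simple as a deterministic check. Below is a minimal sketch (the function name and labels are illustrative) of a code evaluator for a code-generation use case: it labels the model's output by whether it parses as valid Python, with no LLM call needed.

```python
def evaluate_generated_code(output: str) -> str:
    """Code-based evaluator: label generated Python as 'valid' or
    'invalid' by checking that it parses. Deterministic, cheap, fast."""
    try:
        compile(output, "<llm-output>", "exec")
        return "valid"
    except SyntaxError:
        return "invalid"
```

Because the check is pure code, it costs nothing per run and can score thousands of outputs in seconds, which is exactly where code evaluators beat LLM judges.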
Wherever you find the highest frequency of problematic traces is where you should start building evaluations. We have many pre-built templates for common evaluation cases such as user frustration, hallucination, agent planning, and function calling.
You can set up these evaluations to create a set of metrics that measure your application's performance. In the example below, we judge the quality of a customer support chatbot on relevance, hallucination, and latency.
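As a sketch, such a metric set could be assembled from per-trace eval results like this (the trace data and field names are hypothetical):

```python
# Hypothetical per-trace eval results for a support chatbot, combining an
# LLM relevance label, an LLM hallucination label, and a measured latency.
traces = [
    {"relevance": "relevant", "hallucination": "factual", "latency_s": 1.2},
    {"relevance": "irrelevant", "hallucination": "hallucinated", "latency_s": 4.8},
    {"relevance": "relevant", "hallucination": "factual", "latency_s": 0.9},
]

metrics = {
    # Fraction of traces judged relevant by the LLM evaluator.
    "pct_relevant": sum(t["relevance"] == "relevant" for t in traces) / len(traces),
    # Fraction of traces judged free of hallucination.
    "pct_factual": sum(t["hallucination"] == "factual" for t in traces) / len(traces),
    # Median response latency across traces.
    "p50_latency_s": sorted(t["latency_s"] for t in traces)[len(traces) // 2],
}
```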
LLM evaluation requires you to define:
The input data: Depending on what you are trying to measure or critique, the input data to your evaluation can consist of your application's input, output, and prompt variables.
The eval prompt template: This is where you specify the criteria, input data, and output labels used to judge the quality of the LLM output.
The output: The LLM evaluator generates eval labels and explanations showing why it assigned a certain label or score.
The aggregate metric: When you run thousands of evaluations across a large dataset, aggregate metrics summarize the quality of your responses over time, across different prompts, retrieval strategies, and LLMs.
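Putting those four pieces together, a minimal LLM-as-judge pipeline might look like the following sketch. Here `call_llm` is a stand-in for your model client, and the template wording and function names are illustrative, not a prescribed API.

```python
# Eval prompt template: criteria, input data slots, and output labels.
EVAL_TEMPLATE = """You are grading a chatbot answer for relevance.

Question: {query}
Answer: {answer}

Reply with exactly one word, "relevant" or "irrelevant", on the first line,
followed by a short explanation on the next line."""


def judge(call_llm, query, answer):
    """Run one evaluation: fill the template with the input data, call the
    judge LLM, and split its reply into a label and an explanation."""
    raw = call_llm(EVAL_TEMPLATE.format(query=query, answer=answer))
    label, _, explanation = raw.partition("\n")
    return label.strip().lower(), explanation.strip()


def aggregate(labels):
    """Aggregate metric: fraction of responses judged relevant."""
    return sum(label == "relevant" for label in labels) / len(labels)
```

Swapping in a different template or model changes the evaluator without touching the surrounding pipeline, which is what makes iterating on eval criteria cheap.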
LLM evaluation is extremely flexible, because you can specify the rules and criteria in mostly plain language, similar to how you would ask human evaluators to grade your responses. You can run thousands of evaluations across curated data without the need for human annotation. This speeds up your prompt iteration and ensures you can deploy your applications to production with confidence.
You can adjust an existing template or build your own from scratch. Experiment with different models and LLM parameters. Be explicit about the following:
What is the input? In our example, it is the retrieved documents/context and the user's query.
What are we asking? In our example, we ask the LLM to tell us whether the document is relevant to the query.
What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).
The more specific you are about how to classify or grade a response, the more accurate your LLM evaluation will be. Here is an example of a custom template that classifies a response to a question as positive or negative.
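Such a template might look like the following sketch; the wording and delimiters are illustrative, not a built-in template.

```python
# Illustrative custom eval template: classify a response's tone as
# "positive" or "negative". The {question} and {response} slots are
# filled from your application's trace data at eval time.
POSITIVITY_TEMPLATE = """You are evaluating the tone of a response to a question.

[BEGIN DATA]
Question: {question}
Response: {response}
[END DATA]

Classify the response above as "positive" or "negative".
Respond with a single word and nothing else."""

filled = POSITIVITY_TEMPLATE.format(
    question="How was your support experience?",
    response="Great, the agent resolved my issue quickly.",
)
```

Pinning the output to a fixed label set ("a single word and nothing else") keeps the judge's answers machine-parseable.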
Next, benchmark your evaluation template against your own data. The golden dataset should include "ground truth" labels so you can measure the performance of the LLM eval template. These labels often come from human feedback.
Building such a dataset is laborious, but you can often find standardized datasets for common use cases. Then, run the eval across your golden dataset and generate metrics (overall accuracy, precision, recall, F1, etc.) to determine your benchmark.
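Given the eval's predicted labels and the golden labels, those benchmark metrics can be computed directly. A minimal sketch (function name illustrative):

```python
def benchmark(predicted, gold, positive="relevant"):
    """Compare LLM-eval labels against golden 'ground truth' labels and
    return accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == positive and g == positive for p, g in pairs)  # true positives
    fp = sum(p == positive and g != positive for p, g in pairs)  # false positives
    fn = sum(p != positive and g == positive for p, g in pairs)  # false negatives
    accuracy = sum(p == g for p, g in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

If precision or recall on the golden dataset is low, revise the eval template (criteria, label definitions, examples) and re-benchmark before trusting the evaluator at scale.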