Concepts: LLM Evaluation
Grading the performance of your LLM application
The purpose of evaluation
It’s extremely easy to start building an AI application with LLMs because developers no longer have to collect labeled data or train a model; they only need to write a prompt asking the model for what they want. However, this comes with tradeoffs. LLMs are generalized models that aren’t fine-tuned for a specific task. With a standard prompt, these applications demo really well, but in production environments they often fail in more complex scenarios.
To ensure that your application performs well, you need a way to judge the quality of your LLM outputs. You can do this in a variety of ways, for example by evaluating relevance, hallucination rate, and latency.
Evaluation tells you whether your application has improved, and by how much, when you adjust your prompts or retrieval strategy. When running evaluations, it's important to select a comprehensive dataset: the dataset determines how trustworthy and generalizable your evaluation metrics are to production use. A limited dataset can score highly on evaluation metrics yet perform poorly in real-world scenarios.
Creating an LLM evaluation
There are four components to an LLM evaluation:
The input data: Depending on what you are trying to measure or critique, the input data for your evaluation can consist of your application's input, output, and prompt variables.
The eval prompt template: This is where you specify your criteria, input data, and output labels used to judge the quality of the LLM output.
The output: The LLM evaluator generates eval labels and explanations that show why it assigned a particular label or score.
The aggregate metric: When you run thousands of evaluations across a large dataset, you can use aggregate metrics to summarize the quality of your responses over time and across different prompts, retrieval strategies, and LLMs.
LLM evaluation is extremely flexible because you can specify the rules and criteria in mostly plain language, much as you would instruct human evaluators to grade your responses.
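To make the components above concrete, here is a minimal sketch of an LLM-as-a-judge relevance evaluation in Python. It assumes the OpenAI Python SDK as the judge client, and the template, function names, and sample dataset are illustrative placeholders; swap in whatever model client and criteria fit your application.

```python
# Minimal sketch of an LLM-as-a-judge evaluation. The template, function names,
# and sample data are illustrative assumptions, not a specific library's API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The eval prompt template: criteria, input data slots, and allowed output labels.
RELEVANCE_TEMPLATE = """You are grading whether a response is relevant to a question.

Question: {question}
Response: {response}

On the first line, answer with exactly one word: "relevant" or "irrelevant".
On the second line, give a one-sentence explanation for your answer."""


def evaluate_relevance(question: str, response: str) -> tuple[str, str]:
    """The output: ask the judge model for an eval label and an explanation."""
    prompt = RELEVANCE_TEMPLATE.format(question=question, response=response)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = completion.choices[0].message.content.strip()
    label, _, explanation = text.partition("\n")
    return label.strip().strip(".").lower(), explanation.strip()


# The input data and the aggregate metric: run the eval over a small dataset of
# application inputs/outputs and summarize the labels into a single number.
dataset = [
    {"question": "What is the capital of France?", "response": "Paris."},
    {"question": "What is the capital of France?", "response": "I like turtles."},
]
labels = [evaluate_relevance(row["question"], row["response"])[0] for row in dataset]
relevance_rate = labels.count("relevant") / len(labels)
print(f"Relevance rate: {relevance_rate:.0%}")
```

In practice you would run this over a much larger dataset and track the aggregate metric as you change prompts, retrieval strategies, or models.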
There are several other ways of evaluating your application, including human-labeled feedback and code-based heuristics. Depending on what you're trying to evaluate, these approaches may be better suited to your application.
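For comparison, a code-based heuristic evaluation needs no judge model at all: it scores outputs with deterministic checks. The checks below (keyword presence, length, placeholder text) are illustrative assumptions about what might matter for a particular application, not a standard set.

```python
# Minimal sketch of a code-based heuristic evaluation; the specific checks are
# illustrative assumptions, not a standard set.
import re


def heuristic_eval(output: str, expected_keywords: list[str]) -> dict[str, bool]:
    """Score an LLM output with deterministic checks instead of a judge model."""
    return {
        "contains_keywords": all(k.lower() in output.lower() for k in expected_keywords),
        "within_length": len(output) <= 500,  # e.g. a UI length constraint
        "no_placeholder_text": re.search(r"\b(lorem ipsum|TODO)\b", output, re.I) is None,
    }


print(heuristic_eval("Paris is the capital of France.", ["Paris"]))
# {'contains_keywords': True, 'within_length': True, 'no_placeholder_text': True}
```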
Read our research on task-based evaluation and our best practices guide to learn more!