Best Practices for LLM Evaluation
Here are some best practices specifically for evaluating large language models.
Evaluating tasks performed by large language models is particularly challenging. An LLM's output can be repetitive, grammatically incorrect, excessively lengthy, or incoherent. Traditional methods like rule-based assessment or similarity metrics (e.g., ROUGE, BLEU) fall short when evaluating performance.
To address this, "LLM as a Judge" uses an LLM to evaluate the output. Step by step:
Identify your evaluation criteria
Create a prompt template
Select an appropriate LLM for evaluation
Generate your evaluations and view your results
Read more in our Phoenix guide here.
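Putting the steps together, a minimal sketch using Phoenix might look like the following. It assumes the phoenix.evals module with its built-in relevance template and rails map; exact names and keyword arguments can vary between Phoenix versions.

```python
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# 1. Evaluation criteria: was the retrieved document relevant to the user's query?
df = pd.DataFrame(
    {
        "input": ["How do I reset my password?"],
        "reference": ["Open Settings > Account > Reset password to change your password."],
    }
)

# 2. Prompt template: Phoenix ships a relevance template; you can also write your own.
template = RAG_RELEVANCY_PROMPT_TEMPLATE

# 3. Evaluation LLM: the judge can be a different model from the one in your application.
model = OpenAIModel(model="gpt-4")

# 4. Generate evaluations; rails constrain the output to the allowed categorical labels.
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
relevance_evals = llm_classify(
    dataframe=df,
    template=template,
    model=model,
    rails=rails,
)
print(relevance_evals["label"])
```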
There are multiple types of evaluations that we support, each categorized by its output type.
Although score evals are an option in Phoenix, we recommend using categorical evaluations in production environments. LLMs often struggle with the subtleties of continuous scales, leading to inconsistent results even with slight prompt modifications or across different models. Repeated tests have shown that scores can fluctuate significantly, which is problematic when evaluating at scale.
To explore the full analysis behind our recommendation and understand the limitations of score-based evaluations, check out our research on LLM eval data types.
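For reference, here is the same relevance criterion expressed as the two output types; the wording is illustrative.

```python
# Categorical eval: the judge must answer with one of a fixed set of labels.
# This is the form we recommend for production.
categorical_rails = ["relevant", "irrelevant"]
categorical_instruction = "Answer with a single word: relevant or irrelevant."

# Score eval: the judge returns a number on a continuous scale. Scores like this
# tend to fluctuate across prompt wordings and models.
score_instruction = "Rate the relevance of the document on a scale from 1 to 10."
```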
In many cases it can be hard to understand why an LLM responds in a specific way. Explanations show why the LLM decided on a specific score for your evaluation criteria, and may even improve the accuracy of the evaluation.
You can activate this by passing provide_explanation=True to the llm_classify and llm_generate functions.
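Continuing the sketch above, adding the flag looks like this; the justification comes back as an extra column in the results.

```python
# Ask the judge to explain each label; the justification is returned alongside
# the label (as an "explanation" column in recent Phoenix versions).
relevance_evals = llm_classify(
    dataframe=df,
    template=template,
    model=model,
    rails=rails,
    provide_explanation=True,
)
print(relevance_evals[["label", "explanation"]])
```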
View our colab notebook below for more details.
It's important to select the right LLM for the evaluation of your application. This could be a different LLM from the one you are using for your application. For example, you may be using Llama for your application and GPT-4 for your eval. Often this choice is influenced by questions of cost and accuracy.
You can adjust an existing template or build your own from scratch. Be explicit about the following:
What is the input? In our example, it is the retrieved documents/context and the query from the user.
What are we asking? In our example, we’re asking the LLM to tell us whether the document was relevant to the query.
What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).
The more specific you are about how to classify or grade a response, the more accurate your LLM evaluation will become. Here is an example of a custom template which classifies a response to a question as positive or negative.
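One possible version of such a template is sketched below. The exact wording and the {question}/{response} placeholders are illustrative; the placeholders should match the column names in your own data.

```python
TONE_EVAL_TEMPLATE = """
You are examining a response to a question and deciding whether its tone is
positive or negative.

[BEGIN DATA]
************
[Question]: {question}
************
[Response]: {response}
************
[END DATA]

Read the question and response carefully. Classify the tone of the response as
either "positive" or "negative". Answer with a single word, positive or negative,
and no other text.
"""
```

Passing rails such as ["positive", "negative"] alongside this template to llm_classify keeps the output constrained to those two labels.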
To benchmark your LLM evaluator, you need to generate a golden dataset. This should be representative of the type of data you expect to see in your application. The golden dataset should have the “ground truth” label so that we can measure the performance of the LLM eval template. Often such labels come from human feedback.
Building such a dataset is laborious, but you can often find standardized datasets for common use cases. Then, run the eval across your golden dataset and generate metrics (overall accuracy, precision, recall, F1, etc.) to determine your benchmark.
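As a sketch of that last step, assuming a small human-labeled golden dataset and scikit-learn for the metrics:

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical golden dataset: each row carries the human "ground truth" label.
golden_df = pd.DataFrame(
    {
        "query": [
            "How do I reset my password?",
            "What is the capital of France?",
            "How do I export my data?",
        ],
        "true_label": ["relevant", "irrelevant", "relevant"],
    }
)

# Labels produced by running the LLM eval template over the same rows
# (for example, the "label" column returned by llm_classify).
predicted_labels = ["relevant", "irrelevant", "irrelevant"]

# Overall accuracy plus per-class precision, recall, and F1 against the ground truth.
print(classification_report(golden_df["true_label"], predicted_labels))
```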