Building Eval Templates
It's important to select the right LLM to evaluate your application. This can be a different LLM than the one powering the application itself; for example, you might use Llama in your application and GPT-4 for your eval. The choice is often driven by considerations of cost and accuracy.
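If you are using Phoenix's evals library, the evaluator model can be configured independently of the model serving your application. Below is a minimal sketch, assuming the phoenix.evals OpenAIModel wrapper (parameter names may differ across versions):

```python
from phoenix.evals import OpenAIModel

# The model used to run evaluations; it does not need to match the model
# serving your application (e.g., Llama).
eval_model = OpenAIModel(
    model="gpt-4",
    temperature=0.0,  # deterministic output makes eval labels more reproducible
)
```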
You can adjust an existing template or build your own from scratch. Be explicit about the following:
What is the input? In our example, it is the documents/context that was retrieved and the query from the user.
What are we asking? In our example, we're asking the LLM to tell us whether the document was relevant to the query.
What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).
The more specific you are about how to classify or grade a response, the more accurate your LLM evaluation will become. Here is an example of a custom template which classifies a response to a question as positive or negative.
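The sketch below shows one way such a template could be written as a Python string; the {question} and {response} placeholders are illustrative and should match the column names in your data:

```python
# A hypothetical custom eval template. The placeholders in curly braces are
# filled in with each row of your data at evaluation time.
CUSTOM_POSITIVITY_TEMPLATE = """
You are evaluating the tone of a response to a question.

[BEGIN DATA]
[Question]: {question}
[Response]: {response}
[END DATA]

Determine whether the response is positive or negative in tone.
Your answer must be a single word, either "positive" or "negative",
with no other text or punctuation.
"""
```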
To benchmark your LLM evaluator, you need to generate a golden dataset. This should be representative of the type of data you expect to see in your application. The golden dataset should have the "ground truth" label so that you can measure the performance of the LLM eval template. Often such labels come from human feedback.
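As an illustration, a golden dataset for the positive/negative template above might be a small table of questions, responses, and human-provided labels (the column names here are hypothetical):

```python
import pandas as pd

# A toy golden dataset with human-provided "ground truth" labels.
# In practice this would be much larger and drawn from real traffic.
golden_df = pd.DataFrame(
    {
        "question": [
            "How was your experience with support?",
            "Did the product meet your expectations?",
        ],
        "response": [
            "The team was fantastic and resolved my issue quickly.",
            "Honestly, it broke after two days and nobody helped.",
        ],
        "ground_truth": ["positive", "negative"],
    }
)
```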
Building such a dataset is laborious, but you can often find standardized datasets for common use cases. Then, run the eval across your golden dataset and generate metrics (overall accuracy, precision, recall, F1, etc.) to determine your benchmark.
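A sketch of the benchmarking step, building on the objects defined above and assuming the phoenix.evals llm_classify helper plus scikit-learn for the metrics (swap in whichever eval runner and metric functions you actually use):

```python
from phoenix.evals import llm_classify
from sklearn.metrics import classification_report

rails = ["positive", "negative"]  # the only labels the evaluator may emit

# Run the eval template over the golden dataset with the evaluator model.
eval_results = llm_classify(
    dataframe=golden_df,
    model=eval_model,
    template=CUSTOM_POSITIVITY_TEMPLATE,
    rails=rails,
)

# Compare the LLM evaluator's labels against the human ground truth to get
# accuracy, precision, recall, and F1 for the template.
print(
    classification_report(
        y_true=golden_df["ground_truth"],
        y_pred=eval_results["label"],
        labels=rails,
    )
)
```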