Best Practices for LLM Evaluation


Here are some best practices specifically with evaluations for large language models.

LLM as a Judge

Figure: A brief description of how LLMs work as evaluators.

Evaluating tasks performed by large language models is particularly challenging. An LLM's output can be repetitive, grammatically incorrect, excessively lengthy, or incoherent. Traditional methods like rule-based assessment or similarity metrics (e.g., ROUGE, BLEU) measure surface overlap with a reference text, so they fall short when evaluating qualities like these.

To address this, "LLM as a Judge" uses an LLM to evaluate the output. Step by step:

  1. Identify your evaluation criteria

  2. Create a prompt template

  3. Select an appropriate LLM for evaluation

  4. Generate your evaluations and view your results

Read more in our Phoenix guide here.
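The four steps above can be sketched in plain Python. This is a minimal illustration, not the Phoenix implementation: `stub_judge` is a hypothetical stand-in for a real LLM call so the example runs offline, and the template, labels, and rows are invented for the sketch.

```python
from collections import Counter
from typing import Callable

# Step 1: the criterion, expressed as the set of allowed labels.
LABELS = ("relevant", "irrelevant")

# Step 2: a prompt template with placeholders for the data being judged.
TEMPLATE = (
    "You are judging whether a document answers a question.\n"
    "[Question]: {query}\n"
    "[Document]: {document}\n"
    'Answer with a single word, either "relevant" or "irrelevant".'
)


def stub_judge(prompt: str) -> str:
    """Step 3 placeholder: a hypothetical stand-in for a real LLM call.

    It only checks whether "refund" appears after the [Document] marker,
    which is enough to let the example run without a model.
    """
    document = prompt.split("[Document]:", 1)[1]
    return "relevant" if "refund" in document.lower() else "irrelevant"


def evaluate(rows: list[dict], judge: Callable[[str], str]) -> list[str]:
    """Step 4: format one prompt per row and collect the judge's labels."""
    labels = []
    for row in rows:
        raw = judge(TEMPLATE.format(**row)).strip().lower()
        # Anything outside the allowed labels is flagged rather than guessed.
        labels.append(raw if raw in LABELS else "unparseable")
    return labels


rows = [
    {"query": "How do I get a refund?", "document": "Refunds are issued within 5 days."},
    {"query": "How do I get a refund?", "document": "Our office is closed on Sundays."},
]
labels = evaluate(rows, stub_judge)
print(Counter(labels))
```

In a real setup you would swap `stub_judge` for a function that sends the prompt to the evaluation model you selected and returns its text response.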

Eval Data Types

We support multiple types of evaluations. Each type is categorized by its output.

Although score evals are an option in Phoenix, we recommend using categorical evaluations in production environments. LLMs often struggle with the subtleties of continuous scales, leading to inconsistent results even with slight prompt modifications or across different models. Repeated tests have shown that scores can fluctuate significantly, which is problematic when evaluating at scale.

To explore the full analysis behind our recommendation and understand the limitations of score-based evaluations, check out our research on LLM eval data types.
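One way to check this kind of consistency on your own data is to run the same eval several times and count how often the outputs agree. The sketch below assumes you have already collected the repeated outputs; the function names and the toy values are invented for illustration and do not come from any measurement.

```python
from collections import Counter


def agreement_rate(runs_per_example: list[list[str]]) -> float:
    """Fraction of examples where every repeated run produced the same output."""
    if not runs_per_example:
        raise ValueError("need at least one example")
    stable = sum(1 for runs in runs_per_example if len(set(runs)) == 1)
    return stable / len(runs_per_example)


def majority_label(runs: list[str]) -> str:
    """Most common output across the repeated runs of one example."""
    return Counter(runs).most_common(1)[0][0]


# Toy outputs from three repeated runs on two examples (invented values).
runs = [
    ["relevant", "relevant", "relevant"],
    ["relevant", "irrelevant", "relevant"],
]
print(agreement_rate(runs))
print(majority_label(runs[1]))
```

The same functions work on score outputs if you pass the scores as strings, which makes it possible to compare a categorical and a score version of the same eval side by side.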

Evals with Explanations

Figure: An example of an explanation for an evaluation.

In many cases it is hard to understand why an LLM responds in a specific way. Explanations show why the LLM chose a particular label or score for your evaluation criteria, and may even improve the accuracy of the evaluation.

You can enable explanations by passing provide_explanation=True to the llm_classify and llm_generate functions.
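To show what an explanation adds to the result, here is a minimal sketch of splitting a judge response into a label and its reasoning. The response format (an "EXPLANATION:" section followed by a final "LABEL:" line) and the `parse_response` helper are assumptions made for this example. They are not the library's internal format or API.

```python
# Allowed labels for this hypothetical eval.
RAILS = ("positive", "negative")


def parse_response(raw: str) -> tuple[str, str]:
    """Split a judge response into (label, explanation).

    Assumes the reasoning comes first and the last "LABEL:" marker
    introduces the final answer.
    """
    explanation, sep, tail = raw.rpartition("LABEL:")
    if not sep:
        # No label marker at all: keep the text, flag the label.
        return "unparseable", raw.strip()
    label = tail.strip().strip('"\'.').lower()
    if label not in RAILS:
        label = "unparseable"
    explanation = explanation.replace("EXPLANATION:", "", 1).strip()
    return label, explanation


raw = 'EXPLANATION: The reply thanks the user and offers help.\nLABEL: "positive"'
label, explanation = parse_response(raw)
print(label)
print(explanation)
```

Keeping the explanation next to the label makes it easier to audit disagreements between the LLM judge and human reviewers.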

See our Colab notebook for more details.

LLM Selection for Evaluation

It's important to select the right LLM for the evaluation of your application. This could be a different LLM from the one you are using for your application. For example, you may be using Llama for your application and GPT-4 for your eval. Often this choice is influenced by questions of cost and accuracy.

Building Great Eval Templates

You can adjust an existing template or build your own from scratch. Be explicit about the following:

  • What is the input? In our example, it is the documents/context that was retrieved and the query from the user.

  • What are we asking? In our example, we’re asking the LLM to tell us if the document was relevant to the query.

  • What are the possible output formats? In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).

The more specific you are about how to classify or grade a response, the more accurate your LLM evaluation will become. Here is an example of a custom template which classifies a response to a question as positive or negative.

MY_CUSTOM_TEMPLATE = '''
    You are evaluating the positivity or negativity of the responses to questions.
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]


    Please focus on the tone of the response.
    Your answer must be a single word, either "positive" or "negative".
    '''
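Before running a custom template at scale, it can help to confirm that every placeholder has a matching field in your data. The sketch below uses only the standard library. The helper names are hypothetical, and the shortened template is a stand-in for the one above.

```python
from string import Formatter

# Shortened stand-in for the custom template shown above.
TEMPLATE = (
    "You are evaluating the positivity or negativity of the responses to questions.\n"
    "[Question]: {question}\n"
    "[Response]: {response}\n"
    'Your answer must be a single word, either "positive" or "negative".'
)


def template_variables(template: str) -> set[str]:
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def fill_template(template: str, row: dict[str, str]) -> str:
    """Format one row, failing early if a placeholder has no value."""
    missing = template_variables(template) - set(row)
    if missing:
        raise KeyError(f"row is missing template variables: {sorted(missing)}")
    return template.format(**row)


row = {"question": "How was the service?", "response": "It was wonderful, thank you!"}
prompt = fill_template(TEMPLATE, row)
print(prompt)
```

Failing early on a missing field is a deliberate choice here: a prompt with an unfilled placeholder would still produce a label, but not one you could trust.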

Benchmarking Eval Templates

To benchmark your LLM evaluator, you need to generate a golden dataset. This should be representative of the type of data you expect to see in your application. The golden dataset should have the “ground truth” label so that we can measure performance of the LLM eval template. Often such labels come from human feedback.

Building such a dataset is laborious, but you can often find standardized datasets for common use cases. Then, run the eval across your golden dataset and generate metrics (overall accuracy, precision, recall, F1, etc.) to determine your benchmark.
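The metrics mentioned above can be computed directly from the golden labels and the eval's predictions. This is a minimal sketch for the binary case. The `binary_metrics` helper and the toy labels are invented for illustration, and in practice you might use an existing metrics library instead.

```python
def binary_metrics(golden: list[str], predicted: list[str], positive: str) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 for one positive class."""
    if not golden or len(golden) != len(predicted):
        raise ValueError("golden and predicted must be non-empty and equal in length")
    pairs = list(zip(golden, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    correct = sum(g == p for g, p in pairs)
    # Guard against division by zero when a class never appears.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy golden labels (e.g., from human review) and LLM eval outputs.
golden = ["relevant", "relevant", "irrelevant", "irrelevant"]
predicted = ["relevant", "irrelevant", "irrelevant", "relevant"]
metrics = binary_metrics(golden, predicted, positive="relevant")
print(metrics)
```

Reporting precision and recall alongside accuracy matters when the golden dataset is imbalanced, since accuracy alone can look good even if the eval rarely finds the minority class.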

Copyright © 2023 Arize AI, Inc