LLM Evaluations

How Are LLM Evaluations Used?

Often, the first step in troubleshooting an LLM application is identifying where the LLM responded poorly, or even hallucinated. To do this, an evaluation metric needs to be used or generated.
This mirrors how performance metrics (e.g. accuracy, false negative rate, MAPE) are used to analyze the performance of a traditional ML model.
Arize helps collect all the responses and user feedback, or helps generate them, so teams can surface bad responses for further troubleshooting and analysis.
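To make the analogy to traditional ML metrics concrete, here is a minimal sketch of computing accuracy and false negative rate over binary "was this response correct?" labels. The labels and data below are entirely made up for illustration.

```python
# Sketch: the same metric math used for traditional ML models, applied to
# binary response-quality labels. All data here is hypothetical.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def false_negative_rate(y_true, y_pred, positive=1):
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    positives = sum(1 for t in y_true if t == positive)
    return fn / positives if positives else 0.0

y_true = [1, 1, 0, 1, 0]  # 1 = response actually correct
y_pred = [1, 0, 0, 1, 1]  # 1 = eval judged the response correct
print(accuracy(y_true, y_pred))             # 0.6
print(false_negative_rate(y_true, y_pred))  # ~0.333
```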

Identify Poor Responses

Select the evaluation metric you want to measure on Performance Tracing.
On the Table view, make sure the Actual Class (i.e. the evaluation metric used) or User Feedback columns are selected in the Primary Columns selector. Then sort those columns from lowest score to highest.
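The same worst-first triage described above can be sketched outside the UI with pandas. The column names (response, eval_score, user_feedback) are hypothetical placeholders, not a fixed schema.

```python
# Sketch of "sort worst-first" triage on collected responses using pandas.
# Column names here are hypothetical.
import pandas as pd

responses = pd.DataFrame({
    "response": ["...", "...", "..."],
    "eval_score": [0.9, 0.1, 0.4],   # e.g. an LLM-assisted eval score
    "user_feedback": [1, -1, 0],     # thumbs up / thumbs down / no feedback
})

# Lowest-scoring responses first, so the worst cases surface for review
worst_first = responses.sort_values(["eval_score", "user_feedback"])
print(worst_first.head())
```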

Types of Evaluation Metrics

LLM-Assisted Evaluation
Use a second LLM call to evaluate the LLM response or context relevance. Arize Phoenix Evals is an open-source library used to generate LLM evals.
User Feedback
End user feedback (ex: Thumbs up/Thumbs down) on the LLM response.
Task-Based Metrics
Different metrics for different tasks (e.g. ROUGE for summarization, BLEU for translation)
While end user feedback is considered the most valuable signal, it is often much harder to come by: either the sample size of users leaving feedback is small, or there is no mechanism for users to leave feedback at all.
A great alternative when end user feedback is not available is LLM-assisted evaluation. In this scenario, a secondary LLM is called to evaluate the response of your LLM application, or the relevance of retrieved context in RAG. Arize Phoenix LLM Evals contains various templates to get started with LLM-assisted evaluations.
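The core mechanic of LLM-assisted evaluation is simple: fill a judge template with the query and response, send it to a second model, and parse the one-word verdict into a label. The sketch below shows only the template step; the template wording and function names are illustrative, not the actual Phoenix Evals API (which ships tested templates for this).

```python
# Minimal sketch of the LLM-as-judge pattern. The template and helper are
# hypothetical stand-ins; Phoenix Evals provides production templates.
EVAL_TEMPLATE = """You are evaluating a question-answering system.
Question: {query}
Answer: {response}
Is the answer correct and sufficient? Reply with exactly one word:
"correct" or "incorrect"."""

def build_eval_prompt(query: str, response: str) -> str:
    return EVAL_TEMPLATE.format(query=query, response=response)

prompt = build_eval_prompt(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
)
# In a real pipeline, this prompt is sent to the judge LLM and the one-word
# reply is parsed into an eval label ("correct" / "incorrect").
print(prompt)
```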

LLM-Assisted Evaluation Use Cases

There are many use cases for LLM-assisted evaluation; common ones include:
  • Q&A Correctness: The first evaluation many teams want, especially when user feedback is not available, is an evaluation of the final response. Did the response my LLM gave answer the user's query correctly and sufficiently? The secondary LLM is sent the user's query and the response, and scores how well the response answered the question. This evaluation helps highlight where bad responses or hallucinations are occurring for further analysis.
  • Context Relevance: In RAG use cases, an important question to ask when analyzing performance is: how relevant was the context my LLM retrieved to the user's query? Here, a secondary LLM is given the user query alongside the retrieved context and asked to rank or score the relevance of that context. This can help identify whether the root cause of a bad response in a retrieval system is that there isn't enough relevant context in the knowledge base, or that the most similar document was not the most relevant document.
  • Toxicity: Evaluates whether a response contains toxic content.
  • Summarization: This Eval helps evaluate the results of a summarization task.
  • Hallucinations: This LLM Eval detects whether the output of a model is a hallucination based on contextual data. It is designed specifically for hallucinations relative to private or retrieved data: is the answer to a question a hallucination, given a set of contextual data?
Please see the other Arize Phoenix Evals here.
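To make the Context Relevance idea above tangible, here is a toy stand-in that scores each retrieved document against the query using word overlap. A real LLM-assisted eval replaces the overlap function with a judge-LLM call; the names and data below are hypothetical.

```python
# Toy stand-in for a Context Relevance eval: score each retrieved document
# against the query. Real evals use a judge LLM instead of word overlap.
import re

def relevance_score(query: str, document: str) -> float:
    q = set(re.findall(r"\w+", query.lower()))
    d = set(re.findall(r"\w+", document.lower()))
    return len(q & d) / len(q) if q else 0.0

query = "how do I reset my password"
retrieved = [
    "To reset your password, open account settings and choose Reset.",
    "Our company was founded in 2012 and is based in Berlin.",
]
scores = [relevance_score(query, doc) for doc in retrieved]
# Low scores across all retrieved docs suggest the knowledge base lacks
# relevant context; a single low score suggests a retrieval-ranking issue.
print(scores)
```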

How to Log Evaluation Metrics
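Whatever SDK is used for logging, evaluation labels are typically joined to response records on a shared prediction ID before being sent. This is a generic sketch of that join step, not the Arize logging API; all column names are hypothetical.

```python
# Generic sketch: attach eval labels to response records on a shared
# prediction ID so they can be logged together. Column names are made up.
import pandas as pd

responses = pd.DataFrame({
    "prediction_id": ["a1", "a2", "a3"],
    "prompt": ["q1", "q2", "q3"],
    "response": ["r1", "r2", "r3"],
})
evals = pd.DataFrame({
    "prediction_id": ["a1", "a2", "a3"],
    "qa_correctness": ["correct", "incorrect", "correct"],
})

# Left join keeps every response, even those not yet evaluated
log_ready = responses.merge(evals, on="prediction_id", how="left")
print(log_ready)
```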

What's After Evaluation?

Depending on the task's accuracy, a lot of immediate improvement can often be realized by leveraging prompt engineering. Once teams hit a wall with prompt engineering, fine-tuning can be an option for further improvement.
Image by Andrej Karpathy