Identify patterns of bad responses
LLMs are susceptible to “hallucinations”: generated text that is factually incorrect or fabricated. Hallucinations are a particular risk when LLMs are deployed in production use cases. Using Arize, teams can score their LLM responses and identify clusters of hallucinations and other bad responses.
Users log their prompts and responses (plus embeddings and metadata) to Arize. For more details on logging, check out our docs on the LLM/Generative model type.
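A minimal sketch of the kind of record you might assemble before logging. The field names here are illustrative, not the Arize SDK's actual schema; see the docs linked above for the real logging client and column mapping.

```python
import json


def build_llm_record(prompt, response, embedding, metadata):
    """Assemble one prompt/response record with its embedding and metadata.

    Illustrative structure only; the actual schema comes from the Arize
    LLM/Generative model type docs.
    """
    if len(embedding) == 0:
        raise ValueError("embedding must be non-empty")
    return {
        "prompt": prompt,
        "response": response,
        "response_embedding": list(embedding),
        "metadata": dict(metadata),
    }


record = build_llm_record(
    prompt="Summarize the ticket in one sentence.",
    response="The user cannot reset their password.",
    embedding=[0.12, -0.07, 0.33],  # toy embedding; real ones have hundreds of dims
    metadata={"model": "my-llm-v1", "user_feedback": "thumbs_down"},
)
print(json.dumps(record, indent=2))
```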
The most common evaluation metrics used for LLMs are:
End-user feedback (e.g., thumbs up/thumbs down) on the LLM response.
LLM-assisted evaluation: use a second LLM call to evaluate the first LLM's response. We recommend the OpenAI Evals library for evaluation templates.
Task-specific metrics (e.g., ROUGE for summarization, BLEU for translation).
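To make the task-specific metrics concrete, here is a toy ROUGE-1 recall score (fraction of reference unigrams that appear in the candidate). This is a simplified sketch for illustration; production evaluation would use a full ROUGE implementation with stemming and n-gram variants.

```python
def rouge1_recall(reference: str, candidate: str) -> float:
    """Toy ROUGE-1 recall: share of reference unigrams found in the candidate."""
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens:
        return 0.0
    overlap = sum(1 for tok in ref_tokens if tok in cand_tokens)
    return overlap / len(ref_tokens)


# 5 of the 6 reference unigrams ("sat" is missing) appear in the candidate.
print(rouge1_recall("the cat sat on the mat", "the cat lay on the mat"))  # → 0.833...
```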
Using these evaluation scores, Arize automatically surfaces clusters of bad responses to focus improvement on. You can select how to sort the clusters, by evaluation score and by dataset.
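The ranking step can be sketched as grouping scored responses by cluster and sorting clusters by mean evaluation score, worst first. The data and function below are hypothetical, assuming each response already carries a cluster id and a numeric score (lower = worse):

```python
from collections import defaultdict


def worst_clusters(records):
    """Group (cluster_id, eval_score) pairs and rank clusters by mean score, ascending.

    Returns a list of (cluster_id, mean_score), worst cluster first.
    """
    scores = defaultdict(list)
    for cluster_id, score in records:
        scores[cluster_id].append(score)
    ranked = sorted(scores.items(), key=lambda kv: sum(kv[1]) / len(kv[1]))
    return [(cid, sum(vals) / len(vals)) for cid, vals in ranked]


records = [("A", 0.9), ("A", 0.8), ("B", 0.2), ("B", 0.4), ("C", 0.6)]
print(worst_clusters(records))  # cluster "B" ranks first with the lowest mean score
```

Sorting worst-first mirrors the workflow in the text: teams triage the lowest-scoring cluster before moving on.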
Depending on the task's baseline accuracy, there is often substantial immediate improvement to be gained from prompt engineering. As teams hit a wall with prompt engineering, fine-tuning becomes an option for further improvement.
Image by Andrej Karpathy