Catching Hallucinations

Identify patterns of bad responses

LLMs are susceptible to “hallucinations”: generated text that is factually incorrect or fabricated. Hallucinations are an especially serious risk when LLMs are deployed in production use cases. Using Arize, teams can score their LLM responses and identify clusters of hallucinations and other bad responses.

How it works

Step 1: Log Prompt & Responses

Users log their prompts and responses (plus embeddings and metadata) to Arize. For more details on logging, check out our docs on the LLM/Generative model type.
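As a rough sketch of what gets logged at this step, each record bundles the prompt, the response, their embeddings, and any metadata. The record fields below are illustrative, not the exact Arize SDK schema; consult the logging docs for the real column names and client call.

```python
from dataclasses import dataclass, field

@dataclass
class LLMRecord:
    """One prompt/response pair with embeddings and metadata (illustrative shape)."""
    prediction_id: str
    prompt: str
    response: str
    prompt_embedding: list        # produced by the same embedding model for all records
    response_embedding: list
    metadata: dict = field(default_factory=dict)  # e.g., model version, task type

records = [
    LLMRecord(
        prediction_id="pred-001",
        prompt="Summarize the quarterly report.",
        response="Revenue grew 12% quarter over quarter...",
        prompt_embedding=[0.12, -0.34, 0.56],
        response_embedding=[0.21, 0.08, -0.44],
        metadata={"model_version": "v2", "task": "summarization"},
    )
]
# Each record would then be sent to Arize, e.g., via its pandas logger.
```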

Step 2: Log or Generate Evaluation Metrics

The most common evaluation metrics used for LLMs are:

User Feedback
End-user feedback (e.g., thumbs up/thumbs down) on the LLM response.
LLM-Assisted Evaluation
Use a second LLM call to evaluate the LLM response. We recommend using the OpenAI Evals library to find various templates.
Task-Based Metrics
Different metrics for different tasks (e.g., ROUGE for summarization, BLEU for translation).
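To make the task-based metrics concrete, here is a simplified ROUGE-1 recall sketch: the fraction of reference unigrams that also appear in the candidate summary. Production evaluation should use a vetted implementation (e.g., a ROUGE library) rather than this minimal version.

```python
def rouge_1_recall(reference: str, candidate: str) -> float:
    """Unigram recall: fraction of reference words that appear in the candidate."""
    ref_words = reference.lower().split()
    cand_words = set(candidate.lower().split())
    if not ref_words:
        return 0.0
    overlap = sum(1 for word in ref_words if word in cand_words)
    return overlap / len(ref_words)

# 5 of the 6 reference words appear in the candidate -> 5/6
score = rouge_1_recall("the cat sat on the mat", "the cat is on the mat")
```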

Step 3: Find clusters of bad responses with low evaluation scores

Using these evaluation scores, Arize automatically surfaces clusters of bad responses to focus on for improvement. You can select how to sort the clusters, by evaluation score and by dataset.
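The idea behind this step can be sketched as follows: group scored responses by cluster and rank clusters by mean evaluation score, surfacing the worst ones first. The cluster ids here are assumed to be precomputed (in practice they would come from clustering response embeddings); the data and threshold are illustrative.

```python
from collections import defaultdict

def surface_bad_clusters(responses, threshold=0.5, top_k=2):
    """Return up to top_k clusters whose mean eval score falls below threshold,
    ranked worst-first. Each response is a dict with cluster_id and eval_score."""
    clusters = defaultdict(list)
    for response in responses:
        clusters[response["cluster_id"]].append(response["eval_score"])
    ranked = sorted(
        ((cid, sum(scores) / len(scores)) for cid, scores in clusters.items()),
        key=lambda pair: pair[1],  # ascending mean score: worst clusters first
    )
    return [(cid, avg) for cid, avg in ranked if avg < threshold][:top_k]

responses = [
    {"cluster_id": "refund-policy", "eval_score": 0.2},
    {"cluster_id": "refund-policy", "eval_score": 0.3},
    {"cluster_id": "greetings", "eval_score": 0.9},
    {"cluster_id": "pricing", "eval_score": 0.4},
]
worst = surface_bad_clusters(responses)  # refund-policy (0.25), then pricing (0.4)
```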

Step 4: Kick off Workflows for Improvement

Depending on the accuracy required for the task, a lot of immediate improvement can be realized through prompt engineering. Once teams hit a wall with prompt engineering, fine-tuning becomes an option for further improvement.
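As one example of the prompt-engineering workflow, a common pattern for reducing hallucinations is to constrain the model to provided context and give it an explicit way to decline. The template text below is illustrative, not a recommended production prompt.

```python
# Baseline prompt: the model is free to answer from its parametric knowledge,
# which is where hallucinations tend to come from.
BASE_PROMPT = "Answer the question: {question}"

# Grounded variant: restrict the model to supplied context and allow "I don't know".
GROUNDED_PROMPT = (
    "Answer the question using ONLY the context below. "
    'If the answer is not in the context, say "I don\'t know."\n\n'
    "Context:\n{context}\n\n"
    "Question: {question}"
)

prompt = GROUNDED_PROMPT.format(
    context="Arize supports logging prompts, responses, and embeddings.",
    question="Does Arize support embedding logging?",
)
```

After each template change, re-run the evaluation metrics from Step 2 on the affected clusters to confirm the scores actually improved.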
Image by Andrej Karpathy