Hallucinations
When To Use Hallucination Eval Template
This LLM Eval detects whether the output of a model is a hallucination based on contextual data.
It is specifically designed to detect hallucinations in answers generated from private or retrieved data: it checks whether an AI answer to a question is a hallucination relative to the reference data that was fed into the context window (for example, via retrieval) and used to generate the answer.
It is not designed to check hallucinations against what the LLM was trained on, so it is not useful for public-fact hallucinations, e.g. "What was Michael Jordan's birthday?"
Hallucination Eval Template
We are continually iterating our templates; view the most up-to-date template on GitHub.
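To inspect the template locally, you can print it from the Phoenix evals package. The sketch below assumes the import path phoenix.evals and the constant names HALLUCINATION_PROMPT_TEMPLATE and HALLUCINATION_PROMPT_RAILS_MAP referenced later on this page; exact names may differ between package versions.

```python
# Sketch: view the hallucination template and its output labels ("rails").
# Assumes the arize-phoenix-evals package is installed and exposes these names.
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
)

# Prints the template text (or its repr, depending on the package version).
print(HALLUCINATION_PROMPT_TEMPLATE)

# Prints the mapping from the boolean ground-truth value to the label strings
# the eval LLM is constrained to produce.
print(HALLUCINATION_PROMPT_RAILS_MAP)
```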
How To Run the Hallucination Eval
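The sketch below shows one way to run the template with Phoenix's llm_classify. The model choice, parameter names, and dataframe column names are illustrative assumptions and may vary by package version; the column names are assumed to match the template's query, reference-text, and answer variables (print the template to confirm them).

```python
# Sketch only: assumes the arize-phoenix-evals package is installed and an
# OPENAI_API_KEY is set in the environment.
import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Toy data; the column names assume the template's query/reference/answer variables.
df = pd.DataFrame(
    {
        "input": ["Who wrote the 2023 report?"],
        "reference": ["The 2023 report was written by the data team."],
        "output": ["The 2023 report was written by the marketing team."],
    }
)

# The eval LLM that judges each answer against its reference text.
model = OpenAIModel(model="gpt-4", temperature=0.0)

# The rails constrain the eval LLM's output to the template's expected labels.
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

hallucination_classifications = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,  # optional: have the eval LLM explain each label
)
```

The call returns a dataframe of labels (plus explanations when requested), aligned row by row with the input dataframe.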
The above example shows how to use the hallucination template for Eval detection.
Benchmark Results
This benchmark was obtained using the notebook below. It was run using the HaluEval QA Dataset as a ground truth dataset. Each example in the dataset was evaluated using the HALLUCINATION_PROMPT_TEMPLATE above, and the resulting labels were compared against the is_hallucination label in the HaluEval dataset to generate the confusion matrices below.
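As a rough sketch of that comparison step, the confusion matrices and precision/recall/F1 figures below can be reproduced with scikit-learn once the eval labels sit next to the HaluEval ground truth. The dataframe and its column names here are hypothetical, and the "factual"/"hallucinated" label strings are assumed to be the template's rails.

```python
# Sketch only: `results` stands in for a dataframe holding the eval output
# labels alongside the HaluEval ground truth; column names are illustrative.
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

results = pd.DataFrame(
    {
        "is_hallucination": [True, False, True, False],  # HaluEval ground truth
        "eval_label": ["hallucinated", "factual", "factual", "factual"],  # eval output
    }
)

# Map the eval's string labels onto the dataset's boolean ground truth.
predicted = results["eval_label"] == "hallucinated"
actual = results["is_hallucination"]

print(confusion_matrix(actual, predicted))
print(classification_report(actual, predicted, target_names=["factual", "hallucinated"]))
```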
[Confusion matrices: GPT-4 Results, GPT-3.5 Results, Claude v2 Results, GPT-4 Turbo Results]
| Hallucination Eval | GPT-4 | GPT-4 Turbo | Gemini Pro | GPT-3.5 | GPT-3.5-turbo-instruct | Palm 2 (Text Bison) | Claude v2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Precision | 0.93 | 0.97 | 0.89 | 0.89 | 0.89 | 1 | 0.80 |
| Recall | 0.72 | 0.70 | 0.53 | 0.65 | 0.80 | 0.44 | 0.95 |
| F1 | 0.82 | 0.81 | 0.67 | 0.75 | 0.84 | 0.61 | 0.87 |
| Throughput | GPT-4 | GPT-4 Turbo | GPT-3.5 |
| --- | --- | --- | --- |
| 100 Samples | 105 sec | 58 sec | 52 sec |