This LLM Eval detects if the output of a model is a hallucination based on contextual data.
This Eval is specifically designed to detect hallucinations in generated answers from private or retrieved data. The Eval detects if an AI answer to a question is a hallucination based on the reference data used to generate the answer.
This Eval is designed to check for hallucinations on private data, specifically on data that is fed into the context window from retrieval.
It is not designed to check for hallucinations about facts the LLM learned during training, so it is not useful for detecting hallucinated answers to general public-knowledge questions, e.g. "What was Michael Jordan's birthday?"
The above Eval shows how to use the hallucination template for hallucination detection.
We are continually iterating on our templates; view the most up-to-date template.
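As a rough illustration, the template can be run over a dataframe of question, reference, and answer rows. The sketch below assumes the arize-phoenix-evals package and its llm_classify helper; the column names (input, reference, output) and exact argument names may differ between versions, so treat it as a sketch rather than canonical usage.

```python
# Minimal sketch of running the hallucination template with phoenix.evals.
# Assumes a dataframe with "input", "reference", and "output" columns; exact
# imports and argument names may vary by package version.
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

df = pd.DataFrame(
    {
        "input": ["Who founded the company mentioned in the report?"],
        "reference": ["The 2021 annual report states the company was founded by Jane Doe."],
        "output": ["The company was founded by John Smith."],
    }
)

model = OpenAIModel(model="gpt-4", temperature=0.0)  # requires OPENAI_API_KEY
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())  # e.g. ["factual", "hallucinated"]

# Each row is labeled "factual" or "hallucinated" with respect to the reference text.
results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,  # include the judge model's reasoning
)
print(results[["label", "explanation"]])
```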
This benchmark was obtained using the notebook below. It was run using the HaluEval dataset as the ground truth dataset. Each example in the dataset was evaluated using the HALLUCINATION_PROMPT_TEMPLATE above, then the resulting labels were compared against the is_hallucination label in the HaluEval dataset to generate the confusion matrices below.
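The label comparison step can be sketched as follows. The variable and column names here are illustrative assumptions, and scikit-learn is used for the confusion matrix and metrics.

```python
# Sketch of scoring the eval's labels against the HaluEval ground truth.
# Variable and column names are illustrative assumptions.
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Labels produced by llm_classify (one per HaluEval example).
eval_results = pd.DataFrame({"label": ["hallucinated", "factual", "factual", "hallucinated"]})

# Ground-truth flags from the HaluEval dataset.
halueval_df = pd.DataFrame({"is_hallucination": [True, False, True, True]})

predicted = eval_results["label"] == "hallucinated"
actual = halueval_df["is_hallucination"].astype(bool)

# Rows: ground truth; columns: eval predictions.
print(confusion_matrix(actual, predicted))

# Per-class precision, recall, and F1 (the metrics summarized in the table below).
print(classification_report(actual, predicted, target_names=["factual", "hallucinated"]))
```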
| Metric | Scores |
|---|---|
| Precision | 0.93, 0.97, 0.89, 0.89, 0.89, 1.00, 0.80 |
| Recall | 0.72, 0.70, 0.53, 0.65, 0.80, 0.44, 0.95 |
| F1 | 0.82, 0.81, 0.67, 0.75, 0.84, 0.61, 0.87 |
| 100 Samples (runtime) | 105 sec, 58 sec, 52 sec |