Hallucinations
When To Use Hallucination Eval Template
This LLM Eval detects if the output of a model is a hallucination based on contextual data.
This Eval is specifically designed to detect hallucinations in generated answers from private or retrieved data. The Eval detects if an AI answer to a question is a hallucination based on the reference data used to generate the answer.
This Eval is designed to check for hallucinations on private data, that is, data fed into the context window from retrieval.
It is NOT designed to check for hallucinations about what the LLM was trained on. It is not useful for public-fact questions such as "What was Michael Jordan's birthday?"
It is useful for detecting hallucinations in RAG systems.
Hallucination Eval Template
We are continually iterating on our templates; view the most up-to-date template on GitHub. Last updated on 10/12/2023.
Benchmark Results
GPT-4 Results
GPT-3.5 Results
Claude v2 Results
GPT-4 Turbo
How To Run the Eval
The above Eval shows how to use the hallucination template to detect hallucinations.
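As a rough illustration of the flow, an Eval of this kind fills a prompt template with the question, the retrieved reference text, and the model's answer, sends it to a judge LLM, and maps the judge's reply onto a fixed set of labels. The sketch below is a minimal, library-free version of that idea. The template wording, the label names `factual` and `hallucinated`, and the helpers `snap_to_rails`, `classify`, and `fake_judge` are all assumptions made for illustration; they are not the actual template or the library's API.

```python
from typing import Callable, Sequence

# Hypothetical template and labels, for illustration only.
# The real, maintained template lives on GitHub and may differ.
TEMPLATE = (
    "Question: {query}\n"
    "Reference text: {reference}\n"
    "Answer: {response}\n"
    "Is the answer factual or hallucinated with respect to the reference text? "
    "Reply with a single word."
)
RAILS = ("factual", "hallucinated")


def snap_to_rails(raw_output: str, rails: Sequence[str] = RAILS) -> str:
    """Map free-form judge output onto exactly one allowed label.

    Returns "unparseable" when zero or several labels appear in the output.
    """
    text = raw_output.strip().lower()
    matches = [label for label in rails if label in text]
    return matches[0] if len(matches) == 1 else "unparseable"


def classify(record: dict, judge: Callable[[str], str]) -> str:
    """Fill the template for one record and label the judge's reply."""
    prompt = TEMPLATE.format(**record)
    return snap_to_rails(judge(prompt))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return "Hallucinated."


record = {
    "query": "When does the warranty expire?",
    "reference": "The warranty covers parts for 12 months from purchase.",
    "response": "The warranty lasts for five years.",
}
label = classify(record, fake_judge)
print(label)
```

In practice `fake_judge` would be replaced by a call to the judge model being benchmarked. Treating ambiguous replies as `unparseable` instead of guessing keeps them out of the precision and recall counts.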
Eval | GPT-4 | GPT-4 Turbo | Gemini Pro | GPT-3.5 | GPT-3.5-turbo-instruct | Palm 2 (Text Bison) | Claude V2 |
---|---|---|---|---|---|---|---|
Precision | 0.93 | 0.97 | 0.89 | 0.89 | 0.89 | 1 | 0.80 |
Recall | 0.72 | 0.70 | 0.53 | 0.65 | 0.80 | 0.44 | 0.95 |
F1 | 0.82 | 0.81 | 0.67 | 0.75 | 0.84 | 0.61 | 0.87 |
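For reading the table above, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the rounded precision and recall values in the table. Because those inputs are already rounded to two decimals, the recomputed figure can differ from the reported one by about 0.01. The function name `f1_score` and the `reported` dictionary are introduced here only for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the benchmark table.
reported = {
    "GPT-4": (0.93, 0.72, 0.82),
    "GPT-4 Turbo": (0.97, 0.70, 0.81),
    "Gemini Pro": (0.89, 0.53, 0.67),
    "GPT-3.5": (0.89, 0.65, 0.75),
    "GPT-3.5-turbo-instruct": (0.89, 0.80, 0.84),
    "Palm 2 (Text Bison)": (1.00, 0.44, 0.61),
    "Claude V2": (0.80, 0.95, 0.87),
}

for model, (p, r, f1) in reported.items():
    print(f"{model}: recomputed F1 = {f1_score(p, r):.3f}, reported = {f1}")
```

The comparison also makes the trade-off in the table visible: a model with perfect precision but low recall, such as Palm 2 here, still ends up with a low F1.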
Throughput | GPT-4 | GPT-4 Turbo | GPT-3.5 |
---|---|---|---|
100 Samples | 105 sec | 58 sec | 52 sec |