Q&A on Retrieved Data
When To Use Q&A Eval Template
This Eval checks whether a question was correctly answered by the system based on the retrieved data. In contrast to retrieval Evals, which check the individual chunks of data returned, this is a system-level check that the question was answered correctly. It takes three inputs (a sample input row is sketched below):
question: The question the Q&A system is run against.
sampled_answer: The answer produced by the Q&A system.
context: The retrieved context used to answer the question; this is what the Q&A Eval must use to check whether the answer is correct.
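For concreteness, a single input row might look like the following. This is a minimal sketch: the row contents are hypothetical, but the column names are the ones the eval expects.

```python
import pandas as pd

# A hypothetical input row: the question asked, the answer the Q&A system
# produced, and the retrieved context the answer should be judged against.
df_sample = pd.DataFrame(
    [
        {
            "question": "Who drafted the Declaration of Independence?",
            "sampled_answer": "Thomas Jefferson drafted the Declaration of Independence.",
            "context": "The Declaration of Independence was primarily drafted by Thomas Jefferson "
            "and adopted by the Second Continental Congress on July 4, 1776.",
        }
    ]
)
```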
Q&A Eval Template
We are continually iterating on our templates; view the most up-to-date template on GitHub.
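With Phoenix installed, you can also print the template bundled with your version. This is a minimal sketch; the exact import path and the printed representation may differ slightly between Phoenix releases.

```python
# Inspect the Q&A prompt template that ships with the installed version of Phoenix.
from phoenix.evals import QA_PROMPT_TEMPLATE

print(QA_PROMPT_TEMPLATE)
```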
How To Run the Q&A Eval
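Below is a minimal sketch of running this eval with Phoenix's llm_classify, reusing the df_sample dataframe from the example above. The model choice is illustrative, and the keyword names reflect recent versions of phoenix.evals; older releases may differ slightly.

```python
from phoenix.evals import (
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# The rails constrain the eval LLM's output to the labels the template
# expects ("correct" / "incorrect"), stripping any stray text around them.
rails = list(QA_PROMPT_RAILS_MAP.values())

qa_classifications = llm_classify(
    dataframe=df_sample,  # must contain question, context, and sampled_answer columns
    template=QA_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4", temperature=0.0),
    rails=rails,
    provide_explanation=True,  # optional: have the eval LLM explain each label
)

# The returned dataframe carries a "label" column (and an "explanation"
# column when explanations are requested).
print(qa_classifications["label"].value_counts())
```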
The above Eval uses the QA template for Q&A analysis on retrieved data.
Benchmark Results
The benchmarking dataset was created from the following sources:
SQuAD 2.0: Version 2.0 of the large-scale Stanford Question Answering Dataset (SQuAD), which allows researchers to design AI models for reading-comprehension tasks under challenging constraints. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15785042.pdf
Supplemental data to SQuAD 2.0: To test detection of incorrect answers, we created wrong answers based on the context data and intermixed them with the right answers.
Each example in the dataset was evaluated using the QA_PROMPT_TEMPLATE above, and the resulting labels were compared against the ground truth in the benchmarking dataset.
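Given a dataframe holding the eval's output label alongside the benchmark ground truth, the metrics below can be reproduced with a standard classification report. This is a sketch: benchmark_df, golden_label, and label are assumed names, not part of the published benchmark code.

```python
from sklearn.metrics import classification_report

# benchmark_df is assumed to hold the ground-truth label ("golden_label")
# next to the label produced by llm_classify ("label"); both take the
# values "correct" or "incorrect".
print(
    classification_report(
        benchmark_df["golden_label"],
        benchmark_df["label"],
        labels=["correct", "incorrect"],
    )
)
```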
[Benchmark result plots: GPT-4 Results, GPT-3.5 Results, Claude V2 Results]
| Q&A Eval | GPT-4o | GPT-4 | GPT-4 Turbo | Gemini Pro | GPT-3.5 | GPT-3.5-turbo-instruct | Palm 2 (Text Bison) | Claude V2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Precision | 1 | 1 | 1 | 1 | 0.99 | 0.42 | 1 | 1.0 |
| Recall | 0.89 | 0.92 | 0.98 | 0.98 | 0.83 | 1 | 0.94 | 0.64 |
| F1 | 0.94 | 0.96 | 0.99 | 0.99 | 0.90 | 0.59 | 0.97 | 0.78 |
| Throughput | GPT-4 | GPT-4 Turbo | GPT-3.5 |
| --- | --- | --- | --- |
| 100 Samples | 124 sec | 66 sec | 67 sec |