Q&A on Retrieved Data
This Eval checks whether the system answered a question correctly based on the retrieved data. In contrast to retrieval Evals, which check individual chunks of returned data, this is a system-level check of end-to-end Q&A correctness.
question: the question the Q&A system is being run against
sampled_answer: the answer produced by the Q&A system
context: the context used to answer the question; the Q&A Eval must judge the correctness of the answer against this context alone
Q&A Eval Template
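The exact template ships with the eval library; the sketch below only shows the shape such a prompt takes, with `question`, `context`, and `sampled_answer` filled into a classification prompt whose allowed outputs are `correct` and `incorrect`. The wording here is illustrative, not the library's actual template.

```python
# Illustrative Q&A eval template -- the wording is an assumption,
# not the shipped prompt.
QA_EVAL_TEMPLATE = """You are given a question, an answer, and reference text.
Determine whether the answer correctly answers the question based only on
the reference text.

[BEGIN DATA]
Question: {question}
Reference: {context}
Answer: {sampled_answer}
[END DATA]

Respond with exactly one word: "correct" or "incorrect".
"""


def render_prompt(question: str, context: str, sampled_answer: str) -> str:
    """Fill the template with one record's fields."""
    return QA_EVAL_TEMPLATE.format(
        question=question, context=context, sampled_answer=sampled_answer
    )
```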
How To Run the Eval
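At a high level, running the eval means rendering one prompt per record, sending it to a judge LLM, and snapping the response onto the allowed labels. A minimal self-contained sketch, with the LLM call abstracted behind a `judge` callable (in real use this would be a model API call; here it is a plain function):

```python
from typing import Callable, Dict, List

RAILS = ("correct", "incorrect")  # allowed output labels


def run_qa_eval(
    records: List[Dict[str, str]],
    judge: Callable[[str], str],
) -> List[str]:
    """Label each record (keys: question / context / sampled_answer).

    `judge` maps a rendered prompt to a raw model response.
    """
    labels = []
    for rec in records:
        prompt = (
            f"Question: {rec['question']}\n"
            f"Reference: {rec['context']}\n"
            f"Answer: {rec['sampled_answer']}\n"
            'Respond with exactly one word: "correct" or "incorrect".'
        )
        raw = judge(prompt).strip().lower()
        # Snap the raw response onto the rails; anything else is unparseable.
        labels.append(raw if raw in RAILS else "NOT_PARSABLE")
    return labels
```

Responses outside the rails are flagged rather than guessed at, so parsing failures stay visible in the output.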
Benchmark Results
[Figures omitted: GPT-4 Results, GPT-3.5 Results, Claude V2 Results]
| Eval | GPT-4o | GPT-4 | GPT-4 Turbo | Gemini Pro | GPT-3.5 | GPT-3.5 Turbo Instruct | Palm (Text Bison) | Claude V2 |
|---|---|---|---|---|---|---|---|---|
| Precision | 1 | 1 | 1 | 1 | 0.99 | 0.42 | 1 | 1 |
| Recall | 0.89 | 0.92 | 0.98 | 0.98 | 0.83 | 1 | 0.94 | 0.64 |
| F1 | 0.94 | 0.96 | 0.99 | 0.99 | 0.90 | 0.59 | 0.97 | 0.78 |
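The precision, recall, and F1 figures above follow the standard definitions, treating `correct` as the positive class and comparing the eval's predicted labels against ground-truth labels. A minimal sketch of that computation:

```python
def precision_recall_f1(predicted, actual, positive="correct"):
    """Precision/recall/F1 with `positive` as the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```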
| Throughput | GPT-4 | GPT-4 Turbo | GPT-3.5 |
|---|---|---|---|
| 100 Samples | 124 sec | 66 sec | 67 sec |
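Throughput is reported as total seconds for a 100-sample run, so per-sample latency follows by simple division, e.g. 124 sec / 100 = 1.24 sec per sample for GPT-4:

```python
# Per-sample latency derived from the 100-sample throughput numbers above.
throughput_sec_per_100 = {"GPT-4": 124, "GPT-4 Turbo": 66, "GPT-3.5": 67}
per_sample_sec = {m: t / 100 for m, t in throughput_sec_per_100.items()}
# e.g. per_sample_sec["GPT-4"] is 1.24
```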