Retrieval (RAG) Relevance
When To Use RAG Eval Template
This Eval checks whether a retrieved chunk contains an answer to the query. It is extremely useful for evaluating retrieval systems.
RAG Eval Template
We are continually iterating on our templates; view the most up-to-date template on GitHub.
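If Phoenix is installed, you can also inspect the built-in template and its allowed output labels locally. The snippet below is a minimal sketch; it assumes the RAG_RELEVANCY_PROMPT_TEMPLATE and RAG_RELEVANCY_PROMPT_RAILS_MAP objects exported by phoenix.evals (import paths may differ across Phoenix versions).

```python
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
)

# Show the prompt sent to the LLM judge and the output labels ("rails")
# it is constrained to answer with (e.g. relevant vs. unrelated).
print(RAG_RELEVANCY_PROMPT_TEMPLATE)
print(RAG_RELEVANCY_PROMPT_RAILS_MAP)
```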
How To Run the RAG Relevance Eval
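The sketch below shows one way to run this eval with Phoenix's llm_classify. It is illustrative rather than definitive: it assumes phoenix.evals is installed, an OpenAI API key is configured, and a dataframe df whose column names match the template's variables (here, an input column holding the query and a reference column holding the retrieved chunk). Exact parameter and column names may vary between Phoenix versions.

```python
import pandas as pd
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Example dataframe: each row is a (query, retrieved chunk) pair to grade.
# Column names must match the variables used by the prompt template.
df = pd.DataFrame(
    {
        "input": ["How tall is the Eiffel Tower?"],
        "reference": ["The Eiffel Tower is 330 metres tall."],
    }
)

# The LLM used as the judge.
model = OpenAIModel(model="gpt-4", temperature=0.0)

# The rails constrain the judge's output to the template's allowed labels.
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

relevance_classifications = llm_classify(
    dataframe=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,  # optional: ask the judge to explain its label
)
print(relevance_classifications.head())
```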
The above runs the RAG relevancy LLM template against the dataframe df.
Benchmark Results
This benchmark was obtained using the notebook below. It was run using the WikiQA dataset as the ground-truth dataset. Each example in the dataset was evaluated using the RAG_RELEVANCY_PROMPT_TEMPLATE above, and the resulting labels were compared against the ground-truth labels in the WikiQA dataset to generate the confusion matrices below.
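For illustration only (this is not the benchmark notebook's code), the following sketch shows how predicted labels can be compared against ground-truth labels with scikit-learn to produce a confusion matrix and precision/recall/F1; the relevant/unrelated label names are assumed for the example.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Toy example: labels predicted by the eval vs. ground-truth labels from WikiQA.
y_true = ["relevant", "unrelated", "relevant", "unrelated"]
y_pred = ["relevant", "relevant", "relevant", "unrelated"]

print(confusion_matrix(y_true, y_pred, labels=["relevant", "unrelated"]))

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label="relevant"
)
print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```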
Confusion matrices are shown for each judge model: GPT-4, GPT-3.5, Claude V2, and GPT-4 Turbo.
| RAG Eval  | Scores per benchmarked model             |
| --------- | ---------------------------------------- |
| Precision | 0.60, 0.70, 0.68, 0.61, 0.42, 0.53, 0.79 |
| Recall    | 0.77, 0.88, 0.91, 1.0, 1.0, 1.0, 0.22    |
| F1        | 0.67, 0.78, 0.78, 0.76, 0.59, 0.69, 0.34 |

| Benchmark run | Processing time         |
| ------------- | ----------------------- |
| 100 samples   | 113 sec, 61 sec, 73 sec |