AI vs Human (Groundtruth)
This LLM evaluation is used to compare AI answers to human answers. It is especially useful in RAG system benchmarking, where AI-generated answers are compared against a human-generated ground truth.
A workflow we see in high-quality RAG deployments is generating a golden dataset of questions along with a high-quality set of human answers. These datasets are typically in the range of 100-200 examples, but they provide a strong check on AI-generated answers. This Eval checks that the AI-generated answer matches the human ground truth; it is designed to catch missing data in "half" answers and differences of substance. A runnable sketch follows the worked example below.
Question:
What Evals are supported for LLMs on generative models?
Human:
Arize supports a suite of Evals available from the Phoenix Evals library, including both pre-tested Evals and the ability to configure custom Evals. Some of the pre-tested LLM Evals are listed below:
Retrieval Relevance, Question and Answer, Toxicity, Human Groundtruth vs AI, Citation Reference Link Relevancy, Code Readability, Code Execution, Hallucination Detection, and Summarization
AI:
Arize supports LLM Evals.
Eval:
Incorrect
Explanation of Eval:
The AI answer is very brief and lacks the specific details that are present in the human ground truth answer. While the AI answer is not incorrect in stating that Arize supports LLM Evals, it fails to mention the specific types of Evals that are supported, such as Retrieval Relevance, Question and Answer, Toxicity, Human Groundtruth vs AI, Citation Reference Link Relevancy, Code Readability, Hallucination Detection, and Summarization. Therefore, the AI answer does not fully capture the substance of the human answer.
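To reproduce a verdict like the one above over your own golden dataset, the eval can be run with the Phoenix Evals library. The snippet below is a minimal sketch, not a definitive recipe: it assumes `arize-phoenix-evals` and `pandas` are installed, an OpenAI API key is configured, and that the dataframe column names match the variables expected by the built-in `HUMAN_VS_AI_PROMPT_TEMPLATE` (argument and variable names can differ slightly across Phoenix versions, so confirm against your installed template).

```python
import pandas as pd
from phoenix.evals import (
    HUMAN_VS_AI_PROMPT_RAILS_MAP,
    HUMAN_VS_AI_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Golden dataset rows: the question, the human ground truth answer, and the
# AI-generated answer to judge. Column names are assumed to match the
# template's variables -- print the template to confirm for your version.
df = pd.DataFrame(
    {
        "question": ["What Evals are supported for LLMs on generative models?"],
        "correct_answer": [
            "Arize supports a suite of Evals from the Phoenix Evals library, "
            "including Retrieval Relevance, Question and Answer, Toxicity, "
            "Human Groundtruth vs AI, Citation Reference Link Relevancy, "
            "Code Readability, Code Execution, Hallucination Detection, "
            "and Summarization."
        ],
        "ai_generated_answer": ["Arize supports LLM Evals."],
    }
)

# The rails constrain the judge model's output to the template's labels
# ("correct" / "incorrect").
rails = list(HUMAN_VS_AI_PROMPT_RAILS_MAP.values())

results = llm_classify(
    dataframe=df,
    template=HUMAN_VS_AI_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4", temperature=0.0),
    rails=rails,
    provide_explanation=True,  # also return the judge's reasoning
)
print(results[["label", "explanation"]])
```

The `label` column holds the eval verdict and, with `provide_explanation=True`, the `explanation` column holds reasoning like the example explanation shown above.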
Overview of template:
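The prompt text itself ships with the Phoenix Evals library; assuming the `HUMAN_VS_AI_PROMPT_TEMPLATE` and `HUMAN_VS_AI_PROMPT_RAILS_MAP` names used above, it can be inspected directly rather than copied by hand:

```python
from phoenix.evals import (
    HUMAN_VS_AI_PROMPT_RAILS_MAP,
    HUMAN_VS_AI_PROMPT_TEMPLATE,
)

# Inspect the prompt the judge model receives. In recent Phoenix versions,
# printing the template object shows the full prompt text (check your version).
print(HUMAN_VS_AI_PROMPT_TEMPLATE)

# The rails map defines the allowed output labels ("correct" / "incorrect"),
# which is why the Eval in the example above reads "Incorrect".
print(HUMAN_VS_AI_PROMPT_RAILS_MAP)
```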
Benchmark Results
|           | GPT-4o | GPT-4 | GPT-4 Turbo | Gemini Pro |
|-----------|--------|-------|-------------|------------|
| Precision | 0.90   | 0.92  | 0.87        | 0.78       |
| Recall    | 0.56   | 0.74  | 0.69        | 0.87       |
| F1        | 0.69   | 0.82  | 0.77        | 0.82       |