This LLM evaluation compares AI-generated answers to human answers. It's especially useful when benchmarking a RAG system against human-generated ground truth.
A workflow we see for high-quality RAG deployments is generating a golden dataset of questions paired with a high-quality set of answers. These datasets are often small, in the range of 100-200 questions, but they provide a strong check on the AI-generated answers. This Eval checks that the AI-generated answer matches the human ground truth. It's designed to catch missing data in "half" answers and differences of substance.
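As a rough illustration of what a golden dataset row might look like, here is a minimal sketch that checks each row for the three fields the template uses (`question`, `correct_answer`, `ai_generated_answer`). The helper `validate_golden_rows` and the sample row are hypothetical and are not part of Phoenix.

```python
# Field names taken from the template variables; the helper itself is hypothetical.
REQUIRED_FIELDS = ("question", "correct_answer", "ai_generated_answer")


def validate_golden_rows(rows):
    """Return (row_index, field) pairs for fields that are missing or blank."""
    problems = []
    for index, row in enumerate(rows):
        for field in REQUIRED_FIELDS:
            value = row.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((index, field))
    return problems


# Made-up sample row, for illustration only.
golden_rows = [
    {
        "question": "Which column holds the expert answer?",
        "correct_answer": "The correct_answer column.",
        "ai_generated_answer": "Expert answers are stored in correct_answer.",
    },
]

problems = validate_golden_rows(golden_rows)
```

Rows that pass this check could then be loaded into a dataframe (for example with `pandas.DataFrame(golden_rows)`) before running the Eval.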
Overview of Template
You are comparing a human ground truth answer from an expert to an answer from an AI model.
Your goal is to determine if the AI answer correctly matches, in substance, the human answer.
[BEGIN DATA]
************
[Question]: {question}
************
[Human Ground Truth Answer]: {correct_answer}
************
[AI Answer]: {ai_generated_answer}
************
[END DATA]
Compare the AI answer to the human ground truth answer, if the AI correctly answers the question,
then the AI answer is "correct". If the AI answer is longer but contains the main idea of the
Human answer please answer "correct". If the AI answer divergences or does not contain the main
idea of the human answer, please answer "incorrect".
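To see how the placeholders in the template are filled, the sketch below renders the data block for one row using plain `str.format`. It only copies the data section of the template above, and it is an assumption about how substitution works in principle, not Phoenix's own rendering code. The function `render_data_block` and the sample row are hypothetical.

```python
# Data section copied from the template above; the placeholders match its variables.
DATA_BLOCK = (
    "[BEGIN DATA]\n"
    "************\n"
    "[Question]: {question}\n"
    "************\n"
    "[Human Ground Truth Answer]: {correct_answer}\n"
    "************\n"
    "[AI Answer]: {ai_generated_answer}\n"
    "************\n"
    "[END DATA]"
)


def render_data_block(row):
    """Substitute one row's values into the data block (hypothetical helper)."""
    return DATA_BLOCK.format(
        question=row["question"],
        correct_answer=row["correct_answer"],
        ai_generated_answer=row["ai_generated_answer"],
    )


# Made-up sample row, for illustration only.
example_row = {
    "question": "What do the rails do?",
    "correct_answer": "They restrict the output to the allowed labels.",
    "ai_generated_answer": "They limit the output to a fixed set of labels.",
}

rendered = render_data_block(example_row)
```

Each key in the row corresponds to one placeholder, which is why the input data needs a field for every template variable.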
How to run Eval:
from phoenix.evals import (
    HUMAN_VS_AI_PROMPT_RAILS_MAP,
    HUMAN_VS_AI_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# The rails hold the output to the specific values defined by the template.
# They remove stray text such as ",,," or "..." and
# ensure the binary value expected from the template is returned.
rails = list(HUMAN_VS_AI_PROMPT_RAILS_MAP.values())

# df is assumed to be a dataframe with one column per template variable:
# question, correct_answer, ai_generated_answer.
relevance_classifications = llm_classify(
    dataframe=df,
    template=HUMAN_VS_AI_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    verbose=False,
    provide_explanation=True,
)
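Once the Eval has run, the labels can be rolled up into a single score. The sketch below is one possible way to do that, assuming the result exposes its labels as an iterable of strings (for example a `label` column on the returned dataframe). The helper `summarize_labels` and the sample labels are hypothetical.

```python
from collections import Counter


def summarize_labels(labels):
    """Count "correct" and "incorrect" labels and compute the share marked correct."""
    counts = Counter(labels)
    total = sum(counts.values())
    correct = counts.get("correct", 0)
    incorrect = counts.get("incorrect", 0)
    return {
        "total": total,
        "correct": correct,
        "incorrect": incorrect,
        # Anything outside the two rails, e.g. output that could not be parsed.
        "other": total - correct - incorrect,
        "accuracy": correct / total if total else 0.0,
    }


# Made-up labels, for illustration only.
sample_labels = ["correct", "incorrect", "correct", "unparsable"]
summary = summarize_labels(sample_labels)
```

With the run above, this might be called as `summarize_labels(relevance_classifications["label"])`, assuming the output dataframe has a `label` column. Tracking the "other" count separately keeps unparsable outputs from being silently counted as wrong answers.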