This LLM evaluation compares AI-generated answers to human answers. It is especially useful when benchmarking RAG systems against human-generated ground truth.
A common workflow for high-quality RAG deployments is to build a golden dataset of questions paired with expert-written answers. A set in the range of 100-200 examples is usually enough to provide a strong check on the AI-generated answers. This Eval checks whether the AI-generated answer matches the human ground truth. It is designed to catch incomplete "half" answers and differences of substance. A sketch of assembling such a dataset into the dataframe the Eval expects is shown below.
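As a minimal sketch (the example row and its contents are hypothetical), the dataframe passed to the Eval needs the three columns referenced by the template: question, correct_answer, and ai_generated_answer.

import pandas as pd

# Hypothetical golden dataset: each row pairs a question and its expert-written
# answer with the answer your RAG system produced for that same question.
df = pd.DataFrame(
    {
        "question": ["When was the company founded?"],
        "correct_answer": ["The company was founded in 2015."],
        "ai_generated_answer": ["It was founded in 2015 in San Francisco."],
    }
)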
Overview of the template
print(HUMAN_VS_AI_PROMPT_TEMPLATE)

You are comparing a human ground truth answer from an expert to an answer from an AI model.
Your goal is to determine if the AI answer correctly matches, in substance, the human answer.
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Human Ground Truth Answer]: {correct_answer}
    ************
    [AI Answer]: {ai_generated_answer}
    ************
    [END DATA]
Compare the AI answer to the human ground truth answer, if the AI correctly answers the question,
then the AI answer is "correct". If the AI answer is longer but contains the main idea of the
Human answer please answer "correct". If the AI answer divergences or does not contain the main
idea of the human answer, please answer "incorrect".
How to run the Eval:
from phoenix.evals import (
    HUMAN_VS_AI_PROMPT_RAILS_MAP,
    HUMAN_VS_AI_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)
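# OpenAIModel reads credentials from the environment (e.g. OPENAI_API_KEY),
# so make sure your key is set before running the Eval.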
# The rails are used to constrain the output to the specific values defined by the template.
# Stray text such as ",,," or "..." is removed,
# ensuring the binary value expected from the template is returned.
rails = list(HUMAN_VS_AI_PROMPT_RAILS_MAP.values())
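# For this template the rails are expected to be ["correct", "incorrect"],
# i.e. the values of HUMAN_VS_AI_PROMPT_RAILS_MAP.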
relevance_classifications = llm_classify(
    dataframe=df,  # must contain the question, correct_answer, and ai_generated_answer columns
    template=HUMAN_VS_AI_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    verbose=False,
    provide_explanation=True,  # also return the judge's explanation for each label
)
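The call returns a dataframe aligned row-for-row with df. A minimal sketch of reviewing the results, assuming the label and explanation columns produced by llm_classify:

# Attach the eval results back onto the golden dataset for review.
df["human_vs_ai_label"] = relevance_classifications["label"].values
df["human_vs_ai_explanation"] = relevance_classifications["explanation"].values

# Overall agreement rate between the AI answers and the human ground truth.
accuracy = (df["human_vs_ai_label"] == "correct").mean()
print(f"{accuracy:.1%} of AI answers match the human ground truth")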