LLM evaluators use LLMs as judges to assess the success of your experiment. These evaluators can either use a prebuilt LLM evaluation template or be customized to suit your specific needs.
Arize supports a wide range of LLM evaluators out of the box through the llm_classify function.
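If you don't want to write the grading prompt yourself, phoenix.evals ships prebuilt classification templates (hallucination, Q&A correctness, toxicity, and others) that can be passed straight to llm_classify. The imports below are a sketch based on the phoenix.evals defaults; exact names can vary by version:

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    QA_PROMPT_TEMPLATE,
    TOXICITY_PROMPT_TEMPLATE,
)

The custom example below writes the hallucination template out explicitly so you can see what the judge prompt looks like end to end.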
Here's an example of an LLM evaluator that checks for hallucinations in the model output:
import os

import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify
from phoenix.experiments.types import EvaluationResult

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # API key for the judge model
HALLUCINATION_PROMPT_TEMPLATE = """
In this task, you will be presented with a query, a reference text and an answer. The answer is
generated to the question based on the reference text. The answer may contain false information. You
must use the reference text to determine if the answer to the question contains false information,
if the answer is a hallucination of facts. Your objective is to determine whether the answer text
contains factual information and is not a hallucination. A 'hallucination' refers to
an answer that is not based on the reference text or assumes information that is not available in
the reference text.
Your response should be a single word: either "factual" or "hallucinated", and
it should not include any other text or characters. "hallucinated" indicates that the answer
provides factually inaccurate information to the query based on the reference text. "factual"
indicates that the answer to the question is correct relative to the reference text, and does not
contain made up information. Please read the query and reference text carefully before determining
your response.
# Query: {query}
# Reference text: {reference}
# Answer: {response}
Is the answer above factual or hallucinated based on the query and reference text?
"""
def hallucination_eval(output, dataset_row):
    # Get the original query and reference text from the dataset_row
    query = dataset_row.get("query")
    reference = dataset_row.get("reference")

    # Create a DataFrame to pass into llm_classify
    df_in = pd.DataFrame(
        {"query": query, "reference": reference, "response": output}, index=[0]
    )

    # Run the LLM classification
    eval_df = llm_classify(
        dataframe=df_in,
        template=HALLUCINATION_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
        rails=["factual", "hallucinated"],
        provide_explanation=True,
    )

    # Map the eval df to an EvaluationResult
    label = eval_df["label"][0]
    score = 1 if label == "factual" else 0
    explanation = eval_df["explanation"][0]

    # Return the evaluation result
    return EvaluationResult(label=label, score=score, explanation=explanation)
In this example, the hallucination_eval function evaluates whether the output of an experiment contains hallucinations by checking it against the reference text from the dataset row using an LLM. The llm_classify function runs the eval, and the evaluator returns an EvaluationResult that includes a score, label, and explanation.
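Before wiring the evaluator into an experiment, you can sanity-check it on a single hand-written row. The query, reference, and output values below are made up for illustration:

# Illustrative inputs only; any query/reference pair from your dataset works here.
sample_row = {
    "query": "Who wrote Pride and Prejudice?",
    "reference": "Pride and Prejudice is an 1813 novel by Jane Austen.",
}
result = hallucination_eval(
    output="Pride and Prejudice was written by Jane Austen.",
    dataset_row=sample_row,
)
print(result.label, result.score)  # expected: "factual" with a score of 1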
Once you've defined your evaluator function, you can use it in your experiment run like this:
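A minimal sketch, assuming you run the experiment through Phoenix's run_experiment entry point with a dataset and task you've already defined (dataset, task, and the experiment name below are placeholders):

from phoenix.experiments import run_experiment

# dataset and task are placeholders for your Phoenix dataset and the task function under test.
experiment = run_experiment(
    dataset,
    task,
    evaluators=[hallucination_eval],
    experiment_name="hallucination-check",
)

Each experiment run passes the task output and the corresponding dataset row to hallucination_eval, and the resulting label, score, and explanation are recorded alongside the run.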