LLM Eval Experiment

LLM Evaluators

LLM evaluators utilize LLMs as judges to assess the success of your experiment. These evaluators can either use a prebuilt LLM evaluation template or be customized to suit your specific needs.

Arize supports a large number of LLM evaluators out of the box via the llm_classify function.

Using Arize Evaluators

Here's an example of an LLM evaluator that checks for hallucinations in the model output:

import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)
from phoenix.experiments.types import EvaluationResult

# The Evaluator base class comes from the experiments SDK you are using
class HallucinationEvaluator(Evaluator):
    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        print("Evaluating outputs")
        expected_output = dataset_row["attributes.llm.output_messages"]

        # Create a DataFrame with the actual and expected outputs
        df_in = pd.DataFrame(
            {"selected_output": output, "expected_output": expected_output}, index=[0]
        )
        # llm_classify expects rails as a list of allowed output labels
        rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())
        # Run the LLM classification
        expect_df = llm_classify(
            dataframe=df_in,
            template=HALLUCINATION_PROMPT_TEMPLATE,
            model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),  # assumes OPENAI_API_KEY is set
            rails=rails,
            provide_explanation=True,
        )
        label = expect_df["label"][0]
        score = 1 if label == rails[1] else 0  # Score 1 when the output is factual (not hallucinated)
        explanation = expect_df["explanation"][0]

        # Return the evaluation result
        return EvaluationResult(score=score, label=label, explanation=explanation)

In this example, the HallucinationEvaluator class evaluates whether the output of an experiment contains hallucinations by comparing it to the expected output using an LLM. The llm_classify function runs the eval, and the evaluator returns an EvaluationResult that includes a score, label, and explanation.
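The label-to-score mapping used above can be sketched in isolation. This assumes the rail values are ["hallucinated", "factual"] (the values of HALLUCINATION_PROMPT_RAILS_MAP in recent Phoenix versions; check your installed version):

```python
def label_to_score(label: str, rails: list[str]) -> int:
    # rails[1] is the "passing" label; any other label scores 0
    return 1 if label == rails[1] else 0

# Assumed rail values from HALLUCINATION_PROMPT_RAILS_MAP
rails = ["hallucinated", "factual"]
```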

Once you define your evaluator class, you can use it in your experiment run like this:

experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[HallucinationEvaluator()],
    experiment_name=experiment_name,
)

You can customize LLM evaluators to suit your experiment's needs, whether you're checking for hallucinations, function choice, or other criteria where an LLM's judgment is valuable. Simply update the template with your instructions and the rails with the desired output. You can also have multiple LLM evaluators in a single experiment to assess different aspects of the output simultaneously.
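As a sketch of what such a customization might look like, here is a hypothetical template and rails for judging function choice; the template text and rail labels are illustrative, not a prebuilt Arize template:

```python
# A hypothetical custom template for judging whether the right function was chosen.
# The {expected_function} and {chosen_function} variables are filled from your DataFrame columns.
FUNCTION_CHOICE_TEMPLATE = """
You are comparing the function an assistant chose to call against the
function it should have called. Respond with a single word.

[Expected function]: {expected_function}
[Chosen function]: {chosen_function}

Answer "correct" if the chosen function matches the expected one,
otherwise answer "incorrect".
"""

# Rails constrain the judge's response to these labels
FUNCTION_CHOICE_RAILS = ["correct", "incorrect"]
```

You would then pass these to llm_classify via the template and rails parameters in place of the prebuilt hallucination template and rails.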

Need help writing a custom evaluator template? Use Copilot to write one for you.


Copyright © 2023 Arize AI, Inc