Quickstart: Experiments

This guide helps you run experiments to test and validate changes in your LLM applications against a curated dataset. Learn more about the concepts of experiments here.

You can create and run an experiment with the following steps:

  1. Install dependencies and get your API key

  2. Create your dataset

  3. Define your LLM task

  4. Define your evaluation criteria

  5. Run the experiment

Install dependencies and get your API key

Install Arize Client

pip install 'arize[Datasets]' openai pandas 'arize-phoenix[evals]'

Setup client

Grab your Space ID and API key (developer key) from the Arize UI.

from arize.experimental.datasets import ArizeDatasetsClient

client = ArizeDatasetsClient(developer_key="*****")
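
If you prefer not to hardcode credentials, you can read them from environment variables. The variable name below is a placeholder, not a name the SDK requires.

import os
from arize.experimental.datasets import ArizeDatasetsClient

# Hypothetical environment variable name; use whatever convention you prefer
client = ArizeDatasetsClient(developer_key=os.environ["ARIZE_DEVELOPER_KEY"])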

Create your dataset

You can copy and paste the sample code below, but for a more comprehensive guide, check out the quickstart: datasets guide.

import pandas as pd
from arize.experimental.datasets.utils.constants import GENERATIVE

data = [{
    "persona": "An aspiring musician who is writing their own songs",
    "problem": "I often get stuck overthinking my lyrics and melodies.",
}]

df = pd.DataFrame(data)

dataset_id = client.create_dataset(
    space_id="YOUR_SPACE_ID", 
    dataset_name="Your Dataset",
    dataset_type=GENERATIVE,
    data=df
)
print(f"Dataset created with ID: {dataset_id}")

Define your LLM task

This is where you define the LLM task whose effectiveness you want to test. This could be structured data extraction, SQL generation, chatbot response generation, search, or any task of your choosing.

The task function receives the dataset_row, so you can access the variables in your dataset, and it returns a string.

# Define our prompt for an AI product manager
PROMISE_GENERATOR = """You are an expert product manager. I will provide you with context on my current thinking for persona and problem. Your job is to suggest the best promise based on the context provided. The promise should help the persona reach an outcome they care about and solve their primary problem. Return only the promise and nothing else.

[BEGIN CONTEXT]
Persona: {persona}
Problem: {problem}
[END CONTEXT]
"""

# Set up the OpenAI client
import os
from openai import OpenAI

openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Define our task
def task(dataset_row) -> str:
    # Read the prompt variables from the dataset row
    problem = dataset_row["problem"]
    persona = dataset_row["persona"]
    
    # Set up the prompt
    user_message = {
        "role": "user",
        "content": PROMISE_GENERATOR.format(problem=problem, persona=persona)
    }
    
    # Run the LLM and return the response
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[user_message], temperature=1
    )
    return response.choices[0].message.content
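
Before wiring the task into an experiment, it can help to sanity-check it on a single row. The snippet below is just an illustrative local test using the first row of the DataFrame created earlier.

# Quick local check: run the task against the first dataset row
sample_row = df.iloc[0].to_dict()
print(task(sample_row))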

Define your evaluation criteria

An evaluator is any function that (1) takes the task output and (2) returns an assessment. This gives you tremendous flexibility: you can write your own LLM judge using a custom template, or use a code-based evaluator, as in the sketch below.
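
For example, a code-based evaluator can be a class that applies a deterministic check to the task output. The length check below is an arbitrary illustration, not part of this guide's experiment.

from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    Evaluator
)

class LengthEvaluator(Evaluator):
    # Illustrative code-based evaluator: passes if the output is reasonably concise
    name = "Length Evaluator"

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        concise = len(output) <= 280  # arbitrary threshold for illustration
        return EvaluationResult(
            score=1 if concise else 0,
            label="concise" if concise else "too long",
            explanation=f"The output is {len(output)} characters long."
        )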

In this example, we will create our own custom LLM template to judge whether the output satisfied the criteria. There are a few key concepts you need to know to create an evaluator.

  1. class Evaluator - when creating a custom Evaluator, you need to define its name and the Evaluator.evaluate() function.

  2. Evaluator.evaluate(self, output, dataset_row) - this function takes in the output from the LLM task above along with the dataset_row, which has context and prompt variables, so you can assess the quality of the task output. You must return an EvaluationResult.

  3. EvaluationResult - this class requires three inputs so we can display evaluations correctly in Arize.

    1. score - this is a numeric score which we aggregate and display in the UI across your experiment runs. It must be a value between 0 and 1, inclusive.

    2. label - this is a classification label to determine success or failure of your evaluation, such as "Hallucinated" or "Factual".

    3. explanation - this is a string where you can provide why the output passed or failed the evaluation criteria.

  4. llm_classify - this function uses Arize Phoenix to run LLM-based evaluations. See here for the API reference.

# import our libraries
from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    Evaluator
)

from phoenix.evals import (
    OpenAIModel,
    llm_classify,
)

import pandas as pd

# Create our evaluation template
CUSTOM_EVAL_TEMPLATE = """You are evaluating whether the product promise satisfies the criteria below, given the following context.

[Persona]: {persona}
[Problem]: {problem}
[Product Promise]: {output}

[BEGIN CRITERIA]
The product promise should be something that the persona would be excited to try.
[END CRITERIA]

Your potential answers:
(A) The product promise is compelling to this persona to solve this problem.
(B) The product promise is not compelling to this persona for this problem.
(C) It cannot be determined whether this is compelling to the user.

Return a single letter from this list ["A", "B", "C"]
"""

class PromiseEvaluator(Evaluator):
    name = "Promise Evaluator"

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        # Get prompt variables
        problem = dataset_row['problem']
        persona = dataset_row['persona']

        # Set up the rails and input dataframe
        rails = ["A", "B", "C"]
        df_in = pd.DataFrame({"output": output, "problem": problem, "persona": persona}, index=[0])
        print(df_in)

        # Run evaluation
        eval_df = llm_classify(
            dataframe=df_in,
            template=CUSTOM_EVAL_TEMPLATE,
            model=OpenAIModel(model="gpt-4o"),
            rails=rails,
            provide_explanation=True
        )

        # Output results for experiment
        label = eval_df['label'][0]
        score = 1 if label.upper() == "A" else 0
        explanation = eval_df['explanation'][0]
        return EvaluationResult(score=score, label=label, explanation=explanation)
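
You can also exercise the evaluator on its own before running the full experiment. The output string below is made up purely for illustration; the row values come from the sample dataset above.

# Standalone check of the evaluator (the sample output is made up)
result = PromiseEvaluator().evaluate(
    output="A songwriting coach that helps you finish a draft in one sitting.",
    dataset_row={
        "persona": "An aspiring musician who is writing their own songs",
        "problem": "I often get stuck overthinking my lyrics and melodies.",
    },
)
print(result.label, result.score, result.explanation)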

Run the experiment

# Run the experiment
client.run_experiment(
    space_id="YOUR_SPACE_ID",
    dataset_id=dataset_id,
    task=task,
    evaluators=[PromiseEvaluator()],
    experiment_name="PM-experiment",
)

See the experiment results in Arize

Navigate to your dataset in the UI to see the experiment output table.
