Quickstart: Experiments

This guide shows how to run experiments that test and validate changes to your LLM application against a curated dataset.

Upload a CSV as a dataset

Download the example CSV and upload it in the UI. The CSV must include an id column, as in the example below:

id,topic
1,zebras
2,clouds
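
If you would rather generate the file than edit it by hand, a minimal sketch using only the Python standard library could look like the following. The helper name build_topics_csv is hypothetical and not part of any Arize API; it numbers the rows from 1 so that every row has an id.

```python
import csv
import io


def build_topics_csv(topics):
    """Return CSV text with an id column followed by a topic column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "topic"])  # header row; the id column is required
    for row_id, topic in enumerate(topics, start=1):
        writer.writerow([row_id, topic])
    return buffer.getvalue()


csv_text = build_topics_csv(["zebras", "clouds"])
print(csv_text)
```

Writing csv_text to a file gives you something you can upload in the UI in place of the downloaded example.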

Test a prompt in playground

Load the dataset you created into the prompt playground and run it to see your results. Once the run finishes, you can save it as an experiment to track your changes.

Run an evaluator on your playground experiments

Overview

The key steps of running an experiment are:

Install dependencies and get your API key
Create a dataset
Define a task
Define your evaluation criteria
Run the experiment

Install dependencies and get your API key

Install Arize Client

pip install 'arize[Datasets]' openai pandas 'arize-phoenix[evals]'

Set up the client

Get your Space ID and API key from the API keys page.

from arize.experimental.datasets import ArizeDatasetsClient

# Replace the placeholders with the values from the API keys page
space_id = "*****"
client = ArizeDatasetsClient(api_key="*****")
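
To keep credentials out of source files, one option is to read them from environment variables. The sketch below is only an illustration: the variable names ARIZE_SPACE_ID and ARIZE_API_KEY and the helper load_arize_credentials are assumptions made for this example, not names defined by the Arize SDK.

```python
import os


def load_arize_credentials(env=None):
    """Return (space_id, api_key) read from an environment-style mapping."""
    env = os.environ if env is None else env
    names = ("ARIZE_SPACE_ID", "ARIZE_API_KEY")  # assumed variable names
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["ARIZE_SPACE_ID"], env["ARIZE_API_KEY"]


# Demo with an explicit mapping so the sketch runs without real credentials
demo_space_id, demo_api_key = load_arize_credentials(
    {"ARIZE_SPACE_ID": "my-space-id", "ARIZE_API_KEY": "my-api-key"}
)
```

In a real session you would call load_arize_credentials() with no arguments and pass the returned key to ArizeDatasetsClient(api_key=...).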

Create your dataset

Datasets are groupings of data points that you want to use as test cases. You can create datasets in code, auto-generate them with LLMs, or import spans through the Arize UI. Here is an example of creating one in code.

# Imports for building and uploading the dataset
from uuid import uuid1

import pandas as pd
from arize.experimental.datasets.utils.constants import GENERATIVE

# Create dataframe to upload
data = [{"topic": "Zebras"}]
df = pd.DataFrame(data)

# Create dataset in Arize; the short random suffix keeps the name unique
dataset_id = client.create_dataset(
    dataset_name="haiku-topics" + str(uuid1())[:5],
    data=df,
    space_id=space_id,
    dataset_type=GENERATIVE
)

print(f"Dataset created with ID: {dataset_id}")
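
For a dataset with more than one test case, the rows can be assembled from a plain list first. The sketch below is an illustration under stated assumptions: make_topic_rows is a hypothetical helper, and stripping whitespace and dropping case-insensitive duplicates are choices made for this example, not Arize requirements.

```python
def make_topic_rows(topics):
    """Turn raw topic strings into row dicts, skipping blanks and duplicates."""
    rows = []
    seen = set()
    for topic in topics:
        cleaned = topic.strip()
        key = cleaned.lower()  # compare case-insensitively
        if not cleaned or key in seen:
            continue
        seen.add(key)
        rows.append({"topic": cleaned})
    return rows


rows = make_topic_rows(["Zebras", " Clouds ", "zebras", ""])
print(rows)
```

The resulting list can be passed to pd.DataFrame(rows) in place of the single-row data list used above.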

Define your LLM task

Next, define the LLM task whose effectiveness you want to test. This could be structured data extraction, SQL generation, chatbot response generation, search, or any other task.

The input is the dataset_row so you can access the variables in your dataset, and the output is a string.

import os
import openai
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")

# Create task which creates a haiku based on a topic
def create_haiku(dataset_row) -> str:
    topic = dataset_row.get("topic")
    openai_client = openai.OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a haiku about {topic}"}],
        max_tokens=20
    )
    return response.choices[0].message.content
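
Because a task is just a function from a dataset row to a string, it can help to check the row handling without calling an LLM. The following is a minimal sketch, not part of the Arize or OpenAI APIs: build_haiku_messages and stub_haiku_task are hypothetical names, and the stub returns a canned string instead of a model reply.

```python
def build_haiku_messages(dataset_row):
    """Build the chat messages for one dataset row."""
    topic = dataset_row.get("topic")
    if not topic:
        raise ValueError("dataset_row is missing a 'topic' value")
    return [{"role": "user", "content": f"Write a haiku about {topic}"}]


def stub_haiku_task(dataset_row) -> str:
    """Stand-in task: same input and output types, but no network call."""
    messages = build_haiku_messages(dataset_row)
    return f"(stub reply to: {messages[0]['content']})"


print(stub_haiku_task({"topic": "clouds"}))
```

Separating prompt construction from the API call also makes it easier to reuse the same prompt with a different model.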

Define your evaluation criteria

An evaluator is any function that (1) takes the task output and (2) returns an assessment. You can write your own LLM judge using a custom template, or use code-based evaluators.

from phoenix.evals import (
    OpenAIModel,
    llm_classify,
)

from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
)

CUSTOM_TEMPLATE = """
You are evaluating whether tone is positive, neutral, or negative

[Message]: {output}

Respond with either "positive", "neutral", or "negative"
"""

def tone_eval(output):
    df_in = pd.DataFrame({"output": output}, index=[0])
    eval_df = llm_classify(
        dataframe=df_in,
        template=CUSTOM_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),
        rails=["positive", "neutral", "negative"],
        provide_explanation=True,
    )
    # Return score, label, and explanation. The score is fixed at 1 here;
    # derive it from the label if you need a numeric metric.
    return EvaluationResult(
        score=1,
        label=eval_df["label"][0],
        explanation=eval_df["explanation"][0],
    )
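
A code-based evaluator follows the same shape without calling a model. The sketch below checks whether the output has three non-empty lines, as a haiku should. To keep it runnable without the Arize package installed, it returns a hypothetical SimpleResult dataclass with the same three fields; in an actual experiment you would presumably return the EvaluationResult imported above instead. The label names are also made up for this example.

```python
from dataclasses import dataclass


@dataclass
class SimpleResult:
    """Hypothetical stand-in with the same fields used by tone_eval."""
    score: float
    label: str
    explanation: str


def three_line_eval(output: str) -> SimpleResult:
    """Score 1.0 when the output has exactly three non-empty lines."""
    lines = [line for line in output.splitlines() if line.strip()]
    is_three = len(lines) == 3
    return SimpleResult(
        score=1.0 if is_three else 0.0,
        label="three_lines" if is_three else "wrong_line_count",
        explanation=f"Found {len(lines)} non-empty line(s); a haiku has 3.",
    )


result = three_line_eval("Stripes in tall grass\nZebras drift like morning fog\nHooves drum the dry earth")
print(result)
```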

Run the experiment

To run an experiment, you need to specify the space you are logging it to, the dataset_id, the task you would like to run, and the list of evaluators defined on the output. This also logs the traces to Arize so you can debug each run.

client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=create_haiku,
    evaluators=[tone_eval],
    experiment_name="haiku-experiment",
    exit_on_error=True
)
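
Conceptually, an experiment applies the task to every dataset row and then applies each evaluator to the task output. The loop below is a local sketch of that idea only; it is not how run_experiment is implemented, and it does no logging or tracing. The names dry_run, echo_task, and length_eval are hypothetical.

```python
def dry_run(rows, task, evaluators):
    """Apply the task to each row, then every evaluator to the task output."""
    results = []
    for row in rows:
        output = task(row)
        evaluations = {
            evaluator.__name__: evaluator(output) for evaluator in evaluators
        }
        results.append({"input": row, "output": output, "evaluations": evaluations})
    return results


def echo_task(dataset_row) -> str:
    """Stand-in task that needs no LLM call."""
    return f"A poem about {dataset_row['topic']}"


def length_eval(output):
    """Stand-in evaluator that reports the output length."""
    return len(output)


results = dry_run([{"topic": "zebras"}], echo_task, [length_eval])
print(results)
```

A dry run like this can catch errors in a task or evaluator before you spend API calls on a full experiment.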

See the experiment result in Arize

Navigate to the dataset in the UI and see the experiment output table.

