This guide helps you run experiments to test and validate changes in your LLM applications against a curated dataset. Learn more about the concepts of experiments here.
You can create and run an experiment with the following steps:
Grab your Space ID, Developer Key, and API key from the UI
# Set up the Arize datasets client with your credentials
from arize.experimental.datasets import ArizeDatasetsClient

client = ArizeDatasetsClient(api_key="*****", developer_key="*****")
Create your dataset
Datasets are useful groupings of data points you want to use for test cases. You can create datasets through code, using LLMs to auto-generate them, or by importing spans using the Arize UI. Below is sample code to create a dataset from a JSON dataframe.
# Imports for dataset creation
import pandas as pd
from uuid import uuid1

from arize.experimental.datasets.utils.constants import GENERATIVE

# Create dataframe to upload
data = [{"topic": "Zebras"}]
df = pd.DataFrame(data)

# Create dataset in Arize
dataset_id = client.create_dataset(
    dataset_name="haiku-topics" + str(uuid1())[:5],
    data=df,
    space_id=space_id,
    dataset_type=GENERATIVE,
)
print(f"Dataset created with ID: {dataset_id}")
If you have added tracing to your application, you can also create datasets from the spans your application sends to Arize. Go to the traces page and filter for the examples you care about.
In the example below, we filter for spans with a hallucination label and add them to a dataset.
Use AI search to curate your dataset with natural language.
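If you prefer to curate in code instead, the same filtering can be applied to spans you have already pulled into a pandas dataframe. In the sketch below, spans_df and the eval.Hallucination.label column are illustrative assumptions; only create_dataset comes from the client shown above.
# Keep only spans labeled as hallucinations (column name is illustrative)
hallucinated_df = spans_df[spans_df["eval.Hallucination.label"] == "hallucinated"]

# Upload the filtered spans as a new dataset
hallucination_dataset_id = client.create_dataset(
    dataset_name="hallucinated-spans" + str(uuid1())[:5],
    data=hallucinated_df,
    space_id=space_id,
    dataset_type=GENERATIVE,
)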
Define your LLM task
Here is where you define the LLM task you are trying to test for its effectiveness. This could be structured data extraction, SQL generation, chatbot response generation, search, or any task of your choosing.
The task function takes a dataset_row as input, so you can access the variables in your dataset; its output is a string.
import os
from getpass import getpass

import openai

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

# Create task which writes a haiku about the topic in each dataset row
def create_haiku(dataset_row) -> str:
    topic = dataset_row.get("topic")
    openai_client = openai.OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a haiku about {topic}"}],
        max_tokens=20,
    )
    assert response.choices
    return response.choices[0].message.content
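Before wiring the task into an experiment, you can sanity-check it locally by calling it on a dictionary shaped like a dataset row (the topic below is just the sample from the dataset above):
# Quick local check of the task with a sample row
print(create_haiku({"topic": "Zebras"}))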
Define your evaluation criteria
An evaluator is any function that (1) takes the task output and (2) returns an assessment. This gives you tremendous flexibility to write your own LLM judge using a custom template, or use code-based evaluators.
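For example, a minimal code-based evaluator could simply score whether the task output looks like a three-line haiku; is_three_lines below is an illustrative helper, not part of the SDK:
# Code-based evaluator: score 1 if the output has exactly three lines, else 0
def is_three_lines(output):
    return 1 if len(output.strip().splitlines()) == 3 else 0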
For LLM-based evaluations, we use our OSS package Arize Phoenix and its llm_classify function. See here for the API reference.
import pandas as pd

from phoenix.evals import (
    OpenAIModel,
    llm_classify,
)

CUSTOM_TEMPLATE = """
You are evaluating whether tone is positive, neutral, or negative
[Message]: {output}
Respond with either "positive", "neutral", or "negative"
"""

def is_positive(output):
    df_in = pd.DataFrame({"output": output}, index=[0])
    eval_df = llm_classify(
        dataframe=df_in,
        template=CUSTOM_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),
        rails=["positive", "neutral", "negative"],
        provide_explanation=True,
    )
    # Return (score, label, explanation)
    return (1, eval_df["label"][0], eval_df["explanation"][0])
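As with the task, you can try the evaluator on a sample string before running the full experiment (the haiku below is just illustrative text):
# Quick local check of the evaluator; returns (score, label, explanation)
print(is_positive("Striped shadows at dawn\nquiet hooves cross the grassland\nzebras greet the sun"))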
Run the experiment
# Run the experiment on the dataset, applying the task and evaluator to each row
client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=create_haiku,
    evaluators=[is_positive],
    experiment_name="haiku-experiment",
)
See the experiment result in Arize
Navigate to the dataset in the UI and see the experiment output table.