Run Experiments

The following are the key steps of running an experiment, illustrated by a simple example:

  1. Define/upload a Dataset (e.g. a dataframe)

    • Each record of the dataset is called an Example

  2. Define a task

    • A task is a function that takes each Example and returns an output

  3. Define Evaluators

    • An Evaluator is a function that evaluates the output for each Example

  4. Run the experiment

We'll start by launching the Phoenix app.

import phoenix as px

px.launch_app()


Load a Dataset

A dataset can be as simple as a list of strings inside a dataframe. More sophisticated datasets can also be extracted from traces based on actual production data. Here we just have a small list of questions that we want to ask an LLM about NBA games:

Create pandas dataframe

import pandas as pd

df = pd.DataFrame(
    {
        "question": [
            "Which team won the most games?",
            "Which team won the most games in 2015?",
            "Who led the league in 3 point shots?",
        ]
    }
)

The dataframe can be sent to Phoenix via the Client. input_keys and output_keys are column names of the dataframe, representing the inputs/outputs of the task in question. Here we only have questions, so we leave the outputs blank:

Upload dataset to Phoenix

import phoenix as px

dataset = px.Client().upload_dataset(
    dataframe=df,
    dataset_name="nba-questions",  # any name works
    input_keys=["question"],
    output_keys=[],
)

Each row of the dataset is called an Example.

Create a Task

A task is any function/process that returns a JSON serializable output. A task can also be an async function, but we use a sync function here for simplicity. If the task is a function of one argument, then that argument will be bound to the input field of the dataset example.

def task(x):
    return ...
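To make that contract concrete, here is a minimal, self-contained sketch of a task (the record passed in below is made up for illustration; the only real requirement is that the output be JSON serializable):

```python
import json

def task(x):
    # The output must be JSON serializable so it can be stored with the experiment.
    return {"echo": x["question"], "length": len(x["question"])}

out = task({"question": "Which team won the most games?"})
json.dumps(out)  # raises TypeError if the output is not serializable
```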

For our example here, we'll ask an LLM to build SQL queries based on our questions, which we'll run against a database to obtain a set of results:

Set Up Database

import duckdb
from datasets import load_dataset

data = load_dataset("suzyanil/nba-data")["train"]
conn = duckdb.connect(database=":memory:", read_only=False)
conn.register("nba", data.to_pandas())

Set Up Prompt and LLM

from textwrap import dedent

import openai

client = openai.Client()
columns = conn.query("DESCRIBE nba").to_df().to_dict(orient="records")

LLM_MODEL = "gpt-4o"

columns_str = ",".join(column["column_name"] + ": " + column["column_type"] for column in columns)
system_prompt = dedent(f"""
You are a SQL expert, and you are given a single table named nba with the following columns:
{columns_str}
Write a SQL query corresponding to the user's
request. Return just the query text, with no formatting (backticks, markdown, etc.).""")

def generate_query(question):
    response = client.chat.completions.create(
        model=LLM_MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

def execute_query(query):
    return conn.query(query).fetchdf().to_dict(orient="records")

def text2sql(question):
    results = error = None
    query = generate_query(question)
    try:
        results = execute_query(query)
    except duckdb.Error as e:
        error = str(e)
    return {"query": query, "results": results, "error": error}
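The capture-don't-raise pattern above means the task always returns a dict with the same keys, whether or not the query succeeds. The same idea can be sketched with Python's built-in sqlite3 module (used here purely as a stand-in for duckdb, with a toy table instead of the NBA data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nba (team TEXT, wins INTEGER)")
conn.execute("INSERT INTO nba VALUES ('Warriors', 67), ('Spurs', 55)")

def run_query(query):
    # Mirror the text2sql contract: never raise, always return the same keys.
    results = error = None
    try:
        results = conn.execute(query).fetchall()
    except sqlite3.Error as e:
        error = str(e)
    return {"query": query, "results": results, "error": error}

ok = run_query("SELECT team FROM nba ORDER BY wins DESC LIMIT 1")
bad = run_query("SELECT nonexistent_column FROM nba")
```

Evaluators can then inspect the "results" and "error" fields without special-casing failures.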

Define task as a Function

Recall that each row of the dataset is encapsulated as an Example object, and that the input keys were defined when we uploaded the dataset:

def task(x):
    return text2sql(x["question"])

More complex task inputs

More complex tasks can use additional information. These values can be accessed by defining a task function with specific parameter names which are bound to special values associated with the dataset example:

  • input (example input): def task(input): ...

  • expected (example output): def task(expected): ...

  • reference (alias for expected): def task(reference): ...

  • metadata (example metadata): def task(metadata): ...

  • example (the Example object): def task(example): ...

A task can be defined as a sync or async function that takes any number of the above argument names in any order!
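The binding rule can be sketched in plain Python with inspect.signature: look at the task's parameter names and pass only the matching fields. This is an illustration of the idea with a hypothetical example record, not Phoenix's actual implementation:

```python
import inspect

# Hypothetical example record mirroring the special names above.
example = {
    "input": {"question": "Which team won the most games?"},
    "expected": {},
    "metadata": {"source": "demo"},
}
example["reference"] = example["expected"]  # alias for expected
example["example"] = example  # the record itself

def call_task(task):
    # Bind each declared parameter name to the matching field of the example.
    params = inspect.signature(task).parameters
    return task(**{name: example[name] for name in params})

def my_task(input, metadata):
    return {"question": input["question"], "source": metadata["source"]}

result = call_task(my_task)
```

Because binding is by name, the parameters can appear in any order and unused fields are simply never passed.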

Define Evaluators

An evaluator is any function that takes the task output and returns an assessment. Here we'll simply check whether the queries succeeded in obtaining any results from the database:

def no_error(output) -> bool:
    return not bool(output.get("error"))

def has_results(output) -> bool:
    return bool(output.get("results"))
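As a quick sanity check outside of Phoenix, the evaluators can be called directly on a sample task output (the dict below is a made-up example of a query that ran cleanly but matched no rows):

```python
def no_error(output) -> bool:
    return not bool(output.get("error"))

def has_results(output) -> bool:
    return bool(output.get("results"))

# Hypothetical output: the query succeeded but returned an empty result set.
output = {"query": "SELECT * FROM nba WHERE season = 2050", "results": [], "error": None}

print(no_error(output))     # True: no error was captured
print(has_results(output))  # False: the result set is empty
```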

Run an Experiment

Instrument OpenAI

Instrumenting the LLM calls will also give us the spans and traces that will be linked to the experiment, and can be examined in the Phoenix UI:

from phoenix.trace.openai import OpenAIInstrumentor

OpenAIInstrumentor().instrument()


Run the Task and Evaluators

Running an experiment is as easy as calling run_experiment with the components we defined above. The results of the experiment will show up in Phoenix:

from phoenix.experiments import run_experiment

run_experiment(dataset, task=task, evaluators=[no_error, has_results])
