Run Experiments
The following are the key steps of running an experiment, illustrated by a simple example.
Setup
Make sure you have Phoenix and the instrumentors needed for the experiment setup. For this example we will use the OpenAI instrumentor to trace the LLM calls.
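A typical installation for this walkthrough might look like the following; treat it as a sketch and adjust package names and versions to your environment:

```shell
pip install arize-phoenix openai openinference-instrumentation-openai pandas
```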
Run Experiments
The key steps of running an experiment are:
1. Define/upload a Dataset (e.g. a dataframe). Each record of the dataset is called an Example.
2. Define a task. A task is a function that takes each Example and returns an output.
3. Define Evaluators. An Evaluator is a function that evaluates the output for each Example.
4. Run the experiment.
We'll start by launching the Phoenix app.
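For example, a local Phoenix instance can be launched from a Python session (a sketch; you could instead point the client at an already-running Phoenix server):

```python
import phoenix as px

# Start a local Phoenix instance; the printed URL opens the UI.
session = px.launch_app()
```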
Load a Dataset
A dataset can be as simple as a list of strings inside a dataframe. More sophisticated datasets can also be extracted from traces based on actual production data. Here we just have a small list of questions that we want to ask an LLM about NBA games:
Create pandas dataframe
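A minimal version of such a dataframe might look like this (the questions below are illustrative placeholders):

```python
import pandas as pd

# A tiny dataset: just the questions we want to ask about NBA games.
df = pd.DataFrame(
    {
        "question": [
            "Which team won the most games?",
            "Which team had the highest average points per game?",
            "Who scored the most three-pointers in 2015?",
        ]
    }
)
```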
The dataframe can be sent to Phoenix via the Client. input_keys and output_keys are the dataframe column names that represent the input and output of the task in question. Here we only have questions, so we leave the outputs blank:
Upload dataset to Phoenix
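A sketch of the upload, assuming the dataframe above and a locally running Phoenix instance (the dataset name is arbitrary):

```python
import phoenix as px

phoenix_client = px.Client()
dataset = phoenix_client.upload_dataset(
    dataframe=df,
    dataset_name="nba-questions",
    input_keys=["question"],  # columns used as task input
    output_keys=[],           # no expected outputs for this dataset
)
```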
Each row of the dataset is called an Example.
Create a Task
A task is any function/process that returns a JSON-serializable output. The task can also be an async function, but we use a sync function here for simplicity. If the task is a function of one argument, then that argument will be bound to the input field of the dataset example.
For our example, we'll ask an LLM to build SQL queries based on our questions, which we'll then run against a database to obtain a set of results:
Set Up Database
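As a stand-in for a real NBA games database, the sketch below builds a tiny in-memory SQLite table named nba_games (the table name, columns, and rows are hypothetical) plus a helper to run queries against it:

```python
import sqlite3

# Hypothetical stand-in for a real NBA games database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nba_games (season INTEGER, team TEXT, wins INTEGER, points_per_game REAL)"
)
conn.executemany(
    "INSERT INTO nba_games VALUES (?, ?, ?, ?)",
    [
        (2015, "Warriors", 73, 114.9),
        (2015, "Spurs", 67, 103.5),
        (2015, "Cavaliers", 57, 104.3),
    ],
)


def execute_query(query: str):
    """Run a SQL query against the database and return all rows."""
    return conn.execute(query).fetchall()
```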
Set Up Prompt and LLM
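A sketch of the text-to-SQL step using the OpenAI chat completions API; the model name, prompt wording, and helper name generate_query are assumptions:

```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a SQL expert. You are given a single SQLite table named nba_games "
    "with columns: season, team, wins, points_per_game. "
    "Answer the user's question with a single SQL query and nothing else."
)


def generate_query(question: str) -> str:
    """Ask the LLM to translate a natural-language question into SQL."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()
```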
Define task as a Function
Recall that each row of the dataset is encapsulated as an Example object, and that the input keys were defined when we uploaded the dataset:
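Putting the pieces together, a sync task might look like this (a sketch that reuses the generate_query and execute_query helpers sketched above; the output shape is an assumption):

```python
def task(input):
    """Generate a SQL query for the question and run it against the database."""
    query = generate_query(input["question"])
    try:
        results = execute_query(query)
    except Exception as error:
        return {"query": query, "results": None, "error": str(error)}
    return {"query": query, "results": results, "error": None}
```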
More complex task inputs
More complex tasks can use additional information. These values can be accessed by defining a task function with specific parameter names which are bound to special values associated with the dataset example:
| Parameter name | Bound value | Example |
| --- | --- | --- |
| input | example input | def task(input): ... |
| expected | example output | def task(expected): ... |
| reference | alias for expected | def task(reference): ... |
| metadata | example metadata | def task(metadata): ... |
| example | Example object | def task(example): ... |
A task can be defined as a sync or async function that takes any number of the above argument names, in any order!
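For instance, a task could also read the example's metadata, and be async (a sketch; the hint metadata field is hypothetical and our dataset here doesn't define one):

```python
# Parameters are bound by name, in any order; sync or async both work.
async def task_with_context(input, metadata):
    hint = (metadata or {}).get("hint", "")  # hypothetical metadata field
    return generate_query(f"{input['question']} {hint}".strip())
```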
Define Evaluators
An evaluator is any function that takes the task output and returns an assessment. Here we'll simply check if the queries succeeded in obtaining any result from the database:
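A sketch of such an evaluator, assuming the task output shape used above (a dict with results and error keys):

```python
def has_results(output) -> bool:
    """Pass if the query executed without error and returned at least one row."""
    return output.get("error") is None and bool(output.get("results"))
```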
Run an Experiment
Instrument OpenAI
Instrumenting the LLM calls will also give us spans and traces that are linked to the experiment and can be examined in the Phoenix UI:
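One way to wire this up, assuming the openinference OpenAI instrumentor and Phoenix's OTel helper are installed:

```python
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

# Route OpenAI spans to Phoenix so they are linked to experiment runs.
tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```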
Run the Task and Evaluators
Running an experiment is as easy as calling run_experiment with the components we defined above. The results of the experiment will show up in Phoenix:
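A sketch of the call, using the dataset, task, and evaluator defined above (the experiment name is arbitrary):

```python
from phoenix.experiments import run_experiment

experiment = run_experiment(
    dataset,
    task,
    evaluators=[has_results],
    experiment_name="nba-text-to-sql",
)
```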
Add More Evaluations
If you want to attach more evaluations to the same experiment after the fact, you can do so with evaluate_experiment. If you no longer have access to the original experiment object, you can retrieve it from Phoenix using the get_experiment client method.
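For example (the second evaluator below is hypothetical, and the experiment ID is a placeholder to fill in):

```python
import phoenix as px
from phoenix.experiments import evaluate_experiment


def query_is_select(output) -> bool:
    """A second, hypothetical evaluator: the generated SQL should be a SELECT."""
    return (output.get("query") or "").lstrip().lower().startswith("select")


# Attach the new evaluation to the experiment object we already have in hand...
experiment = evaluate_experiment(experiment, evaluators=[query_is_select])

# ...or retrieve the experiment from Phoenix first if we no longer have the object.
experiment = px.Client().get_experiment(experiment_id="<your-experiment-id>")
```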
Dry Run
Sometimes we may want to do a quick sanity check on the task function or the evaluators before unleashing them on the full dataset. run_experiment() and evaluate_experiment() are both equipped with a dry_run parameter for this purpose: it executes the task and evaluators on a small subset without sending data to the Phoenix server. Setting dry_run=True selects one sample from the dataset, and setting it to a number, e.g. dry_run=3, selects multiple. The sampling is also deterministic, so you can keep re-running it for debugging purposes.
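For example, reusing the components defined above:

```python
# Sanity-check the task and evaluators on 3 deterministic samples;
# nothing is recorded on the Phoenix server.
run_experiment(dataset, task, evaluators=[has_results], dry_run=3)
```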