To run an experiment in code, you need to define three things: a dataset, a task function, and one or more evaluation functions.
You then use the run_experiment function to run the task function against your data, run the evaluation functions against the outputs, and log the results and traces to Arize.
Alternatively, if you already have your LLM outputs or evaluation outputs, you can log those results directly instead of running the task in code.
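For orientation, here is a minimal sketch of how those pieces fit together. It assumes an Arize datasets client (arize_client), a space_id, and a dataset_id are already set up; the task, evaluator, and experiment names are illustrative, and each piece is explained in the sections below.
from typing import Dict

def task(dataset_row: Dict) -> str:
    # Produce an output for one dataset row; a real task would call your app or an LLM here
    return "placeholder answer for: " + str(dataset_row.get("question"))

def evaluator(output: str, dataset_row: Dict) -> bool:
    # Score the task output against the row's expected value
    return output == dataset_row.get("expected")

# Run the task and evaluator against the dataset and log results and traces to Arize
experiment = arize_client.run_experiment(
    space_id=space_id,        # assumes arize_client, space_id, and dataset_id are already defined
    dataset_id=dataset_id,
    task=task,
    evaluators=[evaluator],
    experiment_name="my-first-experiment",
)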
Defining your dataset
You can create a new dataset, or you can use your existing dataset in code.
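As a rough sketch, assuming an Arize datasets client (arize_client) and a space_id are already set up as elsewhere on this page; the create_dataset keyword names shown here are illustrative and may differ by SDK version (for example, a dataset type argument may also be required).
import pandas as pd

dataframe = pd.DataFrame({
    "question": ["what is 1+1", "why is the sky blue"],
    "expected": ["2", "because i said so"],
})

# Option 1: create a new dataset from a dataframe
# (illustrative keyword names; check the dataset docs for your SDK version)
dataset_id = arize_client.create_dataset(
    space_id=space_id,
    dataset_name="my-dataset",
    data=dataframe,
)

# Option 2: pull an existing dataset down as a dataframe
dataset_df = arize_client.get_dataset(space_id=space_id, dataset_name="my-dataset")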
Defining your task
A task is any function that you want to test on a dataset. The simplest version of a task looks like the following:
from typing import Dict

def task(dataset_row: Dict):
    return dataset_row
When you create a dataset, each row is stored as a dictionary with attributes you can retrieve within your task. This can be the user input, the expected output for an evaluation task, or metadata attributes.
Here's an example of how data from your dataset passes through the task function.
"""Example dataset
dataframe = pd.DataFrame({
"id":[1,2],
"attribute": ["example", "example2"],
"question": ["what is 1+1", "why is the sky blue"],
"expected": ["2", "because i said so"]
})
"""
# for the first row of your data, the answer is in the comments
def task(dataset_row: Dict):
question = dataset_row.get("question") # what is 1+1
expected = dataset_row.get("expected") # 2
attribute = dataset_row.get("attribute") # example
data_id = dataset_row.get("id") # 1
return expected
Let's create a task that uses an LLM to answer a question.
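For example, here is a sketch of an LLM-backed task using the OpenAI client. The model choice and prompt are illustrative, and an OPENAI_API_KEY is assumed to be set in your environment.
from typing import Dict
from openai import OpenAI

openai_client = OpenAI()

def task(dataset_row: Dict) -> str:
    question = dataset_row.get("question")
    # Ask the LLM to answer the question from the dataset row
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Answer in a few words: {question}"}],
    )
    return response.choices[0].message.content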
The task function can take optional arguments for convenience; Arize will automatically pass the corresponding dataset_row attributes to your task function. The easiest way to access anything you need is the dataset_row parameter, which contains the entire row of the data with every column available as a dictionary key.
We also offer several convenience options on run_experiment, such as concurrency to reduce the time it takes to complete the experiment. You can specify dry_run=True, which does not log the results to Arize, and exit_on_error=True, which makes it easier to debug when an experiment doesn't run correctly.
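For example, a sketch of how these options might be passed to run_experiment, reusing the client and names from above; the exact keyword placement can vary by SDK version.
experiment = arize_client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=task,
    evaluators=[evaluator],
    experiment_name="test",
    concurrency=4,        # run several rows at once to reduce total runtime
    dry_run=True,         # do not log results to Arize
    exit_on_error=True,   # stop on the first error to make debugging easier
)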
Once your experiment has finished running, you can visualize your experiment results in the Arize UI.
Advanced Options
Set up asynchronous experiments
Experiments can be run either synchronously or asynchronously.
We recommend:
Synchronous: Slower but easier to debug. Code errors for synchronous tasks break at the line of the error, so they are inherently easier to debug. Start with synchronous tasks and evals while you develop them, then make them asynchronous.
Asynchronous: Faster. Use these when the timing and speed of your tests matter. Making the tasks and/or evals asynchronous can speed up your runs by 10x.
A synchronous experiment runs rows one after another; an asynchronous experiment runs rows in parallel.
Here are some code differences between the two. You just need to add the async keyword before your functions, and run nest_asyncio.apply() .
# Sync task
def prompt_gen_task(example):
    print('running task sync')

# Sync evaluation
def evaluate_hallu(output, dataset_row):
    print('running eval sync')

# run experiment
experiment1 = arize_client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=prompt_gen_task,
    evaluators=[evaluate_hallu],
    experiment_name="test"
)

###############

# Async task
async def prompt_gen_task(example):
    print('running task async')

# Async evaluation
async def evaluate_hallu(output, dataset_row):
    print('running eval async')

# Must import this library for async tasks to run properly
import nest_asyncio
nest_asyncio.apply()

# same run experiment function
experiment1 = arize_client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=prompt_gen_task,
    evaluators=[evaluate_hallu],
    experiment_name="test"
)
Sampling a dataset for an experiment
Running a test on a dataset sometimes requires running it on random or stratified samples. Arize supports this by letting teams download the dataset as a dataframe, which can be sampled prior to running the experiment.
# Get dataset as Dataframe
dataset_df = arize_client.get_dataset(space_id=SPACE_ID, dataset_name=dataset_name)
# Any sampling methods you want on a DF
sampled_df = dataset_df.sample(n=100) # Sample 100 rows randomly
# Sample 10% of rows randomly
sampled_df = dataset_df.sample(frac=0.1)
# Create proportional sampling based on the original dataset's class label distribution
stratified_sampled_df = dataset_df.groupby('class_label', group_keys=False).apply(lambda x: x.sample(frac=0.1))
# Select every 10th row
systematic_sampled_df = dataset_df.iloc[::10, :]
# Run Experiment on sampled_df
arize_client.run_experiment(space_id, dataset_name, sampled_df, taskfn, evaluators)
An experiment will only be matched up with the data that was run against it. You can run experiments with different samples of the same dataset, and the platform will take care of tracking and visualization.
Any sampling method that can be applied to a dataframe, no matter how complex, can be used.
Tracing your experiment
Arize tracers instrumented on your experiment code will automatically trace the experiment runs into the platform.
Tracing Using Explicit Spans
from opentelemetry import trace

# Outer function will be traced by Arize with a span
def task_add_1(dataset_row):
    tracer = trace.get_tracer(__name__)
    # Start the span for the function
    with tracer.start_as_current_span("test_function") as span:
        # Extract the number from the dataset row
        num = dataset_row['attributes.my_number']
        # Set 'num' as a span attribute
        span.set_attribute("dataset.my_number", num)
        # Return the incremented number
        return num + 1
Tracing Using Auto-Instrumentor
# Import the automatic instrumentor from OpenInference
from openinference.instrumentation.openai import OpenAIInstrumentor
from openai import OpenAI

# Automatic instrumentation --- this will trace all tasks below that make LLM calls
OpenAIInstrumentor().instrument()

task_prompt_template = "Answer in a few words: {question}"
openai_client = OpenAI()

def task(dataset_row) -> str:
    question = dataset_row["question"]
    message_content = task_prompt_template.format(question=question)
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    return response.choices[0].message.content
To run the experiment with this task, pass the task into run_experiment as shown below.
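A sketch, reusing the arize_client, space_id, and dataset_id from earlier on this page along with any evaluators you have defined; the experiment name is illustrative.
experiment = arize_client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=task,                      # the auto-instrumented task defined above
    evaluators=[evaluate_hallu],    # optional: evaluators you have defined
    experiment_name="llm-qa-experiment",
)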