Run experiments with code

How to create an LLM task to use in experiments


To run an experiment in code, you need to define the following things:

  • A dataset
  • A task function
  • An evaluation function (optional)

Then use the run_experiment function to run the task function against your data, run the evaluation function against the outputs, and log the results and traces to Arize.

Alternatively, you can log your experiment results via code if you already have your LLM outputs or evaluation outputs.

Define your dataset

You can create a new dataset, or you can export your existing dataset in code.

from arize.experimental.datasets import ArizeDatasetsClient

arize_client = ArizeDatasetsClient(api_key=API_KEY)

dataset = arize_client.get_dataset(space_id=SPACE_ID, dataset_id=dataset_id)

Define your task function

A task is any function that you want to test on a dataset. The simplest version of a task looks like the following:

from typing import Dict

def task(dataset_row: Dict):
    return dataset_row

When you create a dataset, each row is stored as a dictionary with attributes you can retrieve within your task. This can be the user input, the expected output for an evaluation task, or metadata attributes.

Here's an example of how the data passes through the task function from your dataset.

"""Example dataset
dataframe = pd.DataFrame({
    "id":[1,2], 
    "attribute": ["example", "example2"],
    "question": ["what is 1+1", "why is the sky blue"],
    "expected": ["2", "because i said so"]
})
"""

# For the first row of the example dataset above, the value of each lookup is shown in the comments
def task(dataset_row: Dict):
    question = dataset_row.get("question") # what is 1+1
    expected = dataset_row.get("expected") # 2
    attribute = dataset_row.get("attribute") # example
    data_id = dataset_row.get("id") # 1
    return expected

Let's create a task that uses an LLM to answer a question.

import openai

def answer_question(dataset_row) -> str:
    question = dataset_row.get("question") # what is 1+1
    openai_client = openai.OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        max_tokens=20,
    )
    assert response.choices
    return response.choices[0].message.content

Task inputs

The task function can take the following optional arguments for convenience; Arize automatically passes the corresponding dataset_row attributes to your task function. The easiest way to access anything you need is through dataset_row itself. A short example follows the table below.

| Parameter name | Description | Dataset Row Attribute | Example |
| --- | --- | --- | --- |
| dataset_row | the entire row of the data, including every column as a dictionary key | -- | def task_fn(dataset_row): ... |
| input | the experiment run input | attributes.input.value | def task_fn(input): ... |
| expected | the expected output | attributes.output.value | def task_fn(expected): ... |
| metadata | metadata for the function | attributes.metadata | def task_fn(metadata): ... |
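
As a sketch of how these convenience parameters can be used (the function body here is illustrative only), a task can accept input and expected directly instead of unpacking dataset_row:

# Illustrative only: 'input' is the experiment run input and 'expected' is the
# reference output, pulled from the dataset row attributes listed above.
def task_fn(input, expected) -> str:
    print(f"Question: {input} (expected answer: {expected})")
    return expected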

Run the experiment

arize_client.run_experiment(
    space_id="",
    dataset_id="", 
    task=answer_question,
    evaluators=[], 
    experiment_name="",
    concurrency=10,
    exit_on_error=False,
    dry_run=False,
)

We offer several convenience parameters, such as concurrency, to reduce the time it takes to complete the experiment. You can specify dry_run=True, which runs the experiment without logging the results to Arize. You can also specify exit_on_error=True, which stops the run when a task fails and makes it easier to debug when an experiment doesn't run correctly. See the SDK API reference for run_experiment for the full list of parameters.
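
For example, while iterating on a task you might start with a debug configuration and only log results to Arize once the task behaves as expected (a sketch reusing the client and task defined above):

# Debugging pass: surface errors immediately and skip logging to Arize
arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=dataset_id,
    task=answer_question,
    evaluators=[],
    experiment_name="answer-question-debug",
    exit_on_error=True,
    dry_run=True,
)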

Once your experiment has finished running, you can visualize your experiment results in the Arize UI.

Advanced Options

Setup asynchronous experiments

Experiments can be run either synchronously or asynchronously.

We recommend:

  • Synchronous: Slower but easier to debug. While you are building your tests, synchronous runs are inherently easier to debug. Start with synchronous tasks and then make them asynchronous.

  • Asynchronous: Faster. Use asynchronous runs when the timing and speed of the tests matter. Making the tasks and/or evals asynchronous can speed up your runs by roughly 10x.

Errors in synchronous tasks break at the exact line of code that failed, so they are easier to debug; we recommend using them while developing your tasks and evals.

A synchronous experiment runs its rows one after another, while an asynchronous experiment runs them in parallel.

Here are the code differences between the two: add the async keyword before your functions and call nest_asyncio.apply().

# Sync task
def prompt_gen_task(example):
    print('running task sync')

# Sync evaluation
def evaluate_hallu(output, dataset_row):
    print('running eval sync')

# Run experiment
experiment1 = arize_client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=prompt_gen_task,
    evaluators=[evaluate_hallu],
    experiment_name="test",
)

###############

# Async task
async def prompt_gen_task(example):
    print('running task async')

# Async evaluation
async def evaluate_hallu(output, dataset_row):
    print('running eval async')

# Must import this library for async tasks to run properly
import nest_asyncio
nest_asyncio.apply()

# Same run_experiment call as before
experiment2 = arize_client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=prompt_gen_task,
    evaluators=[evaluate_hallu],
    experiment_name="test",
)

Sampling a dataset for an experiment

Running a test on a dataset sometimes requires running it on a random or stratified sample of that dataset. Arize supports this by allowing teams to download the dataset as a dataframe, which can be sampled before running the experiment.

# Get dataset as Dataframe
dataset_df = arize_client.get_dataset(space_id=SPACE_ID, dataset_name=dataset_name)

# Any sampling methods you want on a DF
sampled_df = dataset_df.sample(n=100)  # Sample 100 rows randomly

# Sample 10% of rows randomly
sampled_df = dataset_df.sample(frac=0.1)

# Create proportional sampling based on the original dataset's class label distribution
stratified_sampled_df = dataset_df.groupby('class_label', group_keys=False).apply(lambda x: x.sample(frac=0.1))

# Select every 10th row
systematic_sampled_df = dataset_df.iloc[::10, :]

# Run Experiment on sampled_df
arize_client.run_experiment(space_id, dataset_name, sampled_df, taskfn, evaluators)

An experiment is only matched up with the data that was run against it. You can run experiments with different samples of the same dataset, and the platform will take care of tracking and visualizing them.

Any complex sampling method that can be applied to a dataframe can be used for sampling.
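
As a sketch (reusing the client, dataframe, and task names from above, and assuming run_experiment accepts the same arguments shown earlier), running experiments on two different samples of the same dataset is just two separate run_experiment calls:

# Experiment on a 10% random sample
random_sample_df = dataset_df.sample(frac=0.1)
arize_client.run_experiment(space_id, dataset_name, random_sample_df, taskfn, evaluators)

# Experiment on a stratified 10% sample of the same dataset
stratified_df = dataset_df.groupby('class_label', group_keys=False).apply(lambda x: x.sample(frac=0.1))
arize_client.run_experiment(space_id, dataset_name, stratified_df, taskfn, evaluators)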

Tracing your experiment

When you run an experiment, arize_client.run_experiment() produces a task span attached to the experiment. If you want to add more traces to an experiment run, you can instrument any part of the experiment code, and those spans will be attached below the task span.

Arize tracers instrumented in your experiment code will automatically send these traces to the platform.

Tracing Using Explicit Spans

from opentelemetry import trace

# Outer function will be traced by Arize with a span
def task_add_1(dataset_row):
    tracer = trace.get_tracer(__name__)
    
    # Start the span for the function
    with tracer.start_as_current_span("test_function") as span:
        # Extract the number from the dataset row
        num = dataset_row['attributes.my_number']
        
        # Set 'num' as a span attribute
        span.set_attribute("dataset.my_number", num)
        
    # Return the incremented number
    return num + 1

Tracing Using Auto-Instrumentor

# Import the automatic instrumentor from OpenInference
from openinference.instrumentation.openai import OpenAIInstrumentor
# Import the OpenAI client used inside the task
from openai import OpenAI

# Automatic instrumentation --- this will trace all tasks below that make LLM calls
OpenAIInstrumentor().instrument()

task_prompt_template = "Answer in a few words: {question}"
openai_client = OpenAI()
def task(dataset_row) -> str:
    question = dataset_row["question"]
    message_content = task_prompt_template.format(question=question)
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    return response.choices[0].message.content

To run the experiment with this task, pass the task into run_experiment, as shown below.
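
A minimal sketch, reusing the client and ID variables from the earlier examples:

# The auto-instrumented LLM call inside the task will be attached under the experiment's task span
arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=dataset_id,
    task=task,
    evaluators=[],
    experiment_name="traced-experiment",
)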
