Run experiments with code

How to create an LLM task to use in experiments


To run an experiment in code, you need to define the following things:

  • A dataset
  • A task function
  • An evaluation function (optional)

Then use the run_experiment function to run the task function against your data, run the evaluation function against the outputs, and log the results and traces to Arize.

Alternatively, you can log your experiment results via code if you already have your LLM outputs or evaluation outputs.

Define your dataset

You can create a new dataset, or you can export your existing dataset in code.

from arize.experimental.datasets import ArizeDatasetsClient

arize_client = ArizeDatasetsClient(api_key=API_KEY)

dataset = arize_client.get_dataset(space_id=SPACE_ID, dataset_id=dataset_id)

Define your task function

A task is any function that you want to test on a dataset. The simplest version of a task looks like the following:

from typing import Dict

def task(dataset_row: Dict):
    return dataset_row

When you create a dataset, each row is stored as a dictionary with attributes you can retrieve within your task. This can be the user input, the expected output for an evaluation task, or metadata attributes.

Here's an example of how the data passes through the task function from your dataset.

"""Example dataset
dataframe = pd.DataFrame({
    "id":[1,2], 
    "attribute": ["example", "example2"],
    "question": ["what is 1+1", "why is the sky blue"],
    "expected": ["2", "because i said so"]
})
"""

# For the first row of the example dataset above, the value of each lookup is shown in the comments
def task(dataset_row: Dict):
    question = dataset_row.get("question") # what is 1+1
    expected = dataset_row.get("expected") # 2
    attribute = dataset_row.get("attribute") # example
    data_id = dataset_row.get("id") # 1
    return expected

Let's create a task that uses an LLM to answer a question.

import openai

def answer_question(dataset_row) -> str:
    question = dataset_row.get("question") # what is 1+1
    openai_client = openai.OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        max_tokens=20,
    )
    assert response.choices
    return response.choices[0].message.content

Task inputs

The task function can take the following optional arguments for convenience; Arize automatically passes the corresponding dataset_row attributes to your task function. The easiest way to access anything you need is through dataset_row itself. A short example follows the table below.

| Parameter name | Description | Dataset Row Attribute | Example |
| --- | --- | --- | --- |
| dataset_row | the entire row of the data, including every column as a dictionary key | -- | def task_fn(dataset_row): ... |
| input | the experiment run input | attributes.input.value | def task_fn(input): ... |
| expected | the expected output | attributes.output.value | def task_fn(expected): ... |
| metadata | metadata for the function | attributes.metadata | def task_fn(metadata): ... |
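
As a sketch of how these convenience parameters can be used (the function body here is illustrative only), a task can accept input and expected directly instead of unpacking dataset_row:

# Illustrative only: 'input' is the experiment run input and 'expected' is the
# reference output, pulled from the dataset row attributes listed above.
def task_fn(input, expected) -> str:
    print(f"Question: {input} (expected answer: {expected})")
    return expected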

Run the experiment

arize_client.run_experiment(
    space_id="",
    dataset_id="", 
    task=answer_question,
    evaluators=[], 
    experiment_name="",
    concurrency=10,
    exit_on_error=False,
    dry_run=False,
)

We offer several convenience parameters, such as concurrency, to reduce the time it takes to complete the experiment. You can specify dry_run=True, which runs the experiment without logging the results to Arize. You can also specify exit_on_error=True, which stops the run when a task fails and makes it easier to debug when an experiment doesn't run correctly. See the SDK API reference for run_experiment for the full list of parameters.
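
For example, while iterating on a task you might start with a debug configuration and only log results to Arize once the task behaves as expected (a sketch reusing the client and task defined above):

# Debugging pass: surface errors immediately and skip logging to Arize
arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=dataset_id,
    task=answer_question,
    evaluators=[],
    experiment_name="answer-question-debug",
    exit_on_error=True,
    dry_run=True,
)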

Once your experiment has finished running, you can visualize your experiment results in the Arize UI.

Advanced Options

Setup asynchronous experiments

Experiments can be run either synchronously or asynchronously.

We recommend:

  • Synchronous: Slower but easier to debug. While you are building your tests, synchronous runs are inherently easier to debug. Start with synchronous tasks and then make them asynchronous.

  • Asynchronous: Faster. Use asynchronous runs when the timing and speed of the tests matter. Making the tasks and/or evals asynchronous can speed up your runs by roughly 10x.

Errors in synchronous tasks break at the exact line of code that failed, so they are easier to debug; we recommend using them while developing your tasks and evals.

A synchronous experiment runs its rows one after another, while an asynchronous experiment runs them in parallel.

Here are the code differences between the two: add the async keyword before your functions and call nest_asyncio.apply().

# Sync task
def prompt_gen_task(example):
    print('running task sync')

# Sync evaluation
def evaluate_hallu(output, dataset_row):
    print('running eval sync')

# Run experiment
experiment1 = arize_client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=prompt_gen_task,
    evaluators=[evaluate_hallu],
    experiment_name="test",
)

###############

# Async task
async def prompt_gen_task(example):
    print('running task async')

# Async evaluation
async def evaluate_hallu(output, dataset_row):
    print('running eval async')

# Must import this library for async tasks to run properly
import nest_asyncio
nest_asyncio.apply()

# Same run_experiment call as before
experiment2 = arize_client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=prompt_gen_task,
    evaluators=[evaluate_hallu],
    experiment_name="test",
)

Sampling a dataset for an experiment

Running a test on a dataset sometimes requires running it on a random or stratified sample of that dataset. Arize supports this by allowing teams to download the dataset as a dataframe, which can be sampled before running the experiment.

# Get dataset as Dataframe
dataset_df = arize_client.get_dataset(space_id=SPACE_ID, dataset_name=dataset_name)

# Any sampling methods you want on a DF
sampled_df = dataset_df.sample(n=100)  # Sample 100 rows randomly

# Sample 10% of rows randomly
sampled_df = dataset_df.sample(frac=0.1)

# Create proportional sampling based on the original dataset's class label distribution
stratified_sampled_df = dataset_df.groupby('class_label', group_keys=False).apply(lambda x: x.sample(frac=0.1))

# Select every 10th row
systematic_sampled_df = dataset_df.iloc[::10, :]

# Run Experiment on sampled_df
arize_client.run_experiment(space_id, dataset_name, sampled_df, taskfn, evaluators)

An experiment is only matched up with the data that was run against it. You can run experiments with different samples of the same dataset, and the platform will take care of tracking and visualizing them.

Any complex sampling method that can be applied to a dataframe can be used for sampling.
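
As a sketch (reusing the client, dataframe, and task names from above, and assuming run_experiment accepts the same arguments shown earlier), running experiments on two different samples of the same dataset is just two separate run_experiment calls:

# Experiment on a 10% random sample
random_sample_df = dataset_df.sample(frac=0.1)
arize_client.run_experiment(space_id, dataset_name, random_sample_df, taskfn, evaluators)

# Experiment on a stratified 10% sample of the same dataset
stratified_df = dataset_df.groupby('class_label', group_keys=False).apply(lambda x: x.sample(frac=0.1))
arize_client.run_experiment(space_id, dataset_name, stratified_df, taskfn, evaluators)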

Tracing your experiment

When you run an experiment, arize_client.run_experiment() produces a task span attached to the experiment. If you want to add more traces to an experiment run, you can instrument any part of the experiment code, and those spans will be attached below the task span.

Arize tracers instrumented in your experiment code will automatically send these traces to the platform.

Tracing Using Explicit Spans

from opentelemetry import trace

# Outer function will be traced by Arize with a span
def task_add_1(dataset_row):
    tracer = trace.get_tracer(__name__)
    
    # Start the span for the function
    with tracer.start_as_current_span("test_function") as span:
        # Extract the number from the dataset row
        num = dataset_row['attributes.my_number']
        
        # Set 'num' as a span attribute
        span.set_attribute("dataset.my_number", num)
        
    # Return the incremented number
    return num + 1

Tracing Using Auto-Instrumentor

# Import the automatic instrumentor from OpenInference
from openinference.instrumentation.openai import OpenAIInstrumentor
# Import the OpenAI client used inside the task
from openai import OpenAI

# Automatic instrumentation --- this will trace all tasks below that make LLM calls
OpenAIInstrumentor().instrument()

task_prompt_template = "Answer in a few words: {question}"
openai_client = OpenAI()
def task(dataset_row) -> str:
    question = dataset_row["question"]
    message_content = task_prompt_template.format(question=question)
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    return response.choices[0].message.content

To run the experiment with this task, pass the task into run_experiment, as shown below.
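
A minimal sketch, reusing the client and ID variables from the earlier examples:

# The auto-instrumented LLM call inside the task will be attached under the experiment's task span
arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=dataset_id,
    task=task,
    evaluators=[],
    experiment_name="traced-experiment",
)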
