CI/CD with experiments

Last updated 2 months ago

Was this helpful?

Setting up CI/CD pipelines for LLMs helps you maintain control as your applications evolve. Just like in traditional software, automated testing is crucial to catch issues early. With Arize, you can create experiments that automatically validate changes, whether to a prompt, model, or function, using a curated dataset and your preferred evaluation method. These tests can be integrated with GitHub Actions or GitLab CI/CD so they run automatically when you push a change, giving you confidence that your updates are solid without the need for manual testing.

Setting Up an Automated Experiment

This guide will walk you through setting up an automated experiment using our platform. It includes preparing your experiment file, defining the task and evaluator, and running the experiment.

To test locally, be sure to install the dependencies:

pip install -q arize==7.19.0 arize-phoenix==4.21.0 nest_asyncio packaging openai 'gql[all]'

1. Define the Experiment File

The experiment file organizes all components necessary for conducting your experiment. It includes sections for the dataset, task, and evaluator.

Imports

import json  # used to parse the prompt template variables in the task
import re    # used by increment_experiment_name
import sys   # used by determine_experiment_success

import pandas as pd
from phoenix.evals import llm_classify, OpenAIModel
from openai import OpenAI
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    Evaluator,
)
from openai.types.chat import ChatCompletionToolParam

Dataset

The first step is to set up and retrieve your dataset using the ArizeDatasetsClient:

arize_client = ArizeDatasetsClient(developer_key=ARIZE_API_KEY)

# Get the current dataset version
dataset = arize_client.get_dataset(
    space_id=SPACE_ID, dataset_id=DATASET_ID, dataset_version="2024-08-11 23:01:04"
)

Task

Define the task that your model needs to perform. Typically, the task replicates the LLM functionality you're aiming to test. In this example, the focus is on whether the router selected the correct function, so the task returns the tool call:

def task(example) -> str:
    ## You can import directly from your repo to automatically grab the latest version
    from prompt_func.search.search_router import ROUTER_TEMPLATE
    print("running task")
    prompt_vars = json.loads(
        example.dataset_row["attributes.llm.prompt_template.variables"]
    )

    response = client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": ROUTER_TEMPLATE},
        ],
        tools=avail_tools,
    )
    tool_response = response.choices[0].message.tool_calls
    return tool_response

def run_task(example) -> str:
    return task(example)
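
The task above references a few names that are defined elsewhere in the experiment file. A minimal sketch of what they might look like is shown below; the client setup, the TASK_MODEL value, and the avail_tools schema are illustrative assumptions, not part of the Arize SDK:

import os

# Hypothetical definitions for the names used in task(); adjust to your own router.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
TASK_MODEL = "gpt-4o-mini"

# Example tool definition in the OpenAI tools format; replace with your own functions.
avail_tools: list[ChatCompletionToolParam] = [
    {
        "type": "function",
        "function": {
            "name": "search_docs",
            "description": "Search the documentation index for relevant pages.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]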

Evaluator

An evaluator serves as the measure of success for your experiment. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations. The evaluator is central to testing and validating the outcomes of your experiment:

def function_selection(output, dataset_row, **kwargs) -> EvaluationResult:
    print("evaluating outputs")
    expected_output = dataset_row["attributes.llm.output_messages"]
    df_in = pd.DataFrame(
        {"selected_output": output, "expected_output": expected_output}, index=[0]
    )
    rails = ["incorrect", "correct"]
    expect_df = llm_classify(
        dataframe=df_in,
        template=EVALUATOR_TEMPLATE,
        model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
        rails=rails,
        provide_explanation=True,
    )

    label = expect_df["label"][0]
    score = 1 if label == "correct" else 0
    explanation = expect_df["explanation"][0]
    return EvaluationResult(score=score, label=label, explanation=explanation)
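
The evaluator relies on an EVALUATOR_TEMPLATE defined alongside your prompts. As a rough, assumed example (not an Arize-provided template), llm_classify fills the {selected_output} and {expected_output} placeholders from the matching dataframe columns, and the model's answer must land on one of the rails:

# Hypothetical evaluator prompt; tailor the criteria to your own use case.
EVALUATOR_TEMPLATE = """
You are comparing a selected tool call against the expected output.

[Selected Output]: {selected_output}
[Expected Output]: {expected_output}

Answer with a single word, "correct" if the selected output calls the same
function with equivalent arguments as the expected output, otherwise "incorrect".
"""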

Run the Experiment

Configure and initiate your experiment using run_experiment:

experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[function_selection],
    experiment_name="Your_Experiment_Name"
)

Advanced Experiment Management

You can retrieve information about existing experiments using a GraphQL query. This is useful for tracking experiment history and performance.

from gql import Client, gql
from gql.transport.requests import RequestsHTTPTransport

def fetch_experiment_details(gql_client, dataset_id):
    experiments_query = gql(
        """
        query getExperimentDetails($DatasetId:ID!){
        node(id: $DatasetId) {
            ... on Dataset {
            name
            experiments(first: 1){
                edges{
                node{
                    name
                    createdAt
                    evaluationScoreMetrics{
                        name
                        meanScore
                    }
                }
                }
            }
            }
        }
        }
        """
    )

    params = {"DatasetId": dataset_id}
    response = gql_client.execute(experiments_query, params)
    experiments = response["node"]["experiments"]["edges"]
    
    experiments_list = []
    for experiment in experiments:
        node = experiment["node"]
        experiment_name = node["name"]
        for metric in node["evaluationScoreMetrics"]:
            experiments_list.append([
                experiment_name,
                metric["name"],
                metric["meanScore"]
            ])
    
    return experiments_list

This function returns a list of experiments with their names, metric names, and mean scores.
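
The gql imports above are what you would use to construct the gql_client passed into this function. A minimal sketch, assuming your developer API key is available as ARIZE_API_KEY (the endpoint URL and header name are assumptions; confirm them against your Arize account):

# Build a GraphQL client for the Arize API.
transport = RequestsHTTPTransport(
    url="https://app.arize.com/graphql",
    headers={"x-api-key": ARIZE_API_KEY},
)
gql_client = Client(transport=transport, fetch_schema_from_transport=False)

experiments_list = fetch_experiment_details(gql_client, DATASET_ID)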

Determine Experiment Success

You can use the mean score from an experiment to automatically determine if it passed or failed:

def determine_experiment_success(experiment_result):
    success = experiment_result > 0.7
    sys.exit(0 if success else 1)

This function exits with code 0 if the experiment is successful (score > 0.7) or code 1 if it fails.

Auto-increment Experiment Names

To ensure unique experiment names, you can automatically increment the version number:

def increment_experiment_name(experiment_name):
    ## example name: AI Search V1.1
    match = re.search(r"V(\d+)\.(\d+)", experiment_name)
    if not match:
        return experiment_name

    major, minor = map(int, match.groups())
    new_version = f"V{major}.{minor + 1}"
    return re.sub(r"V\d+\.\d+", new_version, experiment_name)
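
Putting the helpers together, the CI entry point of the experiment file might look something like the sketch below. The ordering, the use of the most recent experiment's name and score, and the threshold inside determine_experiment_success are illustrative; adapt them to your own script:

if __name__ == "__main__":
    # Derive a new experiment name from the most recent one on the dataset.
    previous = fetch_experiment_details(gql_client, DATASET_ID)
    last_name = previous[0][0] if previous else "AI Search V1.0"
    new_name = increment_experiment_name(last_name)

    # Run the experiment defined above.
    arize_client.run_experiment(
        space_id=SPACE_ID,
        dataset_id=DATASET_ID,
        task=run_task,
        evaluators=[function_selection],
        experiment_name=new_name,
    )

    # Re-query the latest experiment and gate the pipeline on its mean score.
    latest = fetch_experiment_details(gql_client, DATASET_ID)
    mean_score = latest[0][2]
    determine_experiment_success(mean_score)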

2. Define Workflow (CI/CD) File

GitHub Actions:

  • Workflow files are stored in the .github/workflows directory of your repository.

  • Workflow files use YAML syntax and have a .yml extension.

Example Workflow File:

name: AI Search - Correctness Check

on:
  push:
    paths:
      - 'copilot/search/**'


jobs:
  run-script:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      ARIZE_API_KEY: ${{ secrets.ARIZE_API_KEY }}
      SPACE_ID: ${{ secrets.SPACE_ID }}
      DATASET_ID: ${{ secrets.DATASET_ID }}

    steps:
    - name: Checkout repository
      uses: actions/checkout@v2

    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.10'

    - name: Install dependencies
      run: |
        pip install -q arize==7.36.0 arize-phoenix==4.29.0 nest_asyncio packaging openai 'gql[all]'
    - name: Run script
      run: python ./copilot/experiments/ai_search_test.py

GitLab CI/CD

GitLab CI/CD pipelines are defined in a .gitlab-ci.yml file stored in the root of your repository. You can use YAML syntax to define your pipeline.

Example .gitlab-ci.yml File:

stages:
  - test

variables:
  # These variables need to be defined in GitLab CI/CD settings
  # The $ syntax is how GitLab references variables
  OPENAI_API_KEY: $OPENAI_API_KEY
  ARIZE_API_KEY: $ARIZE_API_KEY
  SPACE_ID: $SPACE_ID
  DATASET_ID: $DATASET_ID

llm-experiment-job:
  stage: test
  image: python:3.10
  # The 'only' directive specifies when this job should run
  # This will run for merge requests that change files in copilot/search
  only:
    refs:
      - merge_requests
    changes:
      - copilot/search/**/*
  script:
    - pip install -q arize==7.36.0 arize-phoenix==4.29.0 nest_asyncio packaging openai 'gql[all]'
    - python ./copilot/experiments/ai_search_test.py
  artifacts:
    paths:
      - experiment_results.json
    expire_in: 1 week
