CI/CD for Automated Experiments

Setting up CI/CD pipelines for LLMs helps you maintain control as your applications evolve. Just like in traditional software, automated testing is crucial to catch issues early. With Arize, you can create experiments that automatically validate changes to a prompt, model, or function using a curated dataset and your preferred evaluation method. These tests can be integrated with GitHub Actions or GitLab CI/CD so they run automatically when you push a change, giving you confidence that your updates are solid without manual testing.

Setting Up an Automated Experiment

This guide will walk you through setting up an automated experiment using our platform. It includes preparing your experiment file, defining the task and evaluator, and running the experiment.

To test locally, be sure to install the dependencies: pip install -q arize==7.36.0 arize-phoenix==4.29.0 nest_asyncio packaging openai 'gql[all]'

1. Define the Experiment File

The experiment file organizes all components necessary for conducting your experiment. It includes sections for the dataset, task, and evaluator.

Imports

import json
import re
import sys

import pandas as pd
from phoenix.evals import llm_classify, OpenAIModel
from openai import OpenAI
from openai.types.chat import ChatCompletionToolParam

from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    Evaluator,
)
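The API keys and IDs referenced throughout this file (ARIZE_API_KEY, OPENAI_API_KEY, SPACE_ID, DATASET_ID) are not defined in the snippets below. A minimal sketch of reading them from environment variables, matching the variable names the CI jobs later in this guide pass to the script:

import os

# Read credentials and identifiers from the environment so the same script
# runs locally and inside CI.
ARIZE_API_KEY = os.environ["ARIZE_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
SPACE_ID = os.environ["SPACE_ID"]
DATASET_ID = os.environ["DATASET_ID"]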

Dataset

arize_client = ArizeDatasetsClient(developer_key=ARIZE_API_KEY)

# Get the current dataset version
dataset = arize_client.get_dataset(
    space_id=SPACE_ID, dataset_id=DATASET_ID, dataset_version="2024-08-11 23:01:04"
)

Task

Define the task that your model needs to perform. Typically, the task replicates the LLM functionality you're aiming to test. In this example, the focus is on whether the router selected the correct function, so the task returns the tool call chosen by the model:

def task(example) -> str:
    ## You can import directly from your repo to automatically grab the latest version
    from prompt_func.search.search_router import ROUTER_TEMPLATE

    print("running task")
    # Template variables captured with the original trace; interpolate them into
    # your prompt or messages as needed for your router.
    prompt_vars = json.loads(
        example.dataset_row["attributes.llm.prompt_template.variables"]
    )

    # `client`, `TASK_MODEL`, and `avail_tools` are module-level objects
    # (see the sketch after this code block).
    response = client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": ROUTER_TEMPLATE},
        ],
        tools=avail_tools,
    )
    tool_response = response.choices[0].message.tool_calls
    return tool_response

def run_task(example) -> str:
    return task(example)
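The task assumes an OpenAI client, a task model, and a tool list are defined at module level. None of these appear elsewhere in this guide, so the following is a sketch with placeholder values; substitute the model and the tool definitions your router actually uses:

# Assumed module-level setup for the task above; the model, tool name, and schema are placeholders.
client = OpenAI(api_key=OPENAI_API_KEY)

TASK_MODEL = "gpt-4o-mini"  # placeholder; use the model your router runs in production

avail_tools: list[ChatCompletionToolParam] = [
    {
        "type": "function",
        "function": {
            "name": "search_docs",  # placeholder tool
            "description": "Search the documentation index for relevant passages.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]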

Evaluator

An evaluator serves as the measure of success for your experiment. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations. The evaluator is central to testing and validating the outcomes of your experiment:

def function_selection(output, dataset_row, **kwargs) -> EvaluationResult:
    print("evlauating outputs")
    output = output
    expected_output = dataset_row["attributes.llm.output_messages"]
    df_in = pd.DataFrame(
        {"selected_output": output, "expected_output": expected_output}, index=[0]
    )
    rails = ["incorrect", "correct"]
    expect_df = llm_classify(
        dataframe=df_in,
        template=EVALUATOR_TEMPLATE,
        model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
        rails=rails,
        provide_explanation=True,
    )

    label = expect_df["label"][0]
    score = 1 if label == "correct" else 0
    explanation = expect_df["explanation"][0]
    return EvaluationResult(score=score, label=label, explanation=explanation)
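llm_classify formats the template with the columns of the dataframe it receives, so EVALUATOR_TEMPLATE should reference the selected_output and expected_output columns and instruct the judge to answer with one of the rails. The wording below is a sketch, not the template used in this tutorial:

# Placeholder judge prompt; adjust the criteria for your use case.
EVALUATOR_TEMPLATE = """You are judging whether a router selected the correct function.

[Selected Output]: {selected_output}
[Expected Output]: {expected_output}

Compare the selected tool call against the expected output and answer with exactly
one of: correct, incorrect."""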

Run the Experiment

Configure and initiate your experiment using run_experiment:

experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[function_selection],
    experiment_name="Your_Experiment_Name"
)

Advanced Experiment Management

You can retrieve information about existing experiments using a GraphQL query. This is useful for tracking experiment history and performance.

from gql import Client, gql
from gql.transport.requests import RequestsHTTPTransport

def fetch_experiment_details(gql_client, dataset_id):
    experiments_query = gql(
        """
        query getExperimentDetails($DatasetId: ID!) {
          node(id: $DatasetId) {
            ... on Dataset {
              name
              experiments(first: 1) {
                edges {
                  node {
                    name
                    createdAt
                    evaluationScoreMetrics {
                      name
                      meanScore
                    }
                  }
                }
              }
            }
          }
        }
        """
    )

    params = {"DatasetId": dataset_id}
    response = gql_client.execute(experiments_query, params)
    experiments = response["node"]["experiments"]["edges"]
    
    experiments_list = []
    for experiment in experiments:
        node = experiment["node"]
        experiment_name = node["name"]
        for metric in node["evaluationScoreMetrics"]:
            experiments_list.append([
                experiment_name,
                metric["name"],
                metric["meanScore"]
            ])
    
    return experiments_list

This function returns a list of experiments with their names, metric names, and mean scores.
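fetch_experiment_details expects an authenticated gql client. A minimal sketch of constructing one against Arize's GraphQL API follows; the endpoint URL and header name here are assumptions, so confirm them against the Arize GraphQL API docs for your account:

# Assumed endpoint and auth header; verify against the Arize GraphQL API docs.
transport = RequestsHTTPTransport(
    url="https://app.arize.com/graphql",
    headers={"x-api-key": ARIZE_API_KEY},
)
gql_client = Client(transport=transport, fetch_schema_from_transport=False)

experiments = fetch_experiment_details(gql_client, DATASET_ID)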

Determine Experiment Success

You can use the mean score from an experiment to automatically determine if it passed or failed:

def determine_experiment_success(experiment_result):
    success = experiment_result > 0.7
    sys.exit(0 if success else 1)

This function exits with code 0 if the experiment is successful (score > 0.7) or code 1 if it fails.
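For example, you can gate the CI job on the score returned by the GraphQL query above (this assumes the latest experiment reports the single correctness metric defined earlier):

# Each entry is [experiment name, metric name, mean score]; with experiments(first: 1)
# the query returns a single experiment.
name, metric, mean_score = experiments[0]
determine_experiment_success(mean_score)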

Auto-increment Experiment Names

To ensure unique experiment names, you can automatically increment the version number:

def increment_experiment_name(experiment_name):
    ## example name: AI Search V1.1
    match = re.search(r"V(\d+)\.(\d+)", experiment_name)
    if not match:
        return experiment_name

    major, minor = map(int, match.groups())
    new_version = f"V{major}.{minor + 1}"
    return re.sub(r"V\d+\.\d+", new_version, experiment_name)
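For example, you can derive the next experiment name from the most recent one returned by fetch_experiment_details and pass it to run_experiment. This sketch assumes the query returns the most recent experiment first; the fallback name is a placeholder:

# e.g. "AI Search V1.1" -> "AI Search V1.2"
last_name = experiments[0][0] if experiments else "AI Search V1.0"
next_name = increment_experiment_name(last_name)

experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[function_selection],
    experiment_name=next_name,
)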

2. Define Workflow (CI/CD) File

GitHub Actions:

  • Workflow files are stored in the .github/workflows directory of your repository.

  • Workflow files use YAML syntax and have a .yml extension.

Example Workflow File:

name: AI Search - Correctness Check

on:
  push:
    paths:
      - copilot/search/**

jobs:
  run-script:
    runs-on: ubuntu-latest
    env:
      # These secrets need to be defined in your repository's Actions secrets
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      ARIZE_API_KEY: ${{ secrets.ARIZE_API_KEY }}
      SPACE_ID: ${{ secrets.SPACE_ID }}
      DATASET_ID: ${{ secrets.DATASET_ID }}

    steps:
    - name: Checkout repository
      uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.10'

    - name: Install dependencies
      run: |
        pip install -q arize==7.36.0 arize-phoenix==4.29.0 nest_asyncio packaging openai 'gql[all]'

    - name: Run script
      run: python ./copilot/experiments/ai_search_test.py

GitLab CI/CD

GitLab CI/CD pipelines are defined in a .gitlab-ci.yml file stored at the root of your repository and use the same YAML syntax.

Example .gitlab-ci.yml File:

stages:
  - test

variables:
  # These variables need to be defined in GitLab CI/CD settings
  # The $ syntax is how GitLab references variables
  OPENAI_API_KEY: $OPENAI_API_KEY
  ARIZE_API_KEY: $ARIZE_API_KEY
  SPACE_ID: $SPACE_ID
  DATASET_ID: $DATASET_ID

llm-experiment-job:
  stage: test
  image: python:3.10
  # The 'only' directive specifies when this job should run
  # This will run for merge requests that change files in copilot/search
  only:
    refs:
      - merge_requests
    changes:
      - copilot/search/**/*
  script:
    - pip install -q arize==7.36.0 arize-phoenix==4.29.0 nest_asyncio packaging openai 'gql[all]'
    - python ./copilot/experiments/ai_search_test.py
  artifacts:
    paths:
      - experiment_results.json
    expire_in: 1 week
