CI/CD for Automated Experiments

Setting up CI/CD pipelines for LLMs helps you maintain control as your applications evolve. Just like in traditional software, automated testing is crucial to catch issues early. With Arize, you can create experiments that automatically validate changes (whether to a prompt, a model, or a function) using a curated dataset and your preferred evaluation method. These tests can be integrated with GitHub Actions so that they run automatically when you push a change, giving you confidence in your updates without manual testing.

Setting Up an Automated Experiment

This guide will walk you through setting up an automated experiment using our platform. It includes preparing your experiment file, defining the task and evaluator, and running the experiment.

To test locally, first install the dependencies: pip install -q arize==7.19.0 arize-phoenix==4.21.0 nest_asyncio packaging openai 'gql[all]'

1. Define the Experiment File

The experiment file organizes all components necessary for conducting your experiment. It includes sections for the dataset, task, and evaluator.

Imports

import json  # used by the task to parse prompt variables
import re  # used to auto-increment experiment names
import sys  # used to set the exit code for CI

import pandas as pd
from phoenix.evals import llm_classify, OpenAIModel
from openai import OpenAI
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    Evaluator,
)
from openai.types.chat import ChatCompletionToolParam

Dataset

The first step is to set up and retrieve your dataset using the ArizeDatasetsClient:

arize_client = ArizeDatasetsClient(developer_key=ARIZE_API_KEY)

# Get the current dataset version
dataset = arize_client.get_dataset(
    space_id=SPACE_ID, dataset_id=DATASET_ID, dataset_version="2024-08-11 23:01:04"
)
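The snippet above assumes that ARIZE_API_KEY, SPACE_ID, and DATASET_ID are already defined. In CI, one option is to read them from environment variables. The following is a minimal sketch under that assumption; the helper load_settings and the environment variable names are hypothetical and are not part of the Arize SDK:

```python
import os

# Assumed variable names; adjust to whatever your CI secrets are called.
REQUIRED_VARS = ("ARIZE_API_KEY", "SPACE_ID", "DATASET_ID")


def load_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict when testing locally.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing early with a clear message tends to be easier to debug in a CI log than an authentication error raised later by the client.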

Task

Define the tasks that your model needs to perform. Typically, the task replicates the LLM functionality you're aiming to test. In this example, the focus is on whether the router selected the correct function, so the task involves returning the tool call:

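def task(example) -> str:
    # You can import directly from your repo to automatically grab the latest version
    from prompt_func.search.search_router import ROUTER_TEMPLATE
    print("running task")
    # Template variables recorded for this example (not used further in this snippet)
    prompt_vars = json.loads(
        example.dataset_row["attributes.llm.prompt_template.variables"]
    )

    # client (an OpenAI client), TASK_MODEL, and avail_tools are assumed to be
    # defined at module level
    response = client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": ROUTER_TEMPLATE},
        ],
        tools=avail_tools,
    )
    # tool_calls holds the tool(s) the router selected, or None if it selected none
    tool_response = response.choices[0].message.tool_calls
    return tool_response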

def run_task(example) -> str:
    return task(example)
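The task above is annotated as returning str, but tool_calls is a list of tool-call objects (or None when the model calls no tool). If you want a plain string to compare in the evaluator, one option is to serialize the calls first. This is a sketch, not the author's method: tool_calls_to_str is a hypothetical helper, and search_docs is a made-up tool name used only for illustration.

```python
import json
from types import SimpleNamespace


def tool_calls_to_str(tool_calls) -> str:
    """Serialize tool calls into a JSON string (hypothetical helper).

    Assumes each item exposes `.function.name` and `.function.arguments`,
    as tool calls in OpenAI chat completion responses do.
    """
    if not tool_calls:  # the model may answer without calling a tool
        return "[]"
    return json.dumps(
        [
            {"name": call.function.name, "arguments": call.function.arguments}
            for call in tool_calls
        ]
    )


# Stand-in object for illustration only; a real response supplies these.
example_call = SimpleNamespace(
    function=SimpleNamespace(name="search_docs", arguments='{"query": "pricing"}')
)
print(tool_calls_to_str([example_call]))
```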

Evaluator

An evaluator serves as the measure of success for your experiment. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations. The evaluator is central to testing and validating the outcomes of your experiment:

class CorrectnessEvaluator(Evaluator):
    annotator_kind = "CODE"
    name = "ai_search_correctness"

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        print("evaluating outputs")
        expected_output = dataset_row["attributes.llm.output_messages"]
        df_in = pd.DataFrame(
            {"selected_output": output, "expected_output": expected_output}, index=[0]
        )

        expect_df = llm_classify(
            dataframe=df_in,
            template=EVALUATOR_TEMPLATE,
            model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
            rails=["correct", "incorrect"],
            provide_explanation=True,
        )

        label = expect_df["label"][0]
        score = 1 if label == "correct" else 0
        explanation = expect_df["explanation"][0]
        return EvaluationResult(score=score, label=label, explanation=explanation)
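For comparison with the LLM-based judge above, a code-based check can be as simple as a string comparison. The sketch below is an assumption about what such a check might look like, not part of the Arize SDK: exact_match_result is a hypothetical function that returns a dict whose keys mirror the score, label, and explanation fields passed to EvaluationResult above.

```python
def exact_match_result(output: str, expected_output: str) -> dict:
    """Compare task output to the expected output, ignoring case and
    surrounding whitespace. Hypothetical helper for illustration."""
    is_match = output.strip().lower() == expected_output.strip().lower()
    return {
        "score": 1 if is_match else 0,
        "label": "correct" if is_match else "incorrect",
        "explanation": "exact match" if is_match else "output differs from expected",
    }
```

Inside an Evaluator subclass, the returned dict could be unpacked into the result, for example EvaluationResult(**exact_match_result(output, expected_output)). A deterministic check like this is cheaper and more repeatable than an LLM judge, at the cost of rejecting answers that are correct but worded differently.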

Run the Experiment

Configure and initiate your experiment using run_experiment:

experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[CorrectnessEvaluator()],
    experiment_name="Your_Experiment_Name"
)

Advanced Experiment Management

You can retrieve information about existing experiments using a GraphQL query. This is useful for tracking experiment history and performance.

from gql import Client, gql
from gql.transport.requests import RequestsHTTPTransport

def fetch_experiment_details(gql_client, dataset_id):
    experiments_query = gql(
        """
        query getExperimentDetails($DatasetId: ID!) {
          node(id: $DatasetId) {
            ... on Dataset {
              name
              experiments(first: 1) {
                edges {
                  node {
                    name
                    createdAt
                    evaluationScoreMetrics {
                      name
                      meanScore
                    }
                  }
                }
              }
            }
          }
        }
        """
    )

    params = {"DatasetId": dataset_id}
    response = gql_client.execute(experiments_query, params)
    experiments = response["node"]["experiments"]["edges"]
    
    experiments_list = []
    for experiment in experiments:
        node = experiment["node"]
        experiment_name = node["name"]
        for metric in node["evaluationScoreMetrics"]:
            experiments_list.append([
                experiment_name,
                metric["name"],
                metric["meanScore"]
            ])
    
    return experiments_list

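This function returns a list of [experiment name, metric name, mean score] rows, one per evaluation metric. Because the query requests experiments(first: 1), only the first experiment returned for the dataset is included.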

Determine Experiment Success

You can use the mean score from an experiment to automatically determine if it passed or failed:

def determine_experiment_success(experiment_result):
    success = experiment_result > 0.7
    sys.exit(0 if success else 1)

This function exits with code 0 if the experiment is successful (score > 0.7) or code 1 if it fails.
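To connect the rows returned by fetch_experiment_details to the pass/fail check, you need to pick out the mean score for one metric. The sketch below shows one way to do that; mean_score_for and exit_code_for are hypothetical helpers, and the row values are illustrative rather than real results. Returning the exit code instead of calling sys.exit directly keeps the logic easy to test.

```python
def mean_score_for(rows, metric_name):
    """Return the mean score for `metric_name` from
    [experiment_name, metric_name, mean_score] rows, or None if absent."""
    for _experiment_name, name, mean_score in rows:
        if name == metric_name:
            return mean_score
    return None


def exit_code_for(mean_score, threshold=0.7):
    """Return 0 when the score strictly exceeds the threshold, else 1.

    A missing score (None) is treated as a failure.
    """
    if mean_score is None:
        return 1
    return 0 if mean_score > threshold else 1


rows = [["AI Search V1.1", "ai_search_correctness", 0.85]]  # illustrative values
print(exit_code_for(mean_score_for(rows, "ai_search_correctness")))  # prints 0
```

In the CI script, the result would be passed to sys.exit so that a non-zero code marks the workflow step as failed.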

Auto-increment Experiment Names

To ensure unique experiment names, you can automatically increment the version number:

def increment_experiment_name(experiment_name):
    ## example name: AI Search V1.1
    match = re.search(r"V(\d+)\.(\d+)", experiment_name)
    if not match:
        return experiment_name

    major, minor = map(int, match.groups())
    new_version = f"V{major}.{minor + 1}"
    return re.sub(r"V\d+\.\d+", new_version, experiment_name)
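As a worked example, the function above is repeated here with its import so the snippet runs on its own. The experiment names are illustrative.

```python
import re


def increment_experiment_name(experiment_name):
    # Same logic as above: bump the minor part of a "V<major>.<minor>" suffix.
    match = re.search(r"V(\d+)\.(\d+)", experiment_name)
    if not match:
        return experiment_name

    major, minor = map(int, match.groups())
    new_version = f"V{major}.{minor + 1}"
    return re.sub(r"V\d+\.\d+", new_version, experiment_name)


print(increment_experiment_name("AI Search V1.1"))  # prints AI Search V1.2
print(increment_experiment_name("AI Search V1.9"))  # prints AI Search V1.10
print(increment_experiment_name("AI Search"))       # prints AI Search
```

Note that a name without a version suffix is returned unchanged, so it will not be unique on a second run unless the initial name already contains a version such as V1.0.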

2. Define the Workflow File

  • Workflow files are stored in the .github/workflows directory of your repository.

  • Workflow files use YAML syntax and have a .yml or .yaml extension.

Example workflow file:

name: AI Search - Correctness Check

on:
  push:
    paths:
      # Match any file under copilot/search; a bare directory name only matches that exact path
      - 'copilot/search/**'

jobs:
  run-script:
    runs-on: ubuntu-latest
    env:
      OPENAI_KEY: ${{ secrets.OPENAI_KEY }}  

    steps:
    - name: Checkout repository
      uses: actions/checkout@v2

    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.10'

    - name: Install dependencies
      run: |
        pip install -q arize==7.19.0 arize-phoenix==4.21.0 nest_asyncio packaging openai 'gql[all]'
    - name: Run script
      run: python ./copilot/experiments/ai_search_test.py

Copyright © 2023 Arize AI, Inc