Setting up CI/CD pipelines for LLMs helps you maintain control as your applications evolve. Just like in traditional software, automated testing is crucial to catch issues early. With Arize, you can create experiments that automatically validate changes—whether it's a tweak to a prompt, model, or function—using a curated dataset and your preferred evaluation method. These tests can be integrated with GitHub Actions, so they run automatically when you push a change, giving you confidence that your updates are solid without the need for manual testing.
Setting Up an Automated Experiment
This guide walks you through setting up an automated experiment using our platform. It covers preparing your experiment file, defining the task and evaluator, and running the experiment.
To test locally, be sure to install the dependencies:
pip install -q arize==7.19.0 arize-phoenix==4.21.0 nest_asyncio packaging openai 'gql[all]'
1. Define the Experiment File
The experiment file organizes all components necessary for conducting your experiment. It includes sections for the dataset, task, and evaluator.
Imports
import json

import pandas as pd
from phoenix.evals import llm_classify, OpenAIModel
from openai import OpenAI
from openai.types.chat import ChatCompletionToolParam
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    Evaluator,
)
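The code in the following sections also references a few values that are defined elsewhere in the experiment file: your Arize credentials, the OpenAI client, the model used for the task, and the tool definitions available to the router. A minimal, illustrative setup is sketched below; the key, IDs, model name, and tool schema are placeholders, not values from the original project.

# --- Assumed configuration (placeholders, not real values) ---
ARIZE_API_KEY = "YOUR_ARIZE_DEVELOPER_KEY"
SPACE_ID = "YOUR_SPACE_ID"
DATASET_ID = "YOUR_DATASET_ID"
TASK_MODEL = "gpt-4o"  # placeholder model name

# OpenAI client used by the task; reads OPENAI_API_KEY from the environment
client = OpenAI()

# Illustrative tool definition exposed to the router
avail_tools: list[ChatCompletionToolParam] = [
    {
        "type": "function",
        "function": {
            "name": "product_search",
            "description": "Search the product catalog",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]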
Dataset
The first step is to set up and retrieve your dataset using the ArizeDatasetsClient:
arize_client = ArizeDatasetsClient(developer_key=ARIZE_API_KEY)

# Get the current dataset version
dataset = arize_client.get_dataset(
    space_id=SPACE_ID, dataset_id=DATASET_ID, dataset_version="2024-08-11 23:01:04"
)
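The returned dataset is a pandas DataFrame of your dataset rows (in current SDK versions), so it is worth a quick sanity check that the columns your task expects, such as the prompt-template variables, are present:

# Quick inspection of the retrieved dataset
print(dataset.shape)
print(dataset.columns.tolist())
print(dataset.head())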
Task
Define the task that your model needs to perform. Typically, the task replicates the LLM functionality you're aiming to test. In this example, the focus is on whether the router selected the correct function, so the task returns the tool call:
def task(example) -> str:
    ## You can import directly from your repo to automatically grab the latest version
    from prompt_func.search.search_router import ROUTER_TEMPLATE

    print("running task")

    # Prompt-template variables captured on the original trace for this dataset row
    prompt_vars = json.loads(
        example.dataset_row["attributes.llm.prompt_template.variables"]
    )

    # Re-run the router with the current prompt and tool definitions
    response = client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": ROUTER_TEMPLATE},
        ],
        tools=avail_tools,
    )

    # The experiment output is the tool call(s) the router selected
    tool_response = response.choices[0].message.tool_calls
    return tool_response


def run_task(example) -> str:
    return task(example)
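Before wiring this into an experiment, you can smoke-test the task against a single dataset row. The wrapper below is a hypothetical stand-in that mimics the example.dataset_row interface the task expects, assuming the dataset was retrieved as a DataFrame above:

from types import SimpleNamespace

# Hypothetical local check: wrap the first dataset row so it looks like the
# `example` object the experiment runner passes to the task
sample_example = SimpleNamespace(dataset_row=dataset.iloc[0].to_dict())
print(run_task(sample_example))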
Evaluator
An evaluator serves as the measure of success for your experiment. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations. The evaluator is central to testing and validating the outcomes of your experiment:
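For reference, a simple code-based evaluator can subclass Evaluator and return an EvaluationResult. The sketch below assumes the dataset stores the expected function name in a column called expected_tool (a hypothetical name) and that the task output is the list of tool calls returned above; adjust the column name and the evaluate signature to match your dataset and SDK version.

class CorrectToolEvaluator(Evaluator):
    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        # "expected_tool" is a placeholder column name for the ground-truth function
        expected = dataset_row.get("expected_tool")
        called = output[0].function.name if output else None
        matched = called == expected
        return EvaluationResult(
            score=float(matched),
            label="correct" if matched else "incorrect",
        )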