Setting up CI/CD pipelines for LLMs helps you maintain control as your applications evolve. Just like in traditional software, automated testing is crucial to catch issues early. With Arize, you can create experiments that automatically validate changes—whether it's a tweak to a prompt, model, or function—using a curated dataset and your preferred evaluation method. These tests can be integrated with GitHub Actions, so they run automatically when you push a change, giving you confidence that your updates are solid without the need for manual testing.
Setting Up an Automated Experiment
This guide will walk you through setting up an automated experiment using our platform. It includes preparing your experiment file, defining the task and evaluator, and running the experiment.
To test locally, be sure to install the dependencies:

```
pip install -q arize==7.19.0 arize-phoenix==4.21.0 nest_asyncio packaging openai 'gql[all]'
```
1. Define the Experiment File
The experiment file organizes all components necessary for conducting your experiment. It includes sections for the dataset, task, and evaluator.
The first step is to set up and retrieve your dataset using the ArizeDatasetsClient:
```python
from arize.experimental.datasets import ArizeDatasetsClient

arize_client = ArizeDatasetsClient(developer_key=ARIZE_API_KEY)

# Get the current dataset version
dataset = arize_client.get_dataset(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    dataset_version="2024-08-11 23:01:04",
)
```
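In CI, the API key and IDs used above typically come from repository secrets rather than values hard-coded in the experiment file. Below is a minimal sketch, assuming the secrets are exposed to the job as environment variables; the variable names are illustrative, not part of this guide:

```python
import os

# Illustrative names -- map these to however your CI exposes secrets,
# e.g. GitHub Actions `env:` entries backed by repository secrets.
ARIZE_API_KEY = os.environ["ARIZE_API_KEY"]
SPACE_ID = os.environ["ARIZE_SPACE_ID"]
DATASET_ID = os.environ["ARIZE_DATASET_ID"]
```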
Task
Define the task that your model needs to perform. Typically, the task replicates the LLM functionality you're aiming to test. In this example, the focus is on whether the router selected the correct function, so the task returns the model's tool call:
```python
def task(example) -> str:
    ## You can import directly from your repo to automatically grab the latest version
    from prompt_func.search.search_router import ROUTER_TEMPLATE

    print("running task")
    prompt_vars = json.loads(
        example.dataset_row["attributes.llm.prompt_template.variables"]
    )
    response = client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": ROUTER_TEMPLATE},
        ],
        tools=avail_tools,
    )
    tool_response = response.choices[0].message.tool_calls
    return tool_response


def run_task(example) -> str:
    return task(example)
```
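Before wiring the task into an experiment, it can be handy to smoke-test it against a single dataset row locally. The snippet below is a sketch, not part of the guide: it assumes `get_dataset` returned a pandas DataFrame and fakes the `example` object the experiment runner would normally pass in.

```python
from types import SimpleNamespace

# Wrap one dataset row in an object exposing `.dataset_row`, mimicking the
# shape the experiment runner hands to `task`. Assumes `dataset` is a
# pandas DataFrame returned by `get_dataset`.
fake_example = SimpleNamespace(dataset_row=dataset.iloc[0].to_dict())
print(run_task(fake_example))
```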
Evaluator
An evaluator serves as the measure of success for your experiment. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations. The evaluator is central to testing and validating the outcomes of your experiment:
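For example, a code-based evaluator for this router test could simply check whether the tool call returned by the task names the expected function. The sketch below is illustrative only: the evaluator signature and the `expected_tool` column name are assumptions, and the exact arguments the Arize experiments SDK passes to evaluators may differ by version.

```python
def correct_tool_selected(output, dataset_row) -> float:
    """Return 1.0 if the task's first tool call matches the expected function name."""
    expected_tool = dataset_row.get("expected_tool")  # hypothetical ground-truth column
    if not output:  # the task returned no tool calls
        return 0.0
    called_tool = output[0].function.name  # first tool call from the OpenAI response
    return 1.0 if called_tool == expected_tool else 0.0
```

Once the dataset, task, and evaluator are defined, they are passed together to the experiments client (for example via `run_experiment`), so the whole check can run unattended whenever your GitHub Action fires.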