Setting up CI/CD pipelines for LLMs helps you maintain control as your applications evolve. Just as in traditional software, automated testing is crucial for catching issues early. With Arize, you can create experiments that automatically validate changes, whether it's a tweak to a prompt, model, or function, using a curated dataset and your preferred evaluation method. These tests can be integrated with your CI/CD pipeline (for example, GitHub Actions) so they run automatically when you push a change, giving you confidence that your updates are solid without the need for manual testing.
This guide will walk you through setting up an automated experiment using our platform. It includes preparing your experiment file, defining the task and evaluator, and running the experiment.
To test locally, be sure to install the dependencies: pip install -q arize==7.19.0 arize-phoenix==4.21.0 nest_asyncio packaging openai 'gql[all]'
The experiment file organizes all components necessary for conducting your experiment. It includes sections for the dataset, task, and evaluator.
Dataset
The first step is to set up and retrieve your dataset using the ArizeDatasetsClient:
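A minimal sketch of this step is shown below. The environment variable names, dataset name ("router-function-calls"), and the exact constructor and get_dataset keyword arguments are assumptions for this example and can vary between SDK versions, so verify them against the client you have installed.

```python
import os

from arize.experimental.datasets import ArizeDatasetsClient

# Credentials are read from environment variables / CI secrets (names are placeholders).
ARIZE_DEVELOPER_KEY = os.environ["ARIZE_DEVELOPER_KEY"]
SPACE_ID = os.environ["ARIZE_SPACE_ID"]

# Connect to Arize and pull the curated dataset used for this experiment.
client = ArizeDatasetsClient(developer_key=ARIZE_DEVELOPER_KEY)

# "router-function-calls" is an assumed dataset name for this example.
dataset = client.get_dataset(space_id=SPACE_ID, dataset_name="router-function-calls")
```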
Task
Define the task that your model needs to perform. Typically, the task replicates the LLM functionality you're aiming to test. In this example, the focus is on whether the router selected the correct function, so the task involves returning the tool call:
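The sketch below shows one way such a task might look using the OpenAI chat completions API. The model name, the tool schema, the "attributes.input.value" column, and the assumption that the task callable receives the dataset row as a dict are illustrative, not part of the Arize SDK.

```python
import json

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative tool schema for the router under test.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "product_search",
            "description": "Search the product catalog",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]


def task(dataset_row: dict) -> str:
    """Run the router prompt for one dataset row and return the tool call(s) it picks."""
    user_query = dataset_row["attributes.input.value"]  # assumed column name
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_query}],
        tools=TOOLS,
    )
    tool_calls = response.choices[0].message.tool_calls or []
    # Return the selected function(s) and arguments so the evaluator can score them.
    return json.dumps(
        [{"name": c.function.name, "arguments": c.function.arguments} for c in tool_calls]
    )
```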
Evaluator
An evaluator serves as the measure of success for your experiment. You can define multiple evaluators, ranging from simple code-based checks to LLM-as-a-judge evaluations. The evaluator is central to testing and validating the outcomes of your experiment:
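As a sketch, a simple code-based evaluator for this example might compare the tool call returned by the task against an expected value stored in the dataset. The "expected_tool" column name and the (output, dataset_row) parameter names are assumptions; check the evaluator signature supported by your SDK version.

```python
def correct_tool_evaluator(output: str, dataset_row: dict) -> float:
    """Return 1.0 if the task selected the expected tool, else 0.0."""
    expected = dataset_row.get("expected_tool")  # assumed column holding the correct tool name
    return 1.0 if expected and expected in (output or "") else 0.0
```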
Configure and initiate your experiment using run_experiment:
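A hedged sketch of kicking off the run is below, reusing the client, dataset name, task, and evaluator defined above. The exact run_experiment keyword arguments and return value can differ between SDK versions, so treat them as assumptions to verify.

```python
# Launch the experiment: run the task over every dataset row, then score with the evaluator.
experiment = client.run_experiment(
    space_id=SPACE_ID,
    dataset_name="router-function-calls",  # dataset retrieved in the Dataset step
    experiment_name="router-eval-v1",      # see the auto-increment helper below
    task=task,
    evaluators=[correct_tool_evaluator],
)
print(experiment)
```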
You can retrieve information about existing experiments using a GraphQL query. This is useful for tracking experiment history and performance.
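One way to do this with the gql client installed above is sketched here. The endpoint URL, the x-api-key auth header, and the query fields (experiment name, metric name, mean score, as described below) are written from this guide's description and should be verified against the current Arize GraphQL schema.

```python
from gql import Client, gql
from gql.transport.requests import RequestsHTTPTransport


def get_experiments(dataset_id: str) -> list[dict]:
    """Fetch experiment names and mean evaluation scores for a dataset via GraphQL."""
    transport = RequestsHTTPTransport(
        url="https://app.arize.com/graphql",         # Arize GraphQL endpoint
        headers={"x-api-key": ARIZE_DEVELOPER_KEY},  # assumed auth header
    )
    gql_client = Client(transport=transport, fetch_schema_from_transport=False)

    # Field names below are assumptions based on this guide's description --
    # verify them against the live schema before relying on them in CI.
    query = gql(
        """
        query GetExperiments($datasetId: ID!) {
          node(id: $datasetId) {
            ... on Dataset {
              experiments(first: 50) {
                edges {
                  node {
                    name
                    evaluationScoreMetrics { name meanScore }
                  }
                }
              }
            }
          }
        }
        """
    )
    result = gql_client.execute(query, variable_values={"datasetId": dataset_id})
    return [edge["node"] for edge in result["node"]["experiments"]["edges"]]
```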
This function returns a list of experiments with their names, metric names, and mean scores.
Determine Experiment Success
You can use the mean score from an experiment to automatically determine if it passed or failed:
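For example, a check like the following can gate a CI job; it assumes the experiment list has the shape returned by the query sketch above, and the 0.7 threshold mirrors the description below.

```python
import sys


def check_experiment_success(experiments: list[dict], experiment_name: str) -> None:
    """Exit 0 if the named experiment's mean score clears the threshold, else exit 1."""
    for experiment in experiments:
        if experiment["name"] != experiment_name:
            continue
        for metric in experiment["evaluationScoreMetrics"]:
            mean_score = metric["meanScore"]
            print(f"{experiment_name} / {metric['name']}: mean score = {mean_score}")
            if mean_score > 0.7:
                print("Experiment passed.")
                sys.exit(0)
    print("Experiment failed.")
    sys.exit(1)
```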
This function exits with code 0 if the experiment is successful (score > 0.7) or code 1 if it fails.
Auto-increment Experiment Names
To ensure unique experiment names, you can automatically increment the version number:
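A small helper like the one below can do this by scanning the names of existing experiments and bumping a numeric suffix. The trailing "-v<N>" naming convention and the base name are assumptions for this example.

```python
import re


def next_experiment_name(existing_names: list[str], base_name: str = "router-eval") -> str:
    """Return base_name with the next -v<N> suffix, given the names already used."""
    pattern = re.compile(rf"^{re.escape(base_name)}-v(\d+)$")
    versions = [int(m.group(1)) for name in existing_names if (m := pattern.match(name))]
    return f"{base_name}-v{max(versions) + 1 if versions else 1}"


# Example: existing router-eval-v1 and router-eval-v2 -> "router-eval-v3"
print(next_experiment_name(["router-eval-v1", "router-eval-v2"]))
```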
Workflow files are stored in the .github/workflows directory of your repository. Workflow files use YAML syntax and have a .yml extension.
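As a hedged sketch, a minimal workflow that installs the dependencies and runs the experiment on every push might look like the following. The file name, trigger, secret names, and the run_experiment.py script it calls are assumptions for this example, not a prescribed layout.

```yaml
# .github/workflows/llm-experiment.yml (illustrative)
name: Run LLM experiment

on:
  push:
    branches: [main]

jobs:
  experiment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -q arize==7.19.0 arize-phoenix==4.21.0 nest_asyncio packaging openai 'gql[all]'
      - name: Run experiment
        env:
          ARIZE_DEVELOPER_KEY: ${{ secrets.ARIZE_DEVELOPER_KEY }}
          ARIZE_SPACE_ID: ${{ secrets.ARIZE_SPACE_ID }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python run_experiment.py  # assumed name of the experiment file described above
```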