This guide helps you run experiments to test and validate changes in your LLM applications against a curated dataset. Learn more about the concepts of experiments here.
You can create and run an experiment with the following steps:
```python
from arize.experimental.datasets import ArizeDatasetsClient

# Set up the Arize datasets client with your developer key
client = ArizeDatasetsClient(developer_key="*****")
```
Create your dataset
You can copy and paste the sample code below, but for a more comprehensive guide, check out the quickstart: datasets guide.
```python
import pandas as pd

from arize.experimental.datasets.utils.constants import GENERATIVE

data = [
    {
        "persona": "An aspiring musician who is writing their own songs",
        "problem": "I often get stuck overthinking my lyrics and melodies.",
    }
]
df = pd.DataFrame(data)

dataset_id = client.create_dataset(
    space_id="YOUR_SPACE_ID",
    dataset_name="Your Dataset",
    dataset_type=GENERATIVE,
    data=df,
)
print(f"Dataset created with ID: {dataset_id}")
```
Define your LLM task
Here is where you define the LLM task you are trying to test for its effectiveness. This could be structured data extraction, SQL generation, chatbot response generation, search, or any task of your choosing.
The task takes the dataset_row as input, so you can access the variables in your dataset, and it returns a string as output.
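For example, a minimal task that skips the LLM entirely and simply echoes the row is shown below; this is a hypothetical sketch to illustrate the function signature, not the task used in this guide.

```python
# Minimal sketch: a task is any function that takes a dataset row and returns a string
def echo_task(dataset_row) -> str:
    # dataset_row exposes the columns of your dataset, e.g. "persona" and "problem"
    return f"Persona: {dataset_row['persona']} | Problem: {dataset_row['problem']}"
```

The task used in the rest of this guide formats those same variables into a prompt and calls OpenAI: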
```python
import os

from openai import OpenAI

# Define our prompt for an AI product manager
PROMISE_GENERATOR = """You are an expert product manager. I will provide you with context on my current thinking for persona and problem. Your job is to suggest the best promise based on the context provided. The promise should help the persona reach an outcome they care about and solve their primary problem. Return only the promise and nothing else.

[BEGIN CONTEXT]
Persona: {persona}
Problem: {problem}
[END CONTEXT]
"""

# Set up our OpenAI client
openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Define our task
def task(dataset_row) -> str:
    # Pull the prompt variables from the dataset row
    problem = dataset_row["problem"]
    persona = dataset_row["persona"]

    # Build the prompt
    user_message = {
        "role": "user",
        "content": PROMISE_GENERATOR.format(problem=problem, persona=persona),
    }

    # Run the LLM and return the response
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[user_message],
        temperature=1,
    )
    return response.choices[0].message.content
```
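Before running the full experiment, you can sanity-check the task by calling it directly on a dict shaped like a dataset row. The sample values below mirror the dataset created earlier, and this assumes OPENAI_API_KEY is set in your environment.

```python
# Quick local check of the task on a hand-written row
sample_row = {
    "persona": "An aspiring musician who is writing their own songs",
    "problem": "I often get stuck overthinking my lyrics and melodies.",
}
print(task(sample_row))
```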
Define your evaluation criteria
An evaluator is any function that (1) takes the task output and (2) returns an assessment. This gives you tremendous flexibility to write your own LLM judge using a custom template or to use code-based evaluators; a minimal code-based sketch follows the list of concepts below.
In this example, we will create our own custom LLM template to judge whether the output satisfied the criteria. There are a few key concepts you need to know to create an evaluator.
class Evaluator - when creating a custom Evaluator, you need to define its name and the Evaluator.evaluate() function.
Evaluator.evaluate(self, output, dataset_row) - this function takes in the output from the LLM task above along with the dataset_row, which has context and prompt variables, so you can assess the quality of the task output. You must return an EvaluationResult.
EvaluationResult - this class requires three inputs so we can display evaluations correctly in Arize.
score - this is a numeric score which we aggregate and display in the UI across your experiment runs. It must be a value between 0 and 1, inclusive.
label - this is a classification label to determine success or failure of your evaluation, such as ("Hallucinated", "Factual").
explanation - this is a string where you can provide why the output passed or failed the evaluation criteria.
llm_classify - this function uses Arize Phoenix to run LLM-based evaluations. See here for the API reference.
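Before the full LLM-as-judge example, here is a minimal code-based evaluator that ties these concepts together. It is a sketch with a hypothetical length check, not the evaluator used in this guide.

```python
from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    Evaluator,
)

class LengthEvaluator(Evaluator):
    name = "Length Evaluator"

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        # Hypothetical heuristic: pass if the promise is a single, reasonably short statement
        passed = bool(output) and len(output) <= 200
        return EvaluationResult(
            score=1 if passed else 0,
            label="concise" if passed else "too_long",
            explanation=f"Output length was {len(output or '')} characters.",
        )
```

The custom LLM judge below follows the same structure but delegates the assessment to llm_classify: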
```python
# Import our libraries
import pandas as pd

from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    Evaluator,
)
from phoenix.evals import (
    OpenAIModel,
    llm_classify,
)

# Create our evaluation template
CUSTOM_EVAL_TEMPLATE = """You are evaluating whether the product promise satisfies the following criteria given the following context.

[Persona]: {persona}
[Problem]: {problem}
[Product Promise]: {output}

[BEGIN CRITERIA]
The product promise should be something that the persona would be excited to try.
[END CRITERIA]

Your potential answers:
(A) The product promise is compelling to this persona to solve this problem.
(B) The product promise is not compelling to this persona for this problem.
(C) It cannot be determined whether this is compelling to the user.

Return a single letter from this list ["A", "B", "C"]
"""

class PromiseEvaluator(Evaluator):
    name = "Promise Evaluator"

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        # Get prompt variables
        problem = dataset_row["problem"]
        persona = dataset_row["persona"]

        # Set up the rails and the input dataframe
        rails = ["A", "B", "C"]
        df_in = pd.DataFrame(
            {"output": output, "problem": problem, "persona": persona}, index=[0]
        )

        # Run the evaluation
        eval_df = llm_classify(
            dataframe=df_in,
            template=CUSTOM_EVAL_TEMPLATE,
            model=OpenAIModel(model="gpt-4o"),
            rails=rails,
            provide_explanation=True,
        )

        # Output results for the experiment
        label = eval_df["label"][0]
        score = 1 if label.upper() == "A" else 0
        explanation = eval_df["explanation"][0]
        return EvaluationResult(score=score, label=label, explanation=explanation)
```
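As with the task, you can exercise the evaluator on its own before wiring it into an experiment. The output and row values below are hypothetical sample inputs.

```python
# Quick local check of the evaluator against a hand-written output
result = PromiseEvaluator().evaluate(
    output="Write one imperfect verse a day and share it with a friend for feedback.",
    dataset_row={
        "persona": "An aspiring musician who is writing their own songs",
        "problem": "I often get stuck overthinking my lyrics and melodies.",
    },
)
print(result)
```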
Run the experiment
```python
# Run the experiment
client.run_experiment(
    space_id="YOUR_SPACE_ID",
    dataset_id=dataset_id,
    task=task,
    evaluators=[PromiseEvaluator()],
    experiment_name="PM-experiment",
)
```
See the experiment result in Arize
Navigate to the dataset in the Arize UI to see the experiment output table.