Phoenix helps you run experiments over your AI and LLM applications to evaluate and iteratively improve their performance. This quickstart shows you how to get up and running quickly.
Setup
Launch Phoenix in a notebook. If you already have a Phoenix server running, skip this step.
import phoenix as px

px.launch_app()
Datasets
Upload a dataset.
import pandas as pd
import phoenix as px

df = pd.DataFrame(
    [
        {
            "question": "What is Paul Graham known for?",
            "answer": "Co-founding Y Combinator and writing on startups and technology.",
            "metadata": {"topic": "tech"},
        }
    ]
)

phoenix_client = px.Client()

dataset = phoenix_client.upload_dataset(
    dataframe=df,
    dataset_name="test-dataset",
    input_keys=["question"],
    output_keys=["answer"],
    metadata_keys=["metadata"],
)
Tasks
Create a task to evaluate.
from openai import OpenAI
from phoenix.experiments.types import Example

openai_client = OpenAI()

task_prompt_template = "Answer in a few words: {question}"


def task(example: Example) -> str:
    question = example.input["question"]
    message_content = task_prompt_template.format(question=question)
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    return response.choices[0].message.content
Evaluators
Use pre-built evaluators to grade task output with code...
from phoenix.experiments.evaluators import ContainsAnyKeyword

contains_keyword = ContainsAnyKeyword(keywords=["Y Combinator", "YC"])
or LLMs.
from phoenix.experiments.evaluators import ConcisenessEvaluator
from phoenix.evals.models import OpenAIModel

model = OpenAIModel(model="gpt-4o")
conciseness = ConcisenessEvaluator(model=model)
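You can also write your own evaluators. The experiment below references a jaccard_similarity evaluator; a minimal sketch of such a custom code evaluator (assuming the expected output is stored under the "answer" key, as in the dataset above) might look like this:

from typing import Any, Dict


def jaccard_similarity(output: str, expected: Dict[str, Any]) -> float:
    # Jaccard index: size of the word intersection divided by the size of the union
    actual_words = set(output.lower().split(" "))
    expected_words = set(expected["answer"].lower().split(" "))
    words_in_common = actual_words.intersection(expected_words)
    all_words = actual_words.union(expected_words)
    return len(words_in_common) / len(all_words)

Custom evaluators can also call an LLM, as in the accuracy grader below.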
from typing import Any, Dict

from phoenix.experiments.evaluators import create_evaluator

eval_prompt_template = """
Given the QUESTION and REFERENCE_ANSWER, determine whether the ANSWER is accurate.
Output only a single word (accurate or inaccurate).

QUESTION: {question}

REFERENCE_ANSWER: {reference_answer}

ANSWER: {answer}

ACCURACY (accurate / inaccurate):
"""


@create_evaluator(kind="llm")  # need the decorator or the kind will default to "code"
def accuracy(input: Dict[str, Any], output: str, expected: Dict[str, Any]) -> float:
    message_content = eval_prompt_template.format(
        question=input["question"], reference_answer=expected["answer"], answer=output
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    response_message_content = response.choices[0].message.content.lower().strip()
    return 1.0 if response_message_content == "accurate" else 0.0
Experiments
Run an experiment and evaluate the results.
from phoenix.experiments import run_experiment

experiment = run_experiment(
    dataset,
    task,
    experiment_name="initial-experiment",
    evaluators=[jaccard_similarity, accuracy],
)
Run more evaluators after the fact.
from phoenix.experiments import evaluate_experiment

experiment = evaluate_experiment(experiment, evaluators=[contains_keyword, conciseness])
And iterate 🚀
Dry Run
Sometimes you may want to run a quick sanity check on the task function or the evaluators before unleashing them on the full dataset. Both run_experiment() and evaluate_experiment() accept a dry_run parameter for this purpose: it executes the task and evaluators on a small subset of the dataset without sending any data to the Phoenix server. Setting dry_run=True selects one sample, and setting it to a number, e.g. dry_run=3, selects that many. The sampling is deterministic, so you can re-run it repeatedly while debugging.
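For example, a minimal sketch using the task and evaluators defined above, running against three sampled examples without logging anything to Phoenix:

from phoenix.experiments import run_experiment

# Dry run: execute the task and the accuracy evaluator on 3 deterministically
# sampled examples; no results are sent to the Phoenix server.
run_experiment(
    dataset,
    task,
    evaluators=[accuracy],
    dry_run=3,
)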