Phoenix helps you run experiments over your AI and LLM applications to evaluate and iteratively improve their performance. This quickstart shows you how to get up and running quickly.
Setup
Launch Phoenix in a notebook. If you already have a Phoenix server running, skip this step.
import phoenix as px
px.launch_app()
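If you are instead connecting to a Phoenix server that is already running, a minimal sketch is to point the client at it before creating one; the PHOENIX_COLLECTOR_ENDPOINT variable and the http://localhost:6006 address below are assumptions, so adjust them for your deployment.
import os
# Assumed endpoint for an existing Phoenix server; replace with your deployment's URL.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://localhost:6006"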
Datasets
Upload a dataset.
import pandas as pd
import phoenix as px
df = pd.DataFrame(
    [
        {
            "question": "What is Paul Graham known for?",
            "answer": "Co-founding Y Combinator and writing on startups and technology.",
            "metadata": {"topic": "tech"},
        }
    ]
)
phoenix_client = px.Client()
dataset = phoenix_client.upload_dataset(
    dataframe=df,
    dataset_name="test-dataset",
    input_keys=["question"],
    output_keys=["answer"],
    metadata_keys=["metadata"],
)
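If the dataset was uploaded in an earlier session, you can fetch it by name instead of re-uploading it; this sketch assumes your Phoenix client version exposes a get_dataset helper.
# Assumes the client provides get_dataset; check your Phoenix version.
dataset = phoenix_client.get_dataset(name="test-dataset")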
Tasks
Create a task to evaluate.
from openai import OpenAI
from phoenix.experiments.types import Example
openai_client = OpenAI()
task_prompt_template = "Answer in a few words: {question}"
def task(example: Example) -> str:
    question = example.input["question"]
    message_content = task_prompt_template.format(question=question)
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    return response.choices[0].message.content
Evaluators
Use pre-built evaluators to grade task output with code...
from phoenix.experiments.evaluators import ContainsAnyKeyword
contains_keyword = ContainsAnyKeyword(keywords=["Y Combinator", "YC"])
or LLMs.
from phoenix.experiments.evaluators import ConcisenessEvaluator
from phoenix.evals.models import OpenAIModel
model = OpenAIModel(model="gpt-4o")
conciseness = ConcisenessEvaluator(model=model)
You can also define your own custom evaluators, for example an LLM-judged accuracy evaluator.
from phoenix.experiments.evaluators import create_evaluator
from typing import Any, Dict
eval_prompt_template = """
Given the QUESTION and REFERENCE_ANSWER, determine whether the ANSWER is accurate.
Output only a single word (accurate or inaccurate).
QUESTION: {question}
REFERENCE_ANSWER: {reference_answer}
ANSWER: {answer}
ACCURACY (accurate / inaccurate):
"""
@create_evaluator(kind="llm") # need the decorator or the kind will default to "code"
def accuracy(input: Dict[str, Any], output: str, expected: Dict[str, Any]) -> float:
    message_content = eval_prompt_template.format(
        question=input["question"], reference_answer=expected["answer"], answer=output
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    response_message_content = response.choices[0].message.content.lower().strip()
    return 1.0 if response_message_content == "accurate" else 0.0
Experiments
Run an experiment with your task and evaluators, then attach additional evaluators after the fact.
from phoenix.experiments import run_experiment, evaluate_experiment
experiment = run_experiment(dataset, task, evaluators=[contains_keyword, conciseness])
experiment = evaluate_experiment(experiment, evaluators=[accuracy])
And iterate 🚀
Dry Run
Sometimes you may want to do a quick sanity check on the task function or the evaluators before unleashing them on the full dataset. Both run_experiment() and evaluate_experiment() accept a dry_run= parameter for this purpose: it executes the task and evaluators on a small subset of the dataset without sending any data to the Phoenix server. Setting dry_run=True selects one sample from the dataset, and setting it to a number, e.g. dry_run=3, selects multiple. The sampling is also deterministic, so you can re-run it repeatedly while debugging.
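For example, a minimal sketch of a dry run over three samples, built on the run_experiment call from above (nothing is recorded in Phoenix):
experiment = run_experiment(
    dataset,
    task,
    evaluators=[contains_keyword, conciseness],
    dry_run=3,  # execute the task and evaluators on 3 deterministic samples only
)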