Phoenix helps you run experiments over your AI and LLM applications to evaluate and iteratively improve their performance. This quickstart shows you how to get up and running quickly.
Setup
Launch Phoenix in a notebook. If you already have a Phoenix server running, skip this step.
import phoenix as px
px.launch_app()
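If you are connecting to a Phoenix server that is already running, you can skip launch_app() and point the client at that server instead. The endpoint below is only an illustrative sketch assuming the default local port.
import phoenix as px

# Example only: connect to an already-running Phoenix server instead of
# launching one in the notebook (default local endpoint assumed here).
phoenix_client = px.Client(endpoint="http://localhost:6006")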
Datasets
Upload a dataset.
import pandas as pd
import phoenix as px
df = pd.DataFrame(
    [
        {
            "question": "What is Paul Graham known for?",
            "answer": "Co-founding Y Combinator and writing on startups and technology.",
            "metadata": {"topic": "tech"},
        }
    ]
)

phoenix_client = px.Client()
dataset = phoenix_client.upload_dataset(
    dataframe=df,
    dataset_name="test-dataset",
    input_keys=["question"],
    output_keys=["answer"],
    metadata_keys=["metadata"],
)
Tasks
Create a task to evaluate.
from openai import OpenAI
from phoenix.experiments.types import Example
openai_client = OpenAI()
task_prompt_template = "Answer in a few words: {question}"
def task(example: Example) -> str:
    question = example.input["question"]
    message_content = task_prompt_template.format(question=question)
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    return response.choices[0].message.content
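Before wiring the task into an experiment, you may want to exercise it by hand. The stand-in object below is purely illustrative: it only mimics the .input attribute that task() reads from a Phoenix Example.
from types import SimpleNamespace

# Hypothetical stand-in for a Phoenix Example, carrying just the .input mapping
# that task() reads; handy for a quick manual check of the prompt and model call.
sample = SimpleNamespace(input={"question": "What is Paul Graham known for?"})
print(task(sample))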
Evaluators
Use pre-built evaluators to grade task output with code...
from phoenix.experiments.evaluators import ContainsAnyKeyword
contains_keyword = ContainsAnyKeyword(keywords=["Y Combinator", "YC"])
or LLMs.
from phoenix.experiments.evaluators import ConcisenessEvaluator
from phoenix.evals.models import OpenAIModel
model = OpenAIModel(model="gpt-4o")
conciseness = ConcisenessEvaluator(model=model)
You can also define your own evaluators as plain Python functions and annotate them with the create_evaluator decorator.
from typing import Any, Dict

from phoenix.experiments.evaluators import create_evaluator
eval_prompt_template = """
Given the QUESTION and REFERENCE_ANSWER, determine whether the ANSWER is accurate.
Output only a single word (accurate or inaccurate).
QUESTION: {question}
REFERENCE_ANSWER: {reference_answer}
ANSWER: {answer}
ACCURACY (accurate / inaccurate):
"""
@create_evaluator(kind="llm") # need the decorator or the kind will default to "code"
def accuracy(input: Dict[str, Any], output: str, expected: Dict[str, Any]) -> float:
    message_content = eval_prompt_template.format(
        question=input["question"], reference_answer=expected["answer"], answer=output
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    response_message_content = response.choices[0].message.content.lower().strip()
    return 1.0 if response_message_content == "accurate" else 0.0
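With a dataset, a task, and evaluators in hand, you can run the experiment. The snippet below is a minimal sketch using run_experiment from phoenix.experiments; the experiment name is an arbitrary example.
from phoenix.experiments import run_experiment

# Run the task against every example in the dataset and score the results
# with the custom accuracy evaluator defined above.
experiment = run_experiment(
    dataset,
    task,
    experiment_name="initial-experiment",  # example name, choose your own
    evaluators=[accuracy],
)
The returned experiment object can then be scored with additional evaluators after the fact: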
from phoenix.experiments import evaluate_experiment
experiment = evaluate_experiment(experiment, evaluators=[contains_keyword, conciseness])
And iterate 🚀
Dry Run
Sometimes you may want a quick sanity check of the task function or the evaluators before unleashing them on the full dataset. Both run_experiment() and evaluate_experiment() accept a dry_run parameter for this purpose: it executes the task and evaluators on a small subset of the dataset without sending any data to the Phoenix server. Setting dry_run=True selects a single sample, and setting it to a number, e.g. dry_run=3, selects that many. The sampling is deterministic, so you can re-run the same subset while debugging.
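A quick sketch of what that might look like, reusing the dataset, task, and accuracy evaluator defined above:
from phoenix.experiments import run_experiment

# Dry run: execute the task and evaluators on 3 deterministically sampled
# examples without reporting anything to the Phoenix server.
debug_experiment = run_experiment(
    dataset,
    task,
    evaluators=[accuracy],
    dry_run=3,
)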