
Phoenix Evals

Evaluate your LLM application with Phoenix
This quickstart shows how Phoenix helps you evaluate data from your LLM application.
You will:
  • Export a dataframe from your Phoenix session that contains traces from an instrumented LLM application,
  • Evaluate your trace data for:
    • Relevance: Are the retrieved documents relevant to the query?
    • Q&A correctness: Does your application's response correctly answer the query, given the retrieved context?
    • Hallucinations: Is your application making up false information?
  • Ingest the evaluations into Phoenix to see the results annotated on the corresponding spans and traces.
Let's get started!
First, install Phoenix with pip install arize-phoenix.
To get you up and running quickly, we'll download some pre-existing trace data collected from a LlamaIndex application (in practice, this data would be collected by instrumenting your LLM application with an OpenInference-compatible tracer).
from urllib.request import urlopen
from phoenix.trace.trace_dataset import TraceDataset
from phoenix.trace.utils import json_lines_to_df
traces_url = ""
with urlopen(traces_url) as response:
    lines = [line.decode("utf-8") for line in response.readlines()]
trace_ds = TraceDataset(json_lines_to_df(lines))
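For intuition about the file format: the snippet above treats the download as JSON Lines, that is, one JSON object per line. The following is a minimal stdlib sketch of decoding such text into records. The field names in the sample (`name`, `span_kind`) are illustrative assumptions rather than the actual Phoenix trace schema, and `parse_json_lines` is a hypothetical helper, not how `json_lines_to_df` is implemented.

```python
import json
from typing import Any, Dict, Iterable, List


def parse_json_lines(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Decode JSON Lines text: one JSON object per non-blank line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines, e.g. a trailing newline
            continue
        records.append(json.loads(line))
    return records


# Hypothetical sample; real trace files carry a much richer schema.
sample_lines = [
    '{"name": "query", "span_kind": "CHAIN"}\n',
    "\n",
    '{"name": "retrieve", "span_kind": "RETRIEVER"}\n',
]
records = parse_json_lines(sample_lines)
```

Each decoded record is a plain dictionary, which is the kind of row-oriented data that can then be loaded into a dataframe.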
Launch Phoenix. You can use Phoenix within your notebook or in a separate browser window by opening the URL.
import phoenix as px
session = px.launch_app(trace=trace_ds)
You should now see a view like this.
A view of the Phoenix UI prior to adding evaluation annotations
Export your retrieved documents and query data from your session into a pandas dataframe.
If you are interested in a niche subset of your data, you can export with a custom query. Learn more here.
from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents
retrieved_documents_df = get_retrieved_documents(px.Client())
queries_df = get_qa_with_reference(px.Client())
Phoenix evaluates your application data by prompting an LLM to classify whether a retrieved document is relevant or irrelevant to the corresponding query, whether a response is grounded in a retrieved document, etc. You can even get explanations generated by the LLM to help you understand the results of your evaluations!
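To make the idea of LLM-based classification concrete, here is a rough sketch of the pattern: fill a prompt template, call a model, and map the free-form output onto a fixed set of allowed labels. Everything in it is an assumption for illustration: the template wording, the `snap_to_rail` and `classify` helpers, and the `fake_llm` stand-in are hypothetical and are not Phoenix's actual templates or implementation.

```python
from typing import Callable, List

# Hypothetical template; Phoenix's built-in templates are worded differently.
RELEVANCE_TEMPLATE = (
    "You are comparing a reference text to a question.\n"
    "Question: {query}\n"
    "Reference: {reference}\n"
    "Answer with a single word, either 'relevant' or 'irrelevant'."
)
RAILS = ["relevant", "irrelevant"]


def snap_to_rail(raw_output: str, rails: List[str]) -> str:
    """Map free-form LLM output onto one allowed label, or 'unparseable'."""
    text = raw_output.strip().lower()
    # Check longer labels first so 'irrelevant' is not matched as 'relevant'.
    for rail in sorted(rails, key=len, reverse=True):
        if rail in text:
            return rail
    return "unparseable"


def classify(query: str, reference: str, llm: Callable[[str], str]) -> str:
    """Build the prompt, call the model, and return a constrained label."""
    prompt = RELEVANCE_TEMPLATE.format(query=query, reference=reference)
    return snap_to_rail(llm(prompt), RAILS)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; keys off a word in the prompt."""
    return "Irrelevant." if "weather" in prompt else "Relevant"


on_topic = classify("How do I launch Phoenix?", "Call px.launch_app().", fake_llm)
off_topic = classify("How do I launch Phoenix?", "The weather is sunny.", fake_llm)
```

Constraining the output to a small label set is what makes the results easy to aggregate later, since every row ends up with one of a few known values.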
This quickstart uses OpenAI and requires an OpenAI API key, but we support a wide variety of APIs and models.
Install the OpenAI SDK with pip install openai and instantiate your model.
from phoenix.experimental.evals import OpenAIModel
api_key = None # set your api key here or with the OPENAI_API_KEY environment variable
eval_model = OpenAIModel(model_name="gpt-4-turbo-preview", api_key=api_key)
You'll next define your evaluators. Evaluators are built on top of language models and prompt the LLM to assess the quality of responses, the relevance of retrieved documents, etc., and provide a quality signal even in the absence of human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use to perform evaluations using our battle-tested evaluation templates.
A diagram depicting how evaluators are composed of LLMs and evaluation prompt templates and produce labels, scores, and explanations from input data (e.g., queries, references, outputs, etc.)
from phoenix.experimental.evals import (
    HallucinationEvaluator,
    QAEvaluator,
    RelevanceEvaluator,
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)
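As a mental model of how an evaluator combines a language model with a prompt template to yield a label, a score, and an explanation, consider the sketch below. It is not the Phoenix implementation: `SketchEvaluator`, `EvalResult`, the template text, the assumed `LABEL:` output convention, and the label-to-score mapping are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class EvalResult:
    label: str
    score: Optional[float]
    explanation: str


@dataclass
class SketchEvaluator:
    """Hypothetical evaluator: an LLM callable plus a prompt template."""

    llm: Callable[[str], str]
    template: str
    scores: Dict[str, float]  # maps each allowed label to a numeric score

    def evaluate(self, record: Dict[str, str]) -> EvalResult:
        raw = self.llm(self.template.format(**record))
        # Assumed output convention: "<explanation> LABEL: <label>"
        explanation, _, label = raw.rpartition("LABEL:")
        label = label.strip().lower()
        # Labels outside the mapping get no score rather than a guessed one.
        return EvalResult(label, self.scores.get(label), explanation.strip())


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "The answer restates a fact found in the reference. LABEL: factual"


evaluator = SketchEvaluator(
    llm=fake_llm,
    template="Reference: {reference}\nAnswer: {output}\nIs the answer factual or hallucinated?",
    scores={"factual": 1.0, "hallucinated": 0.0},
)
result = evaluator.evaluate(
    {"reference": "Phoenix is open source.", "output": "It is open source."}
)
```

Swapping the `llm` callable or the template changes the evaluation without touching the parsing logic, which mirrors the composition shown in the diagram.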
Run your evaluations.
import nest_asyncio
from phoenix.experimental.evals import run_evals
nest_asyncio.apply() # needed for concurrency in notebook environments
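hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
# run_evals returns one dataframe per evaluator, so take the single result
relevance_eval_df = run_evals(
    dataframe=retrieved_documents_df, evaluators=[relevance_evaluator], provide_explanation=True
)[0]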
Log your evaluations to your running Phoenix session.
from phoenix.trace import DocumentEvaluations, SpanEvaluations
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df),
    DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df),
)
Your evaluations should now appear as annotations on your spans in Phoenix!
print(f"🔥🐦 Open back up Phoenix in case you closed it: {session.url}")
You can view aggregate evaluation statistics, surface problematic spans, understand the LLM's reason for each evaluation by reading the corresponding explanation, and pinpoint the cause (irrelevant retrievals, incorrect parameterization of your LLM, etc.) of your LLM application's poor responses.
A view of the Phoenix UI with evaluation annotations
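The same kind of aggregate view can be approximated outside the UI. The sketch below summarizes evaluation results and surfaces low-scoring spans using only the standard library. The rows are made-up sample data, and the field names (`span_id`, `label`, `score`, `explanation`) are assumptions about the shape of an evaluation result rather than a documented schema; `summarize` and `problematic` are hypothetical helpers.

```python
from collections import Counter
from typing import Any, Dict, List

# Hypothetical rows standing in for an evaluation dataframe.
eval_rows: List[Dict[str, Any]] = [
    {"span_id": "a1", "label": "factual", "score": 1.0,
     "explanation": "Matches the reference."},
    {"span_id": "a2", "label": "hallucinated", "score": 0.0,
     "explanation": "Cites a feature absent from the reference."},
    {"span_id": "a3", "label": "factual", "score": 1.0,
     "explanation": "Directly supported by the context."},
]


def summarize(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count labels and average the scores that are present."""
    label_counts = Counter(row["label"] for row in rows)
    scores = [row["score"] for row in rows if row["score"] is not None]
    mean_score = sum(scores) / len(scores) if scores else None
    return {"label_counts": dict(label_counts), "mean_score": mean_score}


def problematic(rows: List[Dict[str, Any]], threshold: float = 0.5) -> List[Dict[str, Any]]:
    """Return rows whose score falls below the threshold."""
    return [r for r in rows if r["score"] is not None and r["score"] < threshold]


summary = summarize(eval_rows)
flagged = problematic(eval_rows)
for row in flagged:
    # Reading the explanation is the first step in diagnosing a bad span.
    print(f"{row['span_id']}: {row['label']} ({row['explanation']})")
```

Filtering to the flagged rows and reading their explanations is a quick way to decide whether the underlying problem lies in retrieval or in generation.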