Evaluators

Evaluation and benchmarking are crucial concepts in LLM development. To improve the performance of an LLM app (RAG, agents), you must have a way to measure it.

Phoenix offers key modules to measure the quality of generated results as well as modules to measure retrieval quality.

  • Response Evaluation: Does the response match the retrieved context? Does it also match the query?

  • Retrieval Evaluation: Are the retrieved sources relevant to the query?

Response Evaluation

Evaluation of generated results can be challenging. Unlike traditional ML, the predicted results are not numeric or categorical, making it hard to define quantitative metrics for this problem.

Phoenix offers LLM Evaluations, a module designed to measure the quality of results. This module uses a "gold" LLM (e.g. GPT-4) to decide whether the generated answer is correct in a variety of ways. Note that many of these evaluation criteria DO NOT require ground-truth labels. Evaluation can be done simply with a combination of the input (query), output (response), and context.

LLM Evals supports the following response evaluation criteria:

  • QA Correctness - Whether a question was correctly answered by the system based on the retrieved data. In contrast to retrieval Evals, which check the chunks of data returned, this is a system-level check of a correct Q&A.

  • Hallucinations - Designed to detect LLM hallucinations relative to the retrieved context.

  • Toxicity - Identifies whether the AI response is racist, biased, or toxic.

Response evaluations are a critical first step in figuring out whether your LLM app is running correctly. They can pinpoint specific executions (a.k.a. traces) that are performing badly, and can be aggregated so that you can track how your application is running as a whole.

Evaluations can be aggregated across executions to be used as KPIs
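
As a concrete illustration, here is a minimal sketch of a response evaluation (a hallucination check) run with Phoenix's LLM Evals over question/answer/context data. The example rows and the judge model name are illustrative; the snippet assumes the arize-phoenix-evals package and an OpenAI API key are available.

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Input (query), output (response), and retrieved context -- no ground-truth labels needed.
df = pd.DataFrame(
    {
        "input": ["Who founded Y Combinator?"],
        "output": ["Y Combinator was founded by Paul Graham and Jessica Livingston."],
        "reference": ["Y Combinator was founded in March 2005 by Paul Graham, Jessica Livingston, ..."],
    }
)

evals_df = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),                     # the "gold" judge model (illustrative)
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),   # constrain labels to "factual" / "hallucinated"
    provide_explanation=True,                              # adds an explanation column
)
print(evals_df[["label", "explanation"]])
```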

Retrieval Evaluation

Phoenix also provides evaluation of retrieval independently.

The concept of retrieval evaluation is not new; given a set of relevance scores for a set of retrieved documents, we can evaluate retrievers using retrieval metrics like precision, NDCG, hit rate and more.

LLM Evals supports the following retrieval evaluation criteria:

  • Relevance - Evaluates whether a retrieved document chunk contains an answer to the query.

Retrieval is possibly the most important step in any LLM application, as poor or incorrect retrieval can be the cause of bad response generation. If your application uses RAG to power an LLM, retrieval evals can help you identify the cause of hallucinations and incorrect answers.
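
Here is a sketch of what a document-level relevance eval can look like, assuming a running Phoenix instance that has collected retriever spans; the helper, evaluator, and model names are illustrative of this pattern rather than the only way to do it.

```python
import phoenix as px
from phoenix.evals import OpenAIModel, RelevanceEvaluator, run_evals
from phoenix.session.evaluation import get_retrieved_documents
from phoenix.trace import DocumentEvaluations

client = px.Client()

# Materialize retrieved document chunks (one row per query/document pair).
retrieved_documents_df = get_retrieved_documents(client)

# Judge each chunk's relevance to its query with an LLM evaluator.
relevance_eval_df = run_evals(
    dataframe=retrieved_documents_df,
    evaluators=[RelevanceEvaluator(OpenAIModel(model="gpt-4o"))],
    provide_explanation=True,
)[0]

# Attach the document-level evaluations back to the traces in Phoenix.
client.log_evaluations(
    DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df)
)
```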

Evaluations

With Phoenix's LLM Evals, evaluation results (or just Evaluations for short) are data consisting of three main columns:

  • label: str [optional] - a classification label for the evaluation (e.g. "hallucinated" vs "factual"). Labels can be used to calculate percentages (e.g. percent hallucinated) and to filter down your data (e.g. Evals["Hallucinations"].label == "hallucinated").

  • score: number [optional] - a numeric score for the evaluation (e.g. 1 for good, 0 for bad). Scores are a great way to sort your data to surface poorly performing examples, and can be used to filter your data by a threshold.

  • explanation: str [optional] - the reasoning for why the evaluation label or score was given. In the case of LLM evals, this is the evaluation model's reasoning. While explanations are optional, they can be extremely useful when trying to understand problematic areas of your application.

Let's take a look at an example list of Q&A relevance evaluations:

label     | explanation                                        | score
----------|----------------------------------------------------|------
correct   | The reference text explains that YC was not or...  | 1
correct   | To determine if the answer is correct, we need...   | 1
incorrect | To determine if the answer is correct, we must...   | 0
correct   | To determine if the answer is correct, we need...   | 1

These three columns combined can drive any type of evaluation you can imagine. label provides a way to classify responses, score provides a way to assign a numeric assessment, and explanation gives you a way to get qualitative feedback.
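
For example, with evaluations in a pandas DataFrame (say evals_df, with the columns above), the label and score columns aggregate directly into the kinds of KPIs and filters described here. This is a plain pandas sketch, not a Phoenix API:

```python
# evals_df is assumed to have the label / score / explanation columns described above.
percent_hallucinated = (evals_df["label"] == "hallucinated").mean() * 100
mean_score = evals_df["score"].mean()

# Surface the worst-performing examples, with the judge's reasoning, for review.
worst_examples = evals_df.sort_values("score").head(10)[["label", "score", "explanation"]]
```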

Evaluating Traces

With Phoenix, evaluations can be "attached" to the spans and documents collected. To facilitate this, Phoenix supports the following steps:

  1. Querying and downloading data - query the spans collected by Phoenix and materialize them into DataFrames to be used for evaluation (e.g. question and answer data, documents data).

  2. Running Evaluations - the data queried in step 1 can be fed into LLM Evals to produce evaluation results.

  3. Logging Evaluations - the evaluations produced in step 2 can be logged back to Phoenix to be attached to spans and documents for evaluating responses and retrieval. See Log Evaluation Results for how to log evaluations to Phoenix.

  4. Sorting and Filtering by Evaluation - once the evaluations have been logged back to Phoenix, the spans become instantly sortable and filterable by the evaluation values you attached. (An example of an evaluation filter would be Eval["hallucination"].label == "hallucinated".)

By following the above steps, you will have a full end-to-end flow for troubleshooting, evaluating, and root-causing an LLM application. By using LLM Evals in conjunction with Traces, you will be able to surface problematic queries, get an explanation as to why the generation is problematic (e.g. hallucinated because ...), and identify which step of your generative app requires improvement (e.g. did the LLM hallucinate, or was the LLM fed bad context?).
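
A compressed sketch of that end-to-end flow, assuming a running Phoenix instance that has already collected traces; the helper, evaluator, eval names, and model name are illustrative:

```python
import phoenix as px
from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals
from phoenix.session.evaluation import get_qa_with_reference
from phoenix.trace import SpanEvaluations

client = px.Client()

# 1. Query and download data: question/answer/context spans as a DataFrame.
qa_df = get_qa_with_reference(client)

# 2. Run evaluations over the queried data.
model = OpenAIModel(model="gpt-4o")
hallucination_df, qa_correctness_df = run_evals(
    dataframe=qa_df,
    evaluators=[HallucinationEvaluator(model), QAEvaluator(model)],
    provide_explanation=True,
)

# 3. Log the evaluations back to Phoenix, attached to the original spans.
client.log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_df),
)

# 4. In the Phoenix UI, spans are now sortable and filterable by these evals,
#    e.g. evals["Hallucination"].label == "hallucinated"
```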

For a full tutorial on LLM Ops, check out our tutorial notebook on Google Colaboratory.
