Concepts: Annotations

Annotation Types

Depending on what you want to do with your annotations, you may want to configure a rubric for what each annotation represents, e.g. whether it is a category, a number within a range (continuous), or freeform text.

  • Annotation type:
    - Categorical: predefined labels for selection (e.g. 👍 or 👎)
    - Continuous: a score across a specified range (e.g. a confidence score from 0 to 100)
    - Freeform: open-ended text comments (e.g. "correct")

  • Optimization direction, based on your goal:
    - Maximize: higher scores are better (e.g. confidence)
    - Minimize: lower scores are better (e.g. hallucinations)
    - None: direction optimization does not apply (e.g. tone)

Different types of annotations change the way human annotators provide feedback.

See Annotating in the UI for more details.

Annotation Targets

Phoenix supports several annotation targets so you can capture feedback at different levels of LLM application performance. The core targets include:

  • Span Annotations: Applied to individual spans within a trace, providing granular feedback about specific components

  • Document Annotations: Specifically for retrieval systems, evaluating individual documents with metrics like relevance and precision

Each annotation can include:

  • Labels: Text-based classifications (e.g., "helpful" or "not helpful")

  • Scores: Numeric evaluations (e.g., 0-1 scale for relevance)

  • Explanations: Detailed justifications for the annotation

These annotations can come from different sources:

  • Human feedback (e.g., thumbs up/down from end-users)

  • LLM-as-a-judge evaluations (automated assessments)

  • Code-based evaluations (programmatic metrics)

Phoenix also supports specialized evaluation metrics for retrieval systems, including NDCG, Precision@K, and Hit Rate, making it particularly useful for evaluating search and retrieval components of LLM applications.
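
For example, document-level relevance labels produced by an eval can be logged as document annotations. The following is a minimal sketch, assuming a pandas dataframe keyed by span id and document position; the exact dataframe schema expected by the client is described in Log Evaluation Results.

```python
import pandas as pd
import phoenix as px
from phoenix.trace import DocumentEvaluations

# Hypothetical per-document relevance results: one row per (span, document) pair.
document_relevance_df = pd.DataFrame(
    {
        "span_id": ["67f6740bbe1ddc3f", "67f6740bbe1ddc3f"],  # hypothetical span id
        "document_position": [0, 1],
        "label": ["relevant", "irrelevant"],
        "score": [1, 0],
    }
).set_index(["span_id", "document_position"])

# Attach the scores to the retriever span's documents; Phoenix can then compute
# retrieval metrics such as NDCG, Precision@K, and Hit Rate from them.
px.Client().log_evaluations(
    DocumentEvaluations(eval_name="relevance", dataframe=document_relevance_df),
)
```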

Feedback from End-users

Human feedback allows you to understand how your users are experiencing your application and helps draw attention to problematic traces. Phoenix makes it easy to collect feedback for traces and view it in the context of the trace, as well as filter all your traces based on the feedback annotations you send. Before anything else, you want to know if your users or customers are happy with your product. This can be as straightforward as adding 👍 / 👎 buttons to your application and logging the result as annotations.

For more information on how to wire up your application to collect feedback from your users, see Annotating via the Client.
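
As an illustration, a thumbs-up from an end-user might be logged as a span annotation with the Phoenix Python client. This is a sketch following the pattern in Annotating via the Client; method and parameter names can vary by client version, and the span id shown is hypothetical.

```python
from phoenix.client import Client

client = Client()  # assumes the Phoenix endpoint / API key are set via environment variables

# Record a 👍 from an end-user against the span that produced the response.
# In practice, capture the span id at request time from the active span.
client.annotations.add_span_annotation(
    span_id="67f6740bbe1ddc3f",  # hypothetical span id
    annotation_name="user feedback",
    annotator_kind="HUMAN",
    label="👍",
    score=1,
)
```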

Evaluations from LLMs

When you have large amounts of data, it can be immensely efficient and valuable to leverage LLM judges via evals to produce labels and scores to annotate your traces with. Phoenix's evals library, as well as third-party eval libraries, can be used to annotate your spans with evaluations. For details see:

  • Quickstart: Evals to generate evaluation results

  • Log Evaluation Results to add evaluation results to spans
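
As a rough sketch of that flow, assuming the phoenix.evals hallucination template and helpers shown in Quickstart: Evals (names and required dataframe columns may differ by version):

```python
import phoenix as px
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)
from phoenix.session.evaluation import get_qa_with_reference
from phoenix.trace import SpanEvaluations

# Pull question/answer/reference data for traced LLM spans (indexed by span id).
queries_df = get_qa_with_reference(px.Client())

# Have an LLM judge label each span, optionally with an explanation.
eval_df = llm_classify(
    dataframe=queries_df,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

# Attach the results to the traced spans as annotations.
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=eval_df),
)
```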

Human Annotations

Sometimes you need to rely on human annotators to attach feedback to specific traces of your application. Human annotations through the UI can be thought of as manual quality assurance. While it can be more labor-intensive, it helps in sharing insights within a team, curating datasets of good/bad examples, and even training an LLM judge.

How to Use Annotations

Annotations can help you share valuable insight about how your application is performing. However, making these insights actionable can be difficult. With Phoenix, the annotations you add to your trace data are propagated to datasets, so you can use them during experimentation.

Track Improvements during Experimentation

Since Phoenix datasets preserve the annotations, you can track whether or not changes to your application (e.g. experimentation) produce better results (e.g. better scores / labels). Phoenix evaluators have access to the example metadata at evaluation time, making it possible to track improvements / regressions over previous generations (e.g. the previous annotations).

A span's attributes as well as annotations are propagated to example metadata.

Train an LLM Judge

AI development currently faces challenges when evaluating LLM application outputs at scale:

  • Human annotation is precise but time-consuming and impossible to scale efficiently.

  • Existing automated methods using LLM judges require careful prompt engineering and often fall short of capturing human evaluation nuances.

  • Solutions requiring extensive human resources are difficult to scale and manage.

These challenges create a bottleneck in the rapid development and improvement of high-quality LLM applications.

Since Phoenix datasets preserve the annotations in the example metadata, you can use datasets to build human-preference calibrated judges using libraries and tools such as DSPy and Zenbase.

Annotator Kind

Phoenix supports three types of annotators: Human, LLM, and Code.

| Annotator Kind | Source | Purpose | Strengths | Use Case |
| --- | --- | --- | --- | --- |
| Human | Manual review | Expert judgment and quality assurance | High accuracy, nuanced understanding | Manual QA, edge cases, subjective evaluation |
| LLM | Language model output | Scalable evaluation of application responses | Fast, scalable, consistent across examples | Large-scale output scoring, pattern review |
| Code | Programmatic evaluators | Automated assessment based on rules/metrics | Objective, repeatable, useful in experiments | Model benchmarking, regression testing |
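
For instance, a Code annotator can be a plain, deterministic Python function used as an experiment evaluator. The sketch below assumes the phoenix.experiments API and a hypothetical dataset named "my-dataset"; see Run Experiments and Using Evaluators for the exact interfaces.

```python
import phoenix as px
from phoenix.experiments import run_experiment

def task(example):
    # Call your application here; this stub just echoes the question back.
    return example.input["question"]

# A code evaluator: objective and repeatable, with no LLM involved.
def matches_expected(output, expected) -> float:
    """Score 1.0 when the task output exactly matches the expected answer."""
    return float(str(output).strip() == str(expected.get("answer", "")).strip())

dataset = px.Client().get_dataset(name="my-dataset")  # hypothetical dataset name
run_experiment(dataset, task, evaluators=[matches_expected])
```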

Annotation Source

Phoenix provides two interfaces for annotations: API and APP. The API interface via the REST clients enables automated feedback collection at scale, such as collecting thumbs up/down from end-users in production, providing real-time insights into LLM system performance. The APP interface via the UI offers an efficient workflow for human annotators with hotkey support and structured configurations, making it practical to create high-quality training sets for LLMs.

The combination of these interfaces creates a powerful feedback loop: human annotations through the APP interface help train and calibrate LLM evaluators, which can then be deployed at scale via the API. This cycle of human oversight and automated evaluation helps identify the most valuable examples for review while maintaining quality at scale.

Annotation Configuration

Annotation configurations in Phoenix are designed to maximize efficiency for human annotators. The system allows you to define the structure of annotations (categorical or continuous values, with appropriate bounds and options) and pair these with keyboard shortcuts (hotkeys) to enable rapid annotation.

For example, a categorical annotation might be configured with specific labels that can be quickly assigned using number keys, while a continuous annotation might use arrow keys for fine-grained scoring. This combination of structured configurations and hotkey support allows annotators to provide feedback quickly, significantly reducing the effort required for manual annotation tasks.

The primary goal is to streamline the annotation workflow, enabling human annotators to process large volumes of data efficiently while maintaining quality and consistency in their feedback.
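
As a purely illustrative sketch (not the actual Phoenix schema), a categorical and a continuous annotation configuration capture roughly the following fields; the real configurations are created in the Phoenix UI.

```python
# Hypothetical shapes only, to show what a rubric captures; field names are
# illustrative and the real configuration is created in the Phoenix UI.
quality_config = {
    "name": "response quality",
    "type": "CATEGORICAL",
    "values": ["good", "bad"],          # labels assignable via number-key hotkeys
    "optimization_direction": "MAXIMIZE",
}

confidence_config = {
    "name": "confidence",
    "type": "CONTINUOUS",
    "lower_bound": 0,
    "upper_bound": 100,                 # fine-grained scoring, e.g. with arrow keys
    "optimization_direction": "MAXIMIZE",
}
```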
