User Guide

Phoenix is a comprehensive platform designed to enable observability across every layer of an LLM-based system, empowering teams to build, optimize, and maintain high-quality applications and agents efficiently.

🛠️ Develop

During the development phase, Phoenix offers essential tools for debugging, experimentation, evaluation, prompt tracking, and search and retrieval.

Traces for Debugging

Phoenix's tracing and span analysis capabilities are invaluable during the prototyping and debugging stages. By instrumenting application code with Phoenix, teams gain detailed insights into the execution flow, making it easier to identify and resolve issues. Developers can drill down into specific spans, analyze performance metrics, and access relevant logs and metadata to streamline debugging efforts.
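
As a minimal sketch (assuming a reachable Phoenix instance and the arize-phoenix-otel package), instrumentation can be as small as registering a tracer provider and enabling auto-instrumentation; the project name and custom span below are illustrative:

```python
# Minimal tracing setup sketch; assumes Phoenix is running locally
# (e.g. via `phoenix serve`) and arize-phoenix-otel is installed.
from phoenix.otel import register

# Register an OpenTelemetry tracer provider that exports spans to Phoenix.
# auto_instrument=True picks up installed OpenInference instrumentors
# (OpenAI, LangChain, etc.) so LLM calls are traced automatically.
tracer_provider = register(
    project_name="my-llm-app",  # hypothetical project name
    auto_instrument=True,
)

# Custom spans can be added around your own functions for finer-grained debugging.
tracer = tracer_provider.get_tracer(__name__)

with tracer.start_as_current_span("retrieve-documents") as span:
    span.set_attribute("query", "What is Phoenix?")
    # ... your retrieval logic here ...
```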

Experimentation

Leverage experiments to measure prompt and model performance. During this early stage, you'll typically focus on gathering a robust set of test cases and evaluation metrics to test initial iterations of your application. Experiments at this stage may resemble unit tests, since they're geared toward ensuring your application performs correctly. A minimal sketch follows the link below.

  • Run Experiments
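
As an illustrative sketch of a unit-test-style experiment (the dataset name, my_task, my_app, and the evaluator below are hypothetical), a task function is run over every example in a dataset and scored by one or more evaluators:

```python
# Sketch of an experiment; assumes arize-phoenix is installed and a dataset
# named "golden-questions" already exists in Phoenix.
import phoenix as px
from phoenix.experiments import run_experiment

dataset = px.Client().get_dataset(name="golden-questions")  # hypothetical dataset

def my_task(example):
    # Call your application (prompt + model + pipeline) on one test case.
    return my_app(example.input["question"])  # my_app is your own code

def exact_match(output, expected):
    # A simple, unit-test-like evaluator: 1.0 if the answer matches exactly.
    return 1.0 if output == expected["answer"] else 0.0

experiment = run_experiment(
    dataset,
    my_task,
    evaluators=[exact_match],
    experiment_name="baseline",
)
```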

Evaluation

Whether run as part of experiments or as a standalone feature, evaluations help you understand how your app is performing at a granular level. Typical evaluations include correctness evals compared against a ground-truth dataset, or LLM-as-a-judge evals that detect hallucinations or score the relevance of retrieved (RAG) context.
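
For example, here is a hedged sketch using the phoenix.evals package, where results_df stands in for your own dataframe of inputs, retrieved context, and outputs:

```python
# Sketch of a pre-built LLM-as-a-judge eval over a dataframe with
# "input", "reference", and "output" columns; assumes arize-phoenix-evals
# and an OpenAI API key are available.
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals

judge_model = OpenAIModel(model="gpt-4o")  # choice of judge model is an assumption
hallucination_evaluator = HallucinationEvaluator(judge_model)

hallucination_eval_df = run_evals(
    dataframe=results_df,                 # your dataframe of test results
    evaluators=[hallucination_evaluator],
    provide_explanation=True,             # ask the judge to explain each label
)[0]
```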

Prompt Engineering

Prompt engineering is critical to how a model behaves. While there are other methods for changing behavior, such as fine-tuning, prompt engineering is the simplest way to get started and often has the best ROI.

  • Overview: Prompts

Instrument prompt and prompt variable collection to associate iterations of your app with the performance measured through evals and experiments. Phoenix tracks prompt templates, variables, and versions during execution to help you identify improvements and degradations.

  • Instrument Prompt Templates and Prompt Variables
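
A hedged sketch of what this can look like with the OpenInference helpers; the template text, variables, and version tag below are placeholders:

```python
# Sketch: attach the prompt template, its variables, and a version to every
# span produced inside this block, so evals and experiments can be compared
# across prompt iterations. Assumes openinference-instrumentation is installed.
from openinference.instrumentation import using_prompt_template

prompt_template = (
    "Answer the question using only the context.\n\n"
    "Context: {context}\nQuestion: {question}"
)

with using_prompt_template(
    template=prompt_template,
    variables={"context": retrieved_context, "question": user_question},
    version="v2.1",  # hypothetical version tag
):
    # llm_call is a stand-in for however your app invokes the model.
    response = llm_call(
        prompt_template.format(context=retrieved_context, question=user_question)
    )
```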

Search & Retrieval Embeddings Visualizer

Phoenix's search and retrieval optimization tools include an embeddings visualizer that helps teams understand how their data is being represented and clustered. This visual insight can guide decisions on indexing strategies, similarity measures, and data organization to improve the relevance and efficiency of search results.

  • Quickstart: Inferences
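
A rough sketch of loading embeddings into the visualizer, assuming a pandas dataframe df with a text column and an embedding column (the column names here are illustrative):

```python
# Sketch: describe which columns hold raw text and embedding vectors,
# then launch Phoenix to explore how the data clusters.
import phoenix as px

schema = px.Schema(
    embedding_feature_column_names={
        "document_embedding": px.EmbeddingColumnNames(
            vector_column_name="text_embedding",  # illustrative column names
            raw_data_column_name="text",
        ),
    },
)

inferences = px.Inferences(dataframe=df, schema=schema, name="documents")
session = px.launch_app(primary=inferences)
```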

🧪 Testing/Staging

In the testing and staging environment, Phoenix supports comprehensive evaluation, benchmarking, and data curation. Traces, experimentation, prompt tracking, and the embeddings visualizer remain important in the testing and staging phase, helping teams identify and resolve issues before deployment.

Iterate via Experiments

With a stable set of test cases and evaluations defined, you can now easily iterate on your application and view performance changes in Phoenix right away. Swap out models, prompts, or pipeline logic, and run your experiment to immediately see the impact on performance.

  • Run Experiments

Evals Testing

Phoenix's flexible evaluation framework supports thorough testing of LLM outputs. Teams can define custom metrics, collect user feedback, and leverage separate LLMs for automated assessment. Phoenix offers tools for analyzing evaluation results, identifying trends, and tracking improvements over time.

  • Quickstart: Evals
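
Custom metrics can often be expressed as judge templates; here is a hedged sketch using llm_classify, where the template wording, rails, and dataframe columns are assumptions rather than a prescribed format:

```python
# Sketch of a custom LLM-as-a-judge metric; the judge reads "input" and
# "output" columns from the dataframe and must answer with one of the rails.
from phoenix.evals import OpenAIModel, llm_classify

TONE_TEMPLATE = """You are evaluating the tone of a support assistant's reply.
[Question]: {input}
[Reply]: {output}
Respond with a single word: "friendly" or "unfriendly"."""

tone_eval_df = llm_classify(
    dataframe=results_df,              # must contain "input" and "output" columns
    model=OpenAIModel(model="gpt-4o"),
    template=TONE_TEMPLATE,
    rails=["friendly", "unfriendly"],  # constrain the judge's answers
    provide_explanation=True,
)
```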

Curate Data

Phoenix assists in curating high-quality data for testing and fine-tuning. It provides tools for data exploration, cleaning, and labeling, enabling teams to curate representative data that covers a wide range of use cases and edge conditions.

  • Quickstart: Datasets & Experiments
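
A hedged sketch of turning curated examples into a reusable Phoenix dataset; the dataframe contents and names are placeholders:

```python
# Sketch: upload a curated set of examples as a named dataset so it can be
# reused for experiments and fine-tuning; assumes a running Phoenix instance.
import pandas as pd
import phoenix as px

curated_df = pd.DataFrame(
    {
        "question": ["What does Phoenix trace?", "How do I log evals?"],
        "answer": ["Spans from LLM applications.", "Via the Phoenix client."],
    }
)

dataset = px.Client().upload_dataset(
    dataset_name="curated-edge-cases",  # hypothetical dataset name
    dataframe=curated_df,
    input_keys=["question"],
    output_keys=["answer"],
)
```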

Guardrails

Add guardrails to your application to prevent malicious and erroneous inputs and outputs. Guardrails will be visualized in Phoenix, and can be attached to spans and traces in the same fashion as evaluation metrics.

  • Guardrails AI
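
As a sketch of how guardrail calls can surface in traces (assuming the Guardrails AI library and the openinference-instrumentation-guardrails package are installed):

```python
# Sketch: instrument Guardrails AI so each guard invocation is recorded as a
# span alongside the rest of the trace.
from openinference.instrumentation.guardrails import GuardrailsInstrumentor
from phoenix.otel import register

tracer_provider = register(project_name="my-llm-app")  # hypothetical project name
GuardrailsInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, calls made through your guardrails.Guard objects show up in Phoenix.
```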

🚀 Production

In production, Phoenix works hand-in-hand with Arize, which focuses on the production side of the LLM lifecycle. The integration ensures a smooth transition from development to production, with consistent tooling and metrics across both platforms.

Traces in Production

Phoenix and Arize use the same collector frameworks in development and production. This allows teams to monitor latency, token usage, and other performance metrics, setting up alerts when thresholds are exceeded.

  • Quickstart: Tracing
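
Because the same OpenTelemetry export path is used in every environment, moving to production is often just a change of collector endpoint and credentials; a hedged sketch, with an illustrative URL:

```python
# Sketch: the tracing code is unchanged between environments; only the
# collector endpoint (and any auth) differ.
import os
from phoenix.otel import register

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://phoenix.example.com"  # placeholder
# os.environ["PHOENIX_API_KEY"] = "..."  # if your collector requires authentication

tracer_provider = register(
    project_name="my-llm-app",
    batch=True,  # batch span export is typically preferred in production
)
```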

Evals for Production

Phoenix's evaluation framework can be used to generate ongoing assessments of LLM performance in production. Arize complements this with online evaluations, enabling teams to set up alerts if evaluation metrics, such as hallucination rates, go beyond acceptable thresholds.

  • How to: Evals
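
A hedged sketch of a periodic eval loop over production traces: export recent spans, score them, and log the results back so they appear alongside the spans in Phoenix (the span filter and column mapping are assumptions):

```python
# Sketch: pull recent LLM spans, score them for hallucinations, and attach
# the results back to the spans; assumes arize-phoenix and arize-phoenix-evals.
import phoenix as px
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
from phoenix.trace import SpanEvaluations

client = px.Client()
spans_df = client.get_spans_dataframe("span_kind == 'LLM'")  # filter is illustrative
# In practice you may need to rename span attribute columns to the
# "input"/"reference"/"output" columns the evaluator expects.

eval_df = run_evals(
    dataframe=spans_df,
    evaluators=[HallucinationEvaluator(OpenAIModel(model="gpt-4o"))],
    provide_explanation=True,
)[0]

client.log_evaluations(SpanEvaluations(eval_name="Hallucination", dataframe=eval_df))
```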

Fine-tuning

Phoenix and Arize together help teams identify data points for fine-tuning based on production performance and user feedback. This targeted approach ensures that fine-tuning efforts are directed towards the most impactful areas, maximizing the return on investment.

Phoenix, in collaboration with Arize, empowers teams to build, optimize, and maintain high-quality LLM applications throughout the entire lifecycle. By providing a comprehensive observability platform and seamless integration with production monitoring tools, Phoenix and Arize enable teams to deliver exceptional LLM-driven experiences with confidence and efficiency.
