Agent Evaluation


Evaluating any AI application is a challenge. Evaluating an agent is even more difficult. Agents present a unique set of evaluation pitfalls to navigate. For one, agents can take inefficient paths and still get to the right solution. How do you know if they took an optimal path? For another, bad responses upstream can lead to strange responses downstream. How do you pinpoint where a problem originated?

This page will walk you through a framework for navigating these pitfalls.

How do you evaluate an AI Agent?

An agent is characterized by what it knows about the world, the set of actions it can perform, and the path it takes to complete a task. To evaluate an agent, we must evaluate each of these components.

We've built evaluation templates for every step:

  • Agent Function Calling

  • Agent Path Convergence

  • Agent Planning

  • Agent Reflection

You can evaluate the individual skills and responses using normal LLM evaluation strategies, such as Retrieval Evaluation, Classification with LLM Judges, Hallucination, or Q&A Correctness.

Read more to see the breakdown of each component.

How to Evaluate an Agent Router

Routers are one of the most common components of agents. While not every agent has a specific router node, function, or step, all agents have some method that chooses the next step to take. Routers and routing logic can be powered by intent classifiers, rules-based code, or most often, LLMs that use function calling.

To evaluate a router or router logic, you need to check:

  1. Whether the router chose the correct next step to take, or function to call.

  2. Whether the router extracted the correct parameters to pass on to that next step.

  3. Whether the router properly handles edge cases, such as missing context, missing parameters, or cases where multiple functions should be called concurrently.

Example of Router Evaluation:

Take a travel agent router for example.

  1. User Input: Help me find a flight from SF on 5/15

  2. Router function call: flight-search(date="5/15", departure_city="SF", destination_city="")

Eval                   Result
Function choice        ✅
Parameter extraction   ❌

See our Agent Function Calling evaluation template for an implementation example.
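As a rough sketch, the two router checks above can be scored with an LLM judge. The snippet below uses llm_classify from phoenix.evals with a custom, illustrative template; the column names, template wording, and judge model are assumptions rather than a prescribed setup.

```python
# Illustrative router eval with an LLM judge (not the official template).
# Assumes the arize-phoenix-evals package is installed and an OpenAI API key is set.
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

ROUTER_EVAL_TEMPLATE = """You are evaluating the router step of an AI agent.
[User input]: {question}
[Tools available]: {tool_definitions}
[Function call made by the router]: {tool_call}

Did the router choose the correct function AND extract the correct parameters
from the user input? Respond with a single word: "correct" or "incorrect".
"""

df = pd.DataFrame(
    {
        "question": ["Help me find a flight from SF on 5/15"],
        "tool_definitions": ["flight-search(date, departure_city, destination_city)"],
        "tool_call": ['flight-search(date="5/15", departure_city="SF", destination_city="")'],
    }
)

results = llm_classify(
    dataframe=df,                       # template variables are filled from these columns
    template=ROUTER_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),  # any supported judge model works here
    rails=["correct", "incorrect"],     # constrain the judge to these labels
    provide_explanation=True,           # explanations show whether the function choice or the parameters failed
)
print(results[["label", "explanation"]])
```

For finer-grained results, you could run this as two separate evals (one template for function choice, one for parameter extraction), which is how the example table above is broken down.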

How to Evaluate Agent Planning

For more complex agents, it may be necessary to first have the agent plan out its intended path ahead of time. This approach can help avoid unnecessary tool calls, or endless repeating loops as the agent bounces between the same steps.

For agents that use this approach, a common evaluation metric is the quality of the plan generated by the agent. This "quality" metric can take the form of a single overall evaluation or a set of smaller ones, but either way it should answer:

  1. Does the plan include only skills that are valid?

  2. Are Z skills sufficient to accomplish this task?

  3. Will Y plan accomplish this task given Z skills?

  4. Is this the shortest plan to accomplish this task?

Given the more qualitative nature of these evaluations, they are usually performed by an LLM Judge.

See our Agent Planning evaluation template for a specific example.
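As a sketch of how such a plan-quality judge could be wired up with llm_classify, see below; the template, column names, and example data are illustrative assumptions, not a prescribed setup.

```python
# Illustrative plan-quality eval with an LLM judge.
# Assumes arize-phoenix-evals is installed and an OpenAI API key is set.
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

PLANNING_EVAL_TEMPLATE = """You are grading the plan an AI agent produced before acting.
[Task]: {task}
[Skills the agent can use]: {skills}
[Plan]: {plan}

Does the plan use only valid skills, are those skills sufficient for the task,
and does the plan avoid redundant steps? Respond "valid" or "invalid".
"""

df = pd.DataFrame(
    {
        "task": ["Book a round-trip flight from SF to NYC in May"],
        "skills": ["flight-search, flight-book, email-user"],
        "plan": ["1) flight-search outbound 2) flight-search return 3) flight-book 4) email-user"],
    }
)

results = llm_classify(
    dataframe=df,
    template=PLANNING_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=["valid", "invalid"],   # the judge must answer with one of these labels
    provide_explanation=True,     # useful for seeing which planning question failed
)
```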

How to Evaluate Agent Skills

Skills are the individual logic blocks, workflows, or chains that an agent can call on. For example, a RAG retriever skill, or a skill to call a specific API. Skills may be written and defined by the agent's designer; increasingly, however, skills may be outside services connected to via protocols like Anthropic's MCP.

You can evaluate skills using standard LLM or code evaluations. Since you are separately evaluating the router, you can evaluate skills "in a vacuum": assume that the skill was chosen correctly and the parameters were properly defined, and focus on whether the skill itself performed correctly.

Some common skill evals are:

  • Retrieval Relevance and Hallucination for RAG skills

  • Code Generation and SQL Generation

  • Toxicity

  • User Frustration

Skills can be evaluated by LLM Judges, by comparing against ground truth, or in code, depending on the skill.
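Code-based checks work well for deterministic skills. As a sketch, a SQL-generation skill could be scored "in a vacuum" by executing the generated query against a test database and comparing the rows to a ground-truth query; the database path, schema, and queries below are hypothetical.

```python
# Illustrative code-based skill eval: score a SQL-generation skill against
# ground truth, assuming the router already passed it the right question.
# The database path, schema, and queries are hypothetical.
import sqlite3

def sql_skill_is_correct(generated_sql: str, expected_sql: str, db_path: str = "flights.db") -> bool:
    """Run both queries against a test database and compare the returned rows, ignoring order."""
    with sqlite3.connect(db_path) as conn:
        generated_rows = conn.execute(generated_sql).fetchall()
        expected_rows = conn.execute(expected_sql).fetchall()
    return sorted(generated_rows) == sorted(expected_rows)

# One example from a golden dataset of (question, expected_sql) pairs:
print(sql_skill_is_correct(
    "SELECT * FROM flights WHERE departure_city = 'SF'",
    "SELECT * FROM flights WHERE departure_city = 'SF' ORDER BY date",
))
```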

How to Evaluate an Agent's Path

Agent memory is used to store state between different components of an agent. You may store retrieved context, config variables, or any other info in agent memory. However, the most common information stored in agent memory is a log of the previous steps the agent has taken, typically formatted as LLM messages.

These messages form the best data to evaluate the agent's path.

You may be wondering why an agent's path matters. Why not just look at the final output to evaluate the agent's performance? The answer is efficiency. Could the agent have gotten the same answer with half as many LLM calls? When each step increases your cost of operation, path efficiency matters!

The main questions that path evaluations try to answer are:

  • Did the agent go off the rails and onto the wrong pathway?

  • Does it get stuck in an infinite loop?

  • Does it choose the right sequence of steps to take given a whole agent pathway for a single action?

One type of path evaluation is measuring agent convergence. This is a numerical score: the length of the optimal path divided by the average path length for similar queries.

See our Agent Convergence evaluation template for a specific example.
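A minimal sketch of that calculation, treating the shortest observed run as the optimal path (an assumption; you could also plug in a known optimal length):

```python
# Convergence = length of the optimal path / average path length across runs
# of similar queries. 1.0 means every run took the shortest path; lower values
# mean the agent is taking unnecessary steps.
from statistics import mean

def convergence_score(path_lengths: list[int]) -> float:
    """path_lengths: number of steps the agent took on each similar query."""
    optimal = min(path_lengths)  # assumption: the shortest observed run is optimal
    return optimal / mean(path_lengths)

# e.g. five runs of similar flight-search queries
print(convergence_score([4, 4, 7, 5, 10]))  # 4 / 6 ≈ 0.67
```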

How to Evaluate Agent Reflection

Reflection allows you to evaluate your agents at runtime to enhance their quality. Before declaring a task complete, a plan devised, or an answer generated, ask the agent to reflect on the outcome. If the task isn't accomplished to the standard you want, retry.

See our Agent Reflection evaluation template for a more specific example.
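A minimal sketch of such a reflection loop follows; run_agent and reflect are hypothetical stand-ins for your own agent and judge.

```python
# Hypothetical reflection loop: before declaring the task complete, ask a judge
# (e.g. an LLM using the Agent Reflection template) whether the answer meets the
# bar, and retry with its feedback if not.
MAX_ATTEMPTS = 3

def run_with_reflection(task, run_agent, reflect):
    """run_agent(task, feedback) -> answer; reflect(task, answer) -> (ok, feedback)."""
    feedback = None
    answer = None
    for _ in range(MAX_ATTEMPTS):
        answer = run_agent(task, feedback)    # incorporate prior feedback, if any
        ok, feedback = reflect(task, answer)  # judge the outcome at runtime
        if ok:
            break                             # the reflection step accepted the answer
    return answer
```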

Putting it all Together

Through a combination of the evaluations above, you can get a far more accurate picture of how your agent is performing.

For an example of using these evals in combination, see Evaluating an Agent. You can also review our agent evaluation guide.

Figures: "A common router architecture", "An example agent skill", "The steps an agent has taken, stored as messages". Source: https://www.anthropic.com/research/building-effective-agents