Arize AI for Agents

Arize AI is a powerful AI engineering platform designed to support the development, evaluation, and observability of AI agents, helping developers build robust, high-performing agents.

It has first-class support for agent frameworks such as Autogen, the OpenAI Agents SDK, LangGraph, and smolagents.

Why Arize AI for Agents?

1. Agent Observability with Auto Instrumentation

Observability is critical for understanding how agents behave in real-world scenarios. Arize AI provides robust tracing through our open-source OpenInference libraries, which automatically instrument your agent applications to capture traces and spans. This includes LLM calls, tool invocations, and data retrieval steps, giving you a detailed view of your agent's workflow.

With just a few lines of code, you can set up tracing for popular frameworks like OpenAI Agents, LangGraph, and Autogen. Learn more about Tracing.

Code Example: Auto Instrumentation for OpenAI Agents

from arize.otel import register
from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor

# Register an OpenTelemetry tracer provider that exports traces to your Arize space
tracer_provider = register(
    space_id="your-space-id",
    api_key="your-api-key",
    project_name="agents",
)

# Auto-instrument the OpenAI Agents SDK so agent runs are captured as spans
OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)
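
Once the instrumentor is registered, agent runs are traced automatically. The following is a minimal sketch of an instrumented run, assuming the openai-agents package is installed and an OpenAI API key is configured; the agent name and instructions are illustrative:

from agents import Agent, Runner

# Any agent executed after instrumentation emits spans to the "agents" project in Arize
agent = Agent(
    name="travel-assistant",  # illustrative agent
    instructions="Help users plan trips and answer travel questions.",
)
result = Runner.run_sync(agent, "Suggest a three-day itinerary for Paris.")
print(result.final_output)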

2. Agent Evaluations with Online Evals

Evaluating agent performance is essential to ensure reliability and accuracy. Arize AI's online evaluations automatically tag spans with performance labels, helping you identify problematic interactions and measure key metrics.

  • Comprehensive Evaluation Templates: Arize provides templates for evaluating various agent components, such as Tool Calling, Path Convergence, and Planning (an example of running one in code is sketched after this list).

  • Online Evals: Run continuous evaluations on production data to monitor correctness, hallucination, relevance, and latency, ensuring your agents perform consistently across diverse scenarios.

  • Custom Metrics and Alerts: Track key metrics on custom dashboards and receive alerts when performance deviates from the norm, allowing proactive optimization of agent behavior.
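
To illustrate what running one of these evaluations in code can look like, here is a minimal sketch using the open-source phoenix.evals library with llm_classify; the dataframe, template, and column names below are simplified stand-ins rather than one of Arize's built-in templates:

import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Hypothetical rows exported from Arize: one row per agent tool call
df = pd.DataFrame(
    {
        "question": ["Book a flight to Paris for next Friday."],
        "tool_call": ['search_flights(destination="CDG", date="next Friday")'],
    }
)

# Simplified stand-in template; Arize ships prebuilt templates (e.g. Agent Tool Calling)
TEMPLATE = """You are evaluating whether a tool call correctly addresses a user question.
Question: {question}
Tool call: {tool_call}
Respond with a single word, either "correct" or "incorrect"."""

evals_df = llm_classify(
    df,
    model=OpenAIModel(model="gpt-4o"),
    template=TEMPLATE,
    rails=["correct", "incorrect"],
    provide_explanation=True,
)
print(evals_df[["label", "explanation"]])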

Code Example: Logging Evaluations to Arize

# Example of logging an evaluation for an agent's response
from arize.api import Client
from arize.utils.types import Environments, ModelTypes

arize_client = Client(space_key="YOUR_SPACE_KEY", api_key="YOUR_API_KEY")
response = arize_client.log_evaluation(
    model_id="agent-model-v1",
    environment=Environments.PRODUCTION,
    model_type=ModelTypes.GENERATIVE_LLM,
    prompt="Plan a trip to Paris.",
    response="Here is a 5-day itinerary for Paris...",
    evaluation_name="Correctness",
    evaluation_score=0.9
)

3. Testing Agents in Prompt Playground with Tool Calling Support

Arize's Prompt Playground is a no-code environment for iterating on prompts and testing agent behaviors, including support for tool calling—a critical feature for agents that interact with external APIs or functions.

  • Iterate on Prompts: Test different prompt templates, models, and parameters side by side to refine how your agent responds to user inputs.

  • Tool Calling Support: Debug tool calling directly in the Playground to ensure your agent selects the right tools and parameters (an illustrative tool definition is shown after this list). Learn more about Using Tools in Playground.

  • Save as Experiment: Run systematic A/B tests on datasets to validate agent performance and share results with your team via experiments.
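
Tool calling in the Playground uses the same tool schema format as the underlying model provider. Below is a hypothetical tool definition in the OpenAI function-calling format that you might test alongside an agent prompt; the tool name and parameters are made up for illustration:

# Hypothetical tool definition (OpenAI function-calling format) for a flight-booking agent
book_flight_tool = {
    "type": "function",
    "function": {
        "name": "book_flight",
        "description": "Book a flight for the user.",
        "parameters": {
            "type": "object",
            "properties": {
                "destination": {"type": "string", "description": "Destination airport code"},
                "departure_date": {"type": "string", "description": "Departure date in YYYY-MM-DD format"},
            },
            "required": ["destination", "departure_date"],
        },
    },
}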

4. Sessions for Agent Interaction Tracking

For chatbot or multi-turn agent applications, tracking sessions is invaluable for debugging and performance analysis. Arize AI supports session tracking to group traces based on interactions.

  • Session ID and User ID: Add session.id and user.id as attributes to spans to group interactions and analyze conversation flows. This helps identify where conversations break or user frustration increases.

  • Debugging Sessions: Use the Arize platform to filter sessions and find underperforming groups of traces. Learn more about Sessions and Users.

Code Example: Adding Session ID for Agent Chatbot

from openai import OpenAI
from openinference.instrumentation import using_session

client = OpenAI()  # assumes OpenAI calls are instrumented so they emit spans

# Spans created inside this block are tagged with session.id = "chat-session-456"
with using_session(session_id="chat-session-456"):
    # Agent interaction within a session
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Book a flight to Paris."}],
        max_tokens=50,
    )
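
To attach a user ID alongside the session ID, the same openinference-instrumentation package also provides a using_attributes helper; a minimal sketch, continuing the example above:

from openinference.instrumentation import using_attributes

# Spans created inside this block carry both session.id and user.id attributes
with using_attributes(session_id="chat-session-456", user_id="user-123"):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Now book a hotel in Paris."}],
        max_tokens=50,
    )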

5. Agent Replay and Agent Pathing (Coming Soon)

  • Agent Replay: Replay agent interactions to debug agent tool calling in a controlled environment. Replay will help you simulate past sessions to test improvements without impacting live users.

  • Agent Pathing: Analyze and optimize the pathways your agents take to complete tasks. Understand whether agents are taking efficient routes or getting stuck in loops, with tools to refine planning and convergence strategies.

Additional Resources for Agent Development

  • OpenInference
  • Agent Evaluation Guide: Learn how to evaluate every component of your agent.
  • Try our Tutorials: Explore example notebooks for agents, RAG, tracing, and evaluations.
  • Watch our Paper Readings: Dive into video discussions on the latest AI research, including agent architectures.
  • Join our Slack Community: Connect with other developers to ask questions, share insights, and provide feedback on agent development with Arize.