Performance Tracing

Need help troubleshooting? ✨Copilot can help!

How to Troubleshoot Performance Monitors

Once a monitor triggers, alerting you that performance has dropped, troubleshoot it in the Performance Tracing tab within your model. Performance tracing helps you quickly understand which features and slices impact your model's performance the most so you can begin resolution.

Performance Over Time

The Performance Tracing tab immediately visualizes your performance metric over time layered on top of your model's prediction volume. This gives you a broad understanding of your model's overall performance to identify areas of improvement, compare different datasets, and examine problematic slices.

The 'Performance Over Time' graph is highly configurable. Use this graph to visualize different dimensions such as:

  • Environments: pick from production, validation, or training environments

  • Versions: pick from any model version

  • Time periods: zoom in or out on any time period for your dataset

  • Performance metrics: choose from an array of performance metrics such as accuracy, AUC, MAE, MAPE, RMSE, sMAPE, WAPE, and more (a short code sketch of a few of these follows this list).

  • Filters: layer additional filters across features, prediction values, actuals, and tags to see your model's performance on a more granular level.
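
As a point of reference for what a few of these regression-style metrics measure, here is a minimal NumPy sketch using their standard definitions; it is illustrative only and not tied to how Arize computes them.

```python
import numpy as np

def mae(actual, predicted):
    # Mean Absolute Error: average magnitude of the errors
    return np.mean(np.abs(actual - predicted))

def rmse(actual, predicted):
    # Root Mean Squared Error: penalizes large errors more heavily
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mape(actual, predicted):
    # Mean Absolute Percentage Error: errors relative to each actual value
    return np.mean(np.abs((actual - predicted) / actual)) * 100

def smape(actual, predicted):
    # Symmetric MAPE: bounds the percentage by averaging |actual| and |predicted|
    return np.mean(2 * np.abs(predicted - actual) / (np.abs(actual) + np.abs(predicted))) * 100

def wape(actual, predicted):
    # Weighted Absolute Percentage Error: total error relative to total actuals
    return np.sum(np.abs(actual - predicted)) / np.sum(np.abs(actual)) * 100

actual = np.array([100.0, 150.0, 200.0])
predicted = np.array([110.0, 140.0, 180.0])
print(mae(actual, predicted), rmse(actual, predicted), wape(actual, predicted))
```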

Add A Comparison Dataset

Send data from different environments to compare model performance between training, validation, or a different time period within your production data. Comparing against your production data helps you identify gaps in data quality or pinpoint where drift occurs, making troubleshooting simpler.

Navigate to the toolbar, click 'Add a Comparison' and pick from a different environment, version, or time period.
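
For teams sending data via the Python SDK, the rough sketch below shows how a baseline and a production dataset for the same model might be logged to different environments so they can be compared here. The space/API keys, model details, and column names are placeholders, and the exact Schema fields depend on your model type, so treat this as an assumption to check against the Pandas Batch Logging reference.

```python
import pandas as pd
from arize.pandas.logger import Client
from arize.utils.types import Environments, ModelTypes, Schema

# Placeholder credentials, model details, and data; replace with your own.
client = Client(space_key="YOUR_SPACE_KEY", api_key="YOUR_API_KEY")

schema = Schema(
    prediction_id_column_name="prediction_id",
    prediction_label_column_name="prediction",
    actual_label_column_name="actual",
    feature_column_names=["merchant_type", "transaction_amount"],
)

train_df = pd.DataFrame({
    "prediction_id": ["a1", "a2"],
    "prediction": ["fraud", "not_fraud"],
    "actual": ["fraud", "not_fraud"],
    "merchant_type": ["retail", "travel"],
    "transaction_amount": [42.0, 310.5],
})
prod_df = train_df.copy()  # stand-in for your real production records

# Log a training (baseline) dataset to compare against...
client.log(
    dataframe=train_df,
    model_id="fraud-detection",
    model_version="v1",
    model_type=ModelTypes.SCORE_CATEGORICAL,
    environment=Environments.TRAINING,
    schema=schema,
)

# ...and the production data the monitor alerted on.
client.log(
    dataframe=prod_df,
    model_id="fraud-detection",
    model_version="v1",
    model_type=ModelTypes.SCORE_CATEGORICAL,
    environment=Environments.PRODUCTION,
    schema=schema,
)
```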

Selecting Your View

Users can select between four troubleshooting views: Slice, Table, Embeddings Projector, and Output Segmentation (confusion matrix and calibration chart).

Slice View

Performance Breakdown

To identify key areas to improve with your comparison dataset, break performance issues down using Performance Insights and our Performance Heat Map.

Performance Insights

The Performance Insights panel surfaces the worst-performing slices impacting your model so you can perform a counterfactual analysis. Use Performance Insights to exclude features or slices as a filter and see how your model's performance changes.

To do this, scroll down to the 'Performance Insights' card and click on a feature. Once you click into a feature, a histogram of your feature slices will populate on the left side with options to 'Add cohort as a filter', 'Exclude cohort as a filter', and 'View explainability'.

Performance Heat Map

The performance heat map visualizes your feature's performance by slice to visually indicate the worst-performing slices within each feature. Click on the caret to the left of your feature's name to uncover its histogram.

Compare feature performance amongst different environments, versions, and filters to uncover areas of improvement. Look out for different colors and distributions between the two histograms to identify areas of missing or poor-performing data.

Once you've identified an area of interest, click on the 'View Feature Details' link to uncover a detailed view of your feature distribution over time.

Table View

The Table View enables users to see and interact with individual records in a simple table.

Data Exploration and Validation:

Get a better understanding of what your data looks like by exploring a record-level view. This is similar to running df.head() in a notebook environment. Validate the data that was sent into the platform to make sure it arrived in the correct format.
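
The checks the Table View supports mirror the kind of record-level validation you might run locally; as a rough illustration (plain pandas, not an Arize API), that might look like:

```python
import pandas as pd

df = pd.DataFrame({
    "prediction_id": ["a1", "a2", "a3"],
    "prediction": ["fraud", "not_fraud", "fraud"],
    "actual": ["fraud", "not_fraud", None],
    "transaction_amount": [42.0, 310.5, 87.2],
})

print(df.head())        # record-level view, analogous to the Table View
print(df.dtypes)        # confirm each column has the expected data type
print(df.isna().sum())  # spot missing actuals or features
```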

Column Selector:

Explore any column in your data, including features and tags, using the column selector. Customize your table view by adding, removing, and re-ordering columns.

Slide Over:

Click on a table row to open a comprehensive, prediction-level view of every column in your data for deeper analysis.

Embeddings Projector View

The Embeddings Projector view automatically surfaces the worst-performing embedding clusters for quick troubleshooting. This additional view is especially helpful when troubleshooting LLMs with prompts and responses, where switching between the Table, Embeddings Projector, and Slice views can help teams get a full picture of how their LLM is performing.
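
The projector only has data to show if embeddings (or prompt/response pairs for LLM use cases) are logged with your records. The sketch below shows roughly how embedding columns are declared in the Pandas SDK Schema; the exact field and column names are assumptions based on the EmbeddingColumnNames reference and may differ across SDK versions, so verify them before use.

```python
from arize.utils.types import EmbeddingColumnNames, Schema

schema = Schema(
    prediction_id_column_name="prediction_id",
    prediction_label_column_name="prediction",
    actual_label_column_name="actual",
    # An embedding feature: the vector plus the raw data it represents.
    embedding_feature_column_names={
        "text_embedding": EmbeddingColumnNames(
            vector_column_name="text_vector",
            data_column_name="text",
        ),
    },
    # For LLM use cases, prompt and response embeddings are declared the same way.
    prompt_column_names=EmbeddingColumnNames(
        vector_column_name="prompt_vector",
        data_column_name="prompt_text",
    ),
    response_column_names=EmbeddingColumnNames(
        vector_column_name="response_vector",
        data_column_name="response_text",
    ),
)
```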

Confusion Matrix and Calibration Chart View

Confusion Matrix

A confusion matrix summarizes all of the prediction results of a classification problem. Each cell shows the count of predictions for a combination of actual and predicted class (True Positive, True Negative, False Positive, False Negative). By laying out every possible outcome, the confusion matrix shows the ways your classification model gets confused when making predictions, which helps you identify the types of errors it makes and improve its accuracy.
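
As a small illustration of those quantities (using scikit-learn's standard definition, not Arize's implementation), the four counts for a binary fraud model can be reproduced like this:

```python
from sklearn.metrics import confusion_matrix

actual    = ["fraud", "not_fraud", "fraud", "not_fraud", "fraud"]
predicted = ["fraud", "not_fraud", "not_fraud", "fraud", "fraud"]

# Rows are actual classes, columns are predicted classes; "fraud" is the positive class.
tn, fp, fn, tp = confusion_matrix(
    actual, predicted, labels=["not_fraud", "fraud"]
).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```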

Calibration Chart

This chart plots Average Actuals against Estimated Probability. The better calibrated the model, the closer the plotted points will be to the diagonal line.

  • If the model points are below the line: your model has over-forecast in its prediction. For example, predicting that a credit card charge has a high probability of fraud when it is not fraudulent.

  • If the model points are above the line: your model has under-forecast in its prediction. For example, predicting that a credit card charge has a low likelihood of being fraudulent when it actually is.
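
To make the over/under-forecast reading concrete, here is a minimal scikit-learn sketch of the quantities such a chart plots; the fraud labels and probabilities below are made up for illustration:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Ground-truth labels (1 = fraud) and the model's estimated fraud probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.7, 0.2, 0.9, 0.6, 0.4, 0.8, 0.05, 0.55])

# Average actuals per probability bin vs. the mean estimated probability in that bin.
avg_actuals, estimated_prob = calibration_curve(y_true, y_prob, n_bins=5)

for p, a in zip(estimated_prob, avg_actuals):
    # Points with actual_rate < predicted indicate over-forecasting;
    # points with actual_rate > predicted indicate under-forecasting.
    print(f"predicted={p:.2f} actual_rate={a:.2f}")
```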

A performance slice is a subset of model values formed from any model dimension, such as a specific time period or set of features. Learn more about slices in the Arize documentation.
