Copyright © 2025 Arize AI, Inc

Data Distribution Visualization

Configure the visualization of your data for distribution comparisons when troubleshooting drift.

Last updated 1 year ago

Distribution comparisons are useful both for visualizing data and for calculating drift metrics such as PSI.

For categorical values, Arize simply calculates the percentage of data that falls under each unique value and displays the data in descending order of data volume.
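
As a rough illustration of the categorical case, the sketch below computes the share of each unique value and orders the result by volume. The function name `categorical_distribution` is hypothetical and this is not Arize's implementation, only a minimal restatement of the description above.

```python
from collections import Counter


def categorical_distribution(values):
    """Return (value, share) pairs ordered by descending volume."""
    counts = Counter(values)
    total = len(values)
    # most_common() sorts by count, largest first
    return [(value, count / total) for value, count in counts.most_common()]


states = ["CA", "NY", "CA", "TX", "CA", "NY"]
distribution = categorical_distribution(states)
```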

For numeric features, it often makes sense to group the values into bins, in order to show a useful summary of the data. However, there is no one-size-fits-all strategy for numeric binning that will work for a wide variety of data shapes. We will cover the best option for each use case below.

Example of categorical bins

Changing binning for numeric features

We offer 4 types of binning for numeric features:

  • Median centered bins
  • Discrete bins
  • Equal width bins
  • Custom bins

You can try out different visualizations on the feature details page. When you change the binning option there, the new binning is applied to that feature across the platform.

This will affect:

  1. PSI calculations

  2. Drift monitors - both visualization and PSI calculations

  3. Performance tracing breakdown for that feature

  4. Model overview page (PSI value)
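
To see why the binning choice changes PSI, the sketch below applies the standard PSI formula (the sum over bins of (current − baseline) × ln(current / baseline)) to the same two samples under two different sets of bin edges. The helper names, the sample data, and the small `eps` used for empty bins are assumptions made for illustration; Arize's handling of empty bins may differ.

```python
import math
from bisect import bisect_right


def bin_percentages(values, edges):
    """Share of values in each [lo, hi) bin; the last bin includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_right(edges, v) - 1
        # Clamp out-of-range values into the first or last bin
        i = max(0, min(i, len(counts) - 1))
        counts[i] += 1
    return [c / len(values) for c in counts]


def psi(baseline_pcts, current_pcts, eps=1e-6):
    """Population Stability Index; eps avoids log(0) for empty bins (an assumption)."""
    total = 0.0
    for b, c in zip(baseline_pcts, current_pcts):
        b, c = max(b, eps), max(c, eps)
        total += (c - b) * math.log(c / b)
    return total


baseline = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
current = [2, 3, 3, 4, 4, 4, 5, 5, 6, 6]

coarse_edges = [1, 4, 6]
fine_edges = [1, 2, 3, 4, 5, 6]

psi_coarse = psi(bin_percentages(baseline, coarse_edges),
                 bin_percentages(current, coarse_edges))
psi_fine = psi(bin_percentages(baseline, fine_edges),
               bin_percentages(current, fine_edges))
```

The two results differ even though the underlying data is identical, which is why a binning change carries through to monitors and the model overview page.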

Quick guide:

Binning Options for Data Visualization:

Median centered binning (default)

For numeric features only

This is our default binning method. It works well for normally distributed data and also handles highly skewed data.

This method creates up to 10 bins, with the following constraints:

  1. The center of the bins (the division between bins 5 and 6) is at the median.

  2. The 8 center bins have equal width. The width of each bin is ⅓ of the standard deviation of the data. These are the purple squares below.

  3. The edge bins have variable width and end at the min/max of the dataset in order to account for long tails. These are the red rectangles below.

  4. Edge bins with zero data are removed, which can produce fewer than 10 bins.
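
The four constraints above can be sketched in a few lines of Python. This is an illustrative approximation, not Arize's code: the function name `median_centered_bins` is hypothetical, the use of the population standard deviation is an assumption, and so is the choice of half-open `[lo, hi)` bins. The input is assumed to be non-empty.

```python
import statistics
from bisect import bisect_right


def median_centered_bins(values):
    """Return (edges, counts) following the median centered constraints."""
    data = sorted(values)
    median = statistics.median(data)
    # Each center bin is one third of a standard deviation wide
    width = statistics.pstdev(data) / 3
    if width == 0:
        # All values identical: a single bin holds everything
        return [data[0], data[-1]], [len(data)]

    # 9 interior edges define the 8 equal-width center bins;
    # the middle edge (k = 0) sits exactly at the median
    edges = [median + k * width for k in range(-4, 5)]

    # Variable-width edge bins stretch to the min and max to cover long tails
    if data[0] < edges[0]:
        edges.insert(0, data[0])
    if data[-1] > edges[-1]:
        edges.append(data[-1])

    counts = [0] * (len(edges) - 1)
    for v in data:
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1

    # Remove empty bins on either edge
    while counts[0] == 0:
        counts.pop(0)
        edges.pop(0)
    while counts[-1] == 0:
        counts.pop()
        edges.pop()
    return edges, counts


edges, counts = median_centered_bins(range(1, 101))
```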

This works very well for most normally distributed data, even if there is a long tail. Take the annual income feature in our model below. Income is normally distributed within a range, with a long tail on the right for high earners. In Arize, this feature is binned like this:

The majority of the data is centered around the median income of 43k, while about 30% of the data falls into the left and right edge bins.

Discrete bins

For numeric or categorical features

Discrete bins allow users to see each value independently in the distribution chart. Note that for categorical features, this is the only binning option.

For numeric features, this works particularly well for these use cases:

Booleans or IDs

Sometimes, a boolean value or an ID may be expressed as an integer. Since the numeric value of these features is not actually meaningful, the median centered bins described above would not produce a useful view.

For example, this is what a boolean value looks like with median centered bins.

By choosing discrete bins, you can easily see the distribution of the only two values for this feature, 0 and 1:

In this example, we have an ID for a type of procedure, encoded as an integer. Median centered bins combine multiple values because they are numerically close, even though as an ID they may have no relationship.

By choosing discrete bins, users can see the frequency of each ID independently.

Small integer values

For small integer ranges, such as a count, discrete bins offer a more detailed view of the data than median centered bins.

For example, this is a count of the orders in a day. With a small number of unique values, discrete bins offer a more granular view of the data.

Equal width bins

For numeric features only

This option creates equal width bins. The bin width is simply (max - min) / num_bins, where num_bins is specified by the user.

This option is useful for fixed numeric ranges, for example, FICO scores.
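
A minimal sketch of the equal width calculation, using the `(max - min) / num_bins` formula above. The function name `equal_width_edges` and the sample scores are made up for illustration.

```python
def equal_width_edges(values, num_bins):
    """Return num_bins + 1 bin edges spanning min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    # Append hi directly so the last edge is exact despite float rounding
    return [lo + i * width for i in range(num_bins)] + [hi]


scores = [300, 512, 640, 705, 850]
edges = equal_width_edges(scores, num_bins=11)
```

With these sample scores and 11 bins, each bin is 50 points wide.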

Custom bins

For numeric features only

Custom bins offer ultimate control over the visualization of numeric data. This is helpful when you already know how to visualize your data, either from prior analysis, or from a business perspective where certain cutoffs already exist.

Using the same FICO score example, creditors may have certain cutoffs for FICO scores. Say, a FICO score below 500 results in an automatic application rejection. For scores above 500, every 20 points results in a better interest rate than the previous bucket.

Aligning the binning strategy with business logic ensures the drift visualization is relevant.
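
The FICO example could be expressed as a list of custom bin edges like the sketch below. The 500 cutoff and the 20 point steps come from the example above; the 300 and 850 bounds are the conventional FICO range and are assumed here, and `custom_bin_index` is a hypothetical helper, not part of any Arize SDK.

```python
from bisect import bisect_right

# Conventional FICO range bounds (an assumption for this sketch)
FICO_MIN, FICO_MAX = 300, 850

# One bin for automatic rejections, then one bin per 20 points above 500
custom_edges = [FICO_MIN, 500] + list(range(520, FICO_MAX, 20)) + [FICO_MAX]


def custom_bin_index(score, edges):
    """Index of the [lo, hi) bin containing score; the top edge falls in the last bin."""
    i = bisect_right(edges, score) - 1
    return max(0, min(i, len(edges) - 2))


rejected_bin = custom_bin_index(480, custom_edges)
first_rate_bin = custom_bin_index(500, custom_edges)
```

Treating each bin as `[lo, hi)` keeps a score of exactly 500 out of the rejection bin, matching the "below 500" wording of the rule.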

If you have approximately normally distributed data, use median centered bins (the default).

If you have a feature that encodes a boolean or an ID, use discrete bins.

If you have a feature that’s represented by only a small range of integers, such as the number of actions in a day, try discrete bins.

If you want to view your feature with exactly equal width bins, use the equal width bins option.

If you already know your binning strategy or have business logic with hard cutoff points, use custom bins.

Boolean value with median centered bins
Boolean value with discrete bins shows the only 2 values
ID with median centered bins combines numerically close values
IDs with discrete binning show each ID independently
Median centered bins group values together
Discrete bins provide a more granular view
With median centered bins, the shape of the data is somewhat hidden by wide edge bins
With equal width bins, users can more clearly see the shape of the data
Normal, bell-shaped curve, showing % of cases in 8 portions of the curve