Data Distribution Visualization
Configure the visualization of your data for distribution comparisons when troubleshooting drift.
Copyright © 2023 Arize AI, Inc
Distribution comparisons are useful both for visualizing data and for calculating drift metrics such as the population stability index (PSI).
For categorical values, Arize calculates the percentage of data that falls under each unique value and displays the values in descending order of data volume.
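The categorical case can be sketched in a few lines of Python. This is only an illustration of the percentage-per-value idea, not Arize's implementation; the function name `categorical_distribution` and the sample data are hypothetical.

```python
from collections import Counter

def categorical_distribution(values):
    """Return (value, percentage) pairs sorted by descending data volume."""
    counts = Counter(values)
    total = sum(counts.values())
    # most_common() orders by count, highest first
    return [(value, 100.0 * n / total) for value, n in counts.most_common()]

# Hypothetical categorical feature
states = ["CA", "NY", "CA", "TX", "CA", "NY"]
dist = categorical_distribution(states)
for value, pct in dist:
    print(f"{value}: {pct:.1f}%")
```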
For numeric features, it often makes sense to group the values into bins, in order to show a useful summary of the data. However, there is no one-size-fits-all strategy for numeric binning that will work for a wide variety of data shapes. We will cover the best option for each use case below.
We offer 4 types of binning for numeric features: median centered bins (the default), discrete bins, equal width bins, and custom bins.
You can try out different visualizations in the feature details page. When you change your binning option, you will be able to update binning for that feature across the platform.
This will affect:
PSI calculations
Drift monitors - both visualization and PSI calculations
Performance tracing breakdown for that feature
Model overview page (PSI value)
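To see why the binning choice matters for PSI, here is a minimal sketch using the standard PSI formula, the sum over bins of (current - baseline) * ln(current / baseline). The `eps` floor for empty bins is an assumption made for this example and may differ from how Arize handles zero-count bins; the function name and the proportions are hypothetical.

```python
import math

def psi(baseline_pcts, current_pcts, eps=1e-6):
    """Population stability index over two aligned lists of bin proportions."""
    total = 0.0
    for b, c in zip(baseline_pcts, current_pcts):
        b = max(b, eps)  # assumption: floor empty bins to avoid log(0)
        c = max(c, eps)
        total += (c - b) * math.log(c / b)
    return total

# Same hypothetical data, viewed with three bins...
fine_score = psi([0.25, 0.50, 0.25], [0.20, 0.50, 0.30])
# ...and with the first two bins merged into one
coarse_score = psi([0.75, 0.25], [0.70, 0.30])
print(round(fine_score, 4), round(coarse_score, 4))
```

Merging bins hides the shift between them, so the coarser binning reports a smaller PSI for the same underlying data.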
If you have approximately normally distributed data, use median centered bins (the default).
If you have a feature that encodes a boolean or an ID, use discrete bins.
If you have a feature that’s represented by only a small range of integers, such as number of actions in a day, try discrete bins.
If you want to view your feature with exactly equal width bins, use the equal width bins option.
If you already know your binning strategy or have business logic with hard cutoff points, use custom bins.
Median Centered Bins
For numeric features only
This is our default binning method. It works well for normally distributed data and also handles highly skewed data.
This method creates up to 10 bins, with the following constraints:
The center of the bins (the division between bins 5 and 6) is at the median.
The 8 center bins have equal width. The width of each bin is ⅓ of the standard deviation of the data. These are the purple squares below.
The edge bins have variable width and end at the min/max of the dataset in order to account for long tails. These are the red rectangles below.
Edge bins with zero data are removed, so the result may have fewer than 10 bins.
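The constraints above can be sketched as follows. This is an illustrative reading of the rules, not Arize's code: it assumes the population standard deviation, half-open bins with the last bin closed at the max, and that empty bins are trimmed from either end. The function name `median_centered_bins` is hypothetical.

```python
import statistics
from bisect import bisect_right

def median_centered_bins(values):
    """Return (edges, counts) for up to 10 bins centered on the median."""
    med = statistics.median(values)
    width = statistics.pstdev(values) / 3  # assumption: population std dev
    # 9 inner edges give the 8 equal-width center bins; inner[4] is the median
    inner = [med + k * width for k in range(-4, 5)]
    # Tail bins stretch to the min and max of the data
    edges = [min(min(values), inner[0])] + inner + [max(max(values), inner[-1])]

    counts = [0] * (len(edges) - 1)
    for v in values:
        # Bin i covers [edges[i], edges[i+1]); the last bin also includes the max
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1

    # Assumption: trim bins with zero data from either end
    while counts and counts[0] == 0:
        counts.pop(0)
        edges.pop(0)
    while counts and counts[-1] == 0:
        counts.pop()
        edges.pop()
    return edges, counts

values = list(range(1, 22))  # hypothetical feature values 1..21, median 11
edges, counts = median_centered_bins(values)
print(len(counts), counts)
```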
This works very well for most normally distributed data, even if there is a long tail. Take the annual income feature in our model below. Income is normally distributed within a range, with a long tail on the right for high earners. In Arize, this feature is binned like this:
The majority of the data is centered around median income of 43k, while about 30% of the data falls into the left and right edge bins.
Discrete Bins
For numeric or categorical features
Discrete bins allow users to see each value independently in the distribution chart. Note that for categorical features, this is the only binning option.
For numeric features, this works particularly well for these use cases:
Sometimes, a boolean value or an ID may be expressed as an integer. Since the numeric value of these features is not actually relevant, the median centered bins described above would not produce a meaningful chart.
For example, this is what a boolean value looks like with median centered bins.
By choosing discrete bins, you can easily see the distribution of the only two values for this feature, 0 and 1:
In this example, we have an ID for a type of procedure, encoded as an integer. Median centered bins combine multiple values because they are numerically close, even though as an ID they may have no relationship.
By choosing discrete bins, users can see the frequency of each ID independently.
For small integer ranges, such as a count, discrete bins offer a more detailed view of the data than median centered bins.
For example, this is a count of the orders in a day. With a small number of unique values, discrete bins offer a more granular view of the data.
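A rough sketch of discrete binning for a numeric feature: each unique value becomes its own bin, so numerically close IDs are never merged. Ordering the bins by value is an assumption made for this example, and `discrete_bins` and the sample data are hypothetical.

```python
from collections import Counter

def discrete_bins(values):
    """Map each unique value to its share of the data, ordered by value."""
    counts = Counter(values)
    total = len(values)
    return {v: counts[v] / total for v in sorted(counts)}

# A boolean encoded as an integer
is_fraud = [0, 0, 1, 0, 1, 0, 0, 0]
fraud_dist = discrete_bins(is_fraud)

# An ID encoded as an integer: 101 and 102 stay separate despite being close
procedure_ids = [101, 102, 101, 250, 102, 101]
id_dist = discrete_bins(procedure_ids)
print(fraud_dist, id_dist)
```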
Equal Width Bins
For numeric features only
This option creates equal width bins. The bin width is simply (max - min) / num_bins, where num_bins is specified by the user.
This option is useful for fixed numeric ranges, for example, FICO scores.
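As a quick illustration of the width formula, the sketch below derives edges and counts for a handful of made-up FICO scores. It assumes min and max are taken from the data itself; the function name `equal_width_bins` is hypothetical.

```python
def equal_width_bins(values, num_bins):
    """Return (edges, counts) using width = (max - min) / num_bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    edges = [lo + i * width for i in range(num_bins + 1)]

    counts = [0] * num_bins
    for v in values:
        # Guard against zero width when all values are identical
        i = int((v - lo) / width) if width > 0 else 0
        counts[min(i, num_bins - 1)] += 1  # the max falls in the last bin
    return edges, counts

fico = [300, 520, 610, 680, 700, 745, 790, 850]  # made-up scores
edges, counts = equal_width_bins(fico, 5)
print(edges, counts)
```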
Custom Bins
For numeric features only
Custom bins offer full control over the visualization of numeric data. This is helpful when you already know how to visualize your data, either from prior analysis or from a business perspective where certain cutoffs already exist.
Using the same FICO score example, creditors may have certain cutoffs for FICO scores. Say, a FICO score below 500 results in an automatic application rejection. For scores above 500, every 20 points results in a better interest rate than the previous bucket.
Aligning the binning strategy with business logic ensures the drift visualization is relevant.
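The FICO example could translate into custom bin edges along these lines. This is a sketch under assumptions: scores span the standard 300 to 850 range, a score of exactly 500 falls in the first 20-point bucket, and the helper names `fico_edges` and `assign_bin` are hypothetical.

```python
from bisect import bisect_right

def fico_edges(min_score=300, cutoff=500, step=20, max_score=850):
    """One bin below the rejection cutoff, then one bin per `step` points."""
    edges = [min_score] + list(range(cutoff, max_score, step))
    edges.append(max_score)
    return edges

def assign_bin(score, edges):
    """Index of the bin [edges[i], edges[i+1]) holding score; last bin includes the max."""
    i = bisect_right(edges, score) - 1
    return max(0, min(i, len(edges) - 2))

edges = fico_edges()
for score in (480, 500, 519, 520, 850):
    print(score, "-> bin", assign_bin(score, edges))
```

Bin 0 corresponds to automatic rejection, and each later bin corresponds to one interest rate tier, so a shift between bins maps directly to a change in business outcome.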