UMAP & Cluster Analysis
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that belongs to the neighbor graph category. Arize uses it to project your dataset's embedding vectors into a lower-dimensional (2D or 3D) representation.
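To make the "neighbor graph" idea concrete, here is a minimal sketch of the first step such methods share: linking each embedding vector to its nearest neighbors. It is an illustration only, not Arize's or UMAP's implementation; the function name `knn_graph` and the toy vectors are assumptions made for this example.

```python
import math

def knn_graph(vectors, n_neighbors):
    """Map each point index to the indices of its n_neighbors closest points."""
    graph = {}
    for i, v in enumerate(vectors):
        # Euclidean distance from point i to every other point, closest first.
        distances = sorted(
            (math.dist(v, w), j) for j, w in enumerate(vectors) if j != i
        )
        graph[i] = [j for _, j in distances[:n_neighbors]]
    return graph

# Toy 3-dimensional "embeddings" (hypothetical data): two tight pairs far apart.
embeddings = [
    (0.0, 0.0, 0.1),
    (0.1, 0.0, 0.0),
    (5.0, 5.0, 5.0),
    (5.1, 5.0, 4.9),
]
graph = knn_graph(embeddings, n_neighbors=1)
print(graph)
```

UMAP then lays the points out in 2D or 3D so that points linked in this graph stay close together, which is why nearby points in the plot can be read as similar.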
- 1. Select your data and generate UMAP
- Select a point on the drift visualization at the top of the page, then generate UMAP to visualize the data at the selected point in time.
- 2. Investigate your worst drifted clusters
You can select a specific cluster to investigate further and view the data associated with it.
- Clusters are groups of related points in the point cloud. The closer points or clusters are to each other, the more similar they are. Clusters make it easy to spot data that differs from your baseline and to visualize both the global and local structure of your data.
- 3. Investigate the data points that belong to that cluster
When you select a cluster, Arize surfaces the associated data (shown on the right), and you can investigate any data point further by clicking "View Details".
- 4. Colorize/Filter your data
UMAP helps users identify patterns and structure in the data. Applying colorizations and filters helps explain where the model can be improved.
Users can colorize and filter the UMAP visualization:
- By Dataset: points will be colored based on whether they belong to the baseline or primary dataset.
- By Prediction Label/Score: points will be colored according to the prediction label/score obtained from the model.
- By Actual Label/Score: points will be colored according to the actual (ground-truth) label/score recorded for each prediction.
- By Correctness: points will be colored based on whether or not the prediction was correct (i.e., does the prediction label match the actual label).
- By Confusion Matrix: after selecting a positive class, points will be colored by their confusion matrix value (true positive, true negative, false positive, or false negative).
- By Tag: identify patterns or insights in slices of data by choosing to color by tag.
- By Feature: identify patterns or insights in slices of data by choosing to color by feature.
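The correctness and confusion matrix colorizations above can be sketched as follows. This is a minimal illustration of how each point could be assigned a category from its prediction label, actual label, and a chosen positive class; the function names and the sample labels are hypothetical and do not reflect Arize's internal code.

```python
def correctness(prediction_label, actual_label):
    """Category used when coloring by correctness."""
    return "correct" if prediction_label == actual_label else "incorrect"

def confusion_value(prediction_label, actual_label, positive_class):
    """Category used when coloring by confusion matrix for a positive class."""
    predicted_positive = prediction_label == positive_class
    actually_positive = actual_label == positive_class
    if predicted_positive and actually_positive:
        return "true positive"
    if predicted_positive:
        return "false positive"
    if actually_positive:
        return "false negative"
    return "true negative"

# Hypothetical (prediction, actual) pairs, with "sandal" as the positive class.
points = [
    ("sandal", "sandal"),
    ("sandal", "sneaker"),
    ("sneaker", "sandal"),
    ("sneaker", "sneaker"),
]
for prediction, actual in points:
    print(
        prediction,
        actual,
        correctness(prediction, actual),
        confusion_value(prediction, actual, positive_class="sandal"),
    )
```

Note that a point can be "correct" under the correctness coloring and still be a "true negative" under the confusion matrix coloring, because the latter depends on which class is selected as positive.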
Users can configure their UMAP generation by these parameters:
- Dimensions: choose between a 2D or a 3D plot.
- nNeighbors: controls how UMAP balances local versus global structure in the data. More specifically, it sets how many neighbors UMAP looks at to define a local region. Lower values put more focus on the local structure of the dataset; higher values put more focus on the global structure. Allowed values go up to 100.
- minDist: sets the minimum distance apart that points are allowed to be in the low-dimensional representation. Allowed values go up to 0.99.
- Sample size: the number of points in the UMAP plot per dataset. For example, if you select 500, there will be 1000 points total in the plot (500 from each of the two datasets). Allowed values range from 300 to 2500.
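As a rough sketch of how these parameters fit together, the snippet below models them as a small settings object that checks only the limits stated above. The class name `UmapSettings`, its field names, and the example values for nNeighbors and minDist are hypothetical; this is not an Arize API. Lower bounds for nNeighbors and minDist are not checked because they are not stated here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UmapSettings:
    """Hypothetical container for the UMAP parameters described above."""
    dimensions: int
    n_neighbors: int
    min_dist: float
    sample_size: int

    def __post_init__(self):
        if self.dimensions not in (2, 3):
            raise ValueError("dimensions must be 2 or 3")
        if self.n_neighbors > 100:
            raise ValueError("nNeighbors must not exceed 100")
        if self.min_dist > 0.99:
            raise ValueError("minDist must not exceed 0.99")
        if not 300 <= self.sample_size <= 2500:
            raise ValueError("sample size must be between 300 and 2500")

    def total_points(self, dataset_count=2):
        # Sample size applies per dataset (baseline and primary by default).
        return self.sample_size * dataset_count

# Illustrative values only; 30 and 0.1 are not documented defaults.
settings = UmapSettings(dimensions=2, n_neighbors=30, min_dist=0.1, sample_size=500)
print(settings.total_points())
```

With a sample size of 500 and two datasets, `total_points()` gives the 1000 points mentioned in the sample size description.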
Clustering is the process of grouping similar data points together. The goal of clustering is to find patterns and structure in a data set and to divide the data points into groups, or clusters, that share certain characteristics.
Our clustering algorithm is an unsupervised learning technique, which means that it works on unlabelled data and finds patterns on its own.
Clusters help you find patterns and structure in your dataset. You can troubleshoot performance degradation by examining the underlying data in the form of clusters and use these insights to improve your model's performance. For example:
- You might realize that your model is confusing two similar classes (e.g., sandals and sneakers).
- You might have a cluster with a drift score close to -1, meaning that the model is seeing production data that is unlike the training data.
The drift score measures the reference data coverage present in a given cluster or point cloud. A score of -1 means that the cluster contains only primary (production) data. A score of 1 means that the cluster contains only baseline data. A score of 0 means the cluster is composed equally of baseline and primary data. The white and blue bars represent the count from each dataset in that cluster.
In this example, the cluster contains 16 points in production, and 55 points in the baseline dataset.
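The exact formula behind the drift score is not given here. As a sketch, one simple formula that matches the three reference values above (-1, 0, and 1) is the difference between the baseline and primary counts divided by their sum. The function below is an assumption for illustration only; the score Arize actually computes may differ.

```python
def drift_score(baseline_count, primary_count):
    """Assumed formula: (baseline - primary) / (baseline + primary)."""
    total = baseline_count + primary_count
    if total == 0:
        raise ValueError("cluster contains no points")
    return (baseline_count - primary_count) / total

print(drift_score(0, 40))    # only primary (production) data
print(drift_score(40, 0))    # only baseline data
print(drift_score(20, 20))   # equal mix of both datasets
print(drift_score(55, 16))   # the example cluster: 55 baseline, 16 production
```

Under this assumed formula, the example cluster scores 39/71, or roughly 0.55, which would indicate a cluster made up mostly of baseline data.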