Embedding & Cluster Analyzer
Copyright © 2023 Arize AI, Inc
Embedding projectors are a useful tool for visualizing and interpreting embeddings. To plot embeddings, we first have to apply an algorithm that reduces their dimensionality to 2D or 3D. UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that belongs to the neighbor graph category. Arize uses it to create a lower-dimensional representation (2D or 3D) of your dataset, which is represented by embedding vectors.
Learn more about UMAP in comparison to other neighbor graph algorithms here.
Select your data and Generate UMAP
Select a point on the drift visualization at the top of the page, then generate UMAP to visualize the data at the selected point in time.
Investigate your worst drifted clusters
Clusters are groups of related points in the point cloud. The closer points or clusters are to each other, the more similar they are. Clusters let you easily see data that differs from your baseline, and they also let you visualize the global and local structure of your data.
Investigate the data points that belong in that cluster
Colorize/Filter your data
UMAP enables users to identify patterns or structure in the data and, by applying colorizations and filters, to explain where the model can be improved.
Users can colorize and filter the UMAP visualization:
By Dataset: points will be colored based on if they belong to the baseline or primary dataset.
By Prediction Label/Score: points will be colored according to the prediction label/score obtained from the model.
By Actual Label/Score: points will be colored according to the actual (ground truth) label/score.
By Correctness: points will be colored based on whether or not the prediction was correct (i.e., does the prediction label match the actual label).
By Confusion Matrix: after selecting a positive class, points will be colored by their confusion matrix value (true positive, true negative, false positive, or false negative).
By Tag: identify patterns or insights in slices of data by choosing to color by tag.
By Feature: identify patterns or insights in slices of data by choosing to color by feature.
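The Correctness and Confusion Matrix colorizations described above can be sketched in a few lines. This is only an illustration of the logic, not Arize's implementation: the function names, the category strings, and the sample labels are all hypothetical.

```python
def correctness_color(prediction, actual):
    """Color by correctness: does the prediction label match the actual label?"""
    return "correct" if prediction == actual else "incorrect"


def confusion_color(prediction, actual, positive_class):
    """Color by confusion matrix value, relative to a selected positive class."""
    predicted_positive = prediction == positive_class
    actually_positive = actual == positive_class
    if predicted_positive and actually_positive:
        return "true positive"
    if predicted_positive:
        return "false positive"
    if actually_positive:
        return "false negative"
    return "true negative"


# Hypothetical (prediction label, actual label) pairs for four points.
points = [
    ("sandal", "sandal"),
    ("sandal", "sneaker"),
    ("sneaker", "sandal"),
    ("sneaker", "sneaker"),
]

for prediction, actual in points:
    print(
        prediction,
        actual,
        correctness_color(prediction, actual),
        confusion_color(prediction, actual, positive_class="sandal"),
    )
```

Note that the confusion matrix value depends on which class is selected as positive, while correctness does not.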
Users can configure their UMAP generation by these parameters:
Dimensions: choose between a 2D or a 3D plot.
nNeighbors: controls how UMAP balances local versus global structure in the data. More specifically, it sets how many neighbors UMAP looks at to define a local region. The lower the value of nNeighbors, the more focus is put on the local structure of the dataset; the higher the value, the more focus is put on the global structure. Allowed values range from 5 to 100. Learn more here.
minDist: the minimum distance apart that points are allowed to be in the low-dimensional representation. Allowed values range from 0.0 to 0.99. Learn more here.
Sample size: the number of points in the UMAP plot per dataset. For example, if you select 500 and the plot includes both a baseline and a primary dataset, there will be 1000 points total in the plot. Allowed values range from 300 to 2500.
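As a minimal sketch, the parameters and allowed ranges listed above can be captured in a small settings object. The class name `UmapSettings`, its methods, and the default values are assumptions made for illustration (the defaults for `n_neighbors` and `min_dist` are borrowed from the open-source umap-learn library, not from Arize). The keyword names returned by `to_umap_learn_kwargs` are the ones umap-learn's `UMAP` class uses for the equivalent settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UmapSettings:
    """Hypothetical container for the UMAP options described in the text."""

    dimensions: int = 2       # 2D or 3D plot
    n_neighbors: int = 15     # nNeighbors, allowed range 5 to 100
    min_dist: float = 0.1     # minDist, allowed range 0.0 to 0.99
    sample_size: int = 500    # points per dataset, allowed range 300 to 2500

    def __post_init__(self):
        # Reject values outside the ranges stated in the documentation.
        if self.dimensions not in (2, 3):
            raise ValueError("dimensions must be 2 or 3")
        if not 5 <= self.n_neighbors <= 100:
            raise ValueError("n_neighbors must be between 5 and 100")
        if not 0.0 <= self.min_dist <= 0.99:
            raise ValueError("min_dist must be between 0.0 and 0.99")
        if not 300 <= self.sample_size <= 2500:
            raise ValueError("sample_size must be between 300 and 2500")

    def total_points(self, n_datasets=2):
        """Sample size applies per dataset, so the plot holds sample_size * n_datasets points."""
        return self.sample_size * n_datasets

    def to_umap_learn_kwargs(self):
        """Map the settings onto umap-learn's keyword names."""
        return {
            "n_components": self.dimensions,
            "n_neighbors": self.n_neighbors,
            "min_dist": self.min_dist,
        }


settings = UmapSettings(dimensions=3, n_neighbors=30, min_dist=0.25, sample_size=500)
print(settings.total_points())          # 1000 with a baseline and a primary dataset
print(settings.to_umap_learn_kwargs())
```

Validating at construction time means an out-of-range value fails immediately rather than when the projection is generated.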
Clustering is the process of grouping similar data points together. The goal of clustering is to find patterns and structure in a data set and to divide the data points into groups, or clusters, that share certain characteristics.
Our clustering algorithm is an unsupervised learning technique, which means that it works on unlabelled data and finds patterns on its own.
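To make the idea of unsupervised clustering concrete, here is a minimal sketch that groups unlabelled 2D points using k-means. K-means is used only as a simple stand-in: this page does not say which clustering algorithm Arize uses, and the function name, the initialization strategy, and the sample points are all assumptions for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Tiny k-means: returns a cluster label per point and the final centroids."""
    # Deterministic initialization: use the first k points as starting centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical projected points: two visibly separate groups, no labels given.
points = [
    (0.0, 0.1), (5.0, 5.1),
    (0.2, 0.0), (5.2, 4.9),
    (0.1, 0.3), (4.9, 5.0),
]

labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The algorithm never sees a label; it recovers the two groups purely from the distances between points.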
Clusters help you find patterns and structure in your dataset. You can troubleshoot performance degradation by examining the underlying data in the form of clusters and use these insights to improve your model's performance.
Examples:
You might realize that your model is confusing two classes that are similar (e.g., sandals and sneakers).
You have a cluster with a drift score close to -1, meaning that the model is seeing production data that is unlike the training data.
The drift score measures the reference data coverage present in a given cluster or point cloud. A score of -1 means that the cluster contains only primary, or production, data. A score of 1 means that the cluster contains only baseline data. A score of 0 means the cluster is equally composed of baseline and primary data. The white and blue bars represent the count in each dataset in that cluster.
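One simple formula that reproduces the three reference values above (-1, 0, and 1) is the normalized difference between the baseline and primary counts in a cluster. This is an assumption chosen to match the description, not Arize's documented calculation, which may for instance also account for the relative sizes of the two datasets. The function name is hypothetical.

```python
def cluster_drift_score(n_baseline, n_primary):
    """Assumed drift score: (baseline - primary) / (baseline + primary).

    Returns -1.0 for a primary-only cluster, 1.0 for a baseline-only
    cluster, and 0.0 when both datasets contribute equally.
    """
    total = n_baseline + n_primary
    if total == 0:
        raise ValueError("cluster is empty")
    return (n_baseline - n_primary) / total


print(cluster_drift_score(n_baseline=0, n_primary=40))   # -1.0: only production data
print(cluster_drift_score(n_baseline=25, n_primary=25))  # 0.0: evenly mixed
print(cluster_drift_score(n_baseline=40, n_primary=0))   # 1.0: only baseline data
```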
After choosing your desired cluster metric (e.g. euclidean distance, accuracy, custom metric, etc.), Arize automatically surfaces the clusters you should focus on for model improvement / troubleshooting so you can quickly find the root cause.
You can select the metric you want to use, and how you want the clusters to be sorted.
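Choosing a metric and a sort order amounts to ranking per-cluster summaries, which can be sketched as follows. The cluster records, field names, and values are hypothetical and only illustrate the idea; they are not an Arize export format.

```python
from operator import itemgetter

# Hypothetical per-cluster summaries.
clusters = [
    {"id": "A", "drift_score": -0.9, "accuracy": 0.62},
    {"id": "B", "drift_score": 0.1, "accuracy": 0.91},
    {"id": "C", "drift_score": -0.4, "accuracy": 0.48},
]


def rank_clusters(clusters, metric, descending=False):
    """Sort clusters by the selected metric, ascending unless descending=True."""
    return sorted(clusters, key=itemgetter(metric), reverse=descending)


# Lowest accuracy first: the clusters where the model performs worst.
worst_accuracy_first = [c["id"] for c in rank_clusters(clusters, "accuracy")]

# Lowest drift score first: the clusters dominated by production data.
most_drifted_first = [c["id"] for c in rank_clusters(clusters, "drift_score")]

print(worst_accuracy_first)  # ['C', 'A', 'B']
print(most_drifted_first)    # ['A', 'C', 'B']
```

Different metrics can surface different clusters first, which is why the metric and the sort direction are chosen together.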
Once a cluster that is impacting model performance has been identified, users can download the data in the cluster for active learning. This data includes all the information needed for labeling workflows. These clusters are highly focused groups of datapoints, enabling labeling teams to be more precise in their efforts.
Arize enables teams to continue analysis of their production data in notebooks.
Learn more here.
With a few lines of Python code, users can export their data into a Jupyter notebook for further analysis.