Arize AI
7. Troubleshoot Embedding Data
Use embeddings to identify & troubleshoot issues with your unstructured data models.
Arize will automatically detect when you have sent image or text embedding data. Embeddings will appear in the Model Overview tab, under the Model Health section.
Embedding feature on the model's Overview page, in the Model Health section
Click on the embedding Name or Drift value to start troubleshooting.

Embedding Drift Tracking

Arize uses the raw embedding vectors it receives to track the drift of your unstructured input data. Once you set up a baseline, Arize tracks how far your group of vectors is from the baseline group of vectors over time. Drift is measured as the Euclidean distance between the two groups of embedding vectors.
Learn more about Euclidean Distance or visit our Embeddings FAQ
In the example below, the plot at the top of the page shows a week in which this distance increases, signaling drift in the input dataset that requires further investigation. In other words, during that week the text data sent to our model was different from the text data it was trained on.
Drift in dataset
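To make the drift measurement concrete, here is a minimal sketch. It assumes that the distance between two groups of vectors is taken between their centroids (element-wise means); the text above only states that Euclidean distance between groups is used, so treat this reduction as an illustrative assumption. The function names `centroid` and `embedding_drift` are hypothetical and are not part of any Arize API.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Return the element-wise mean of a non-empty group of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embedding_drift(baseline: Sequence[Vector], production: Sequence[Vector]) -> float:
    """Euclidean distance between the centroids of two groups (assumed reduction)."""
    return math.dist(centroid(baseline), centroid(production))


# Toy 2-dimensional embeddings; real embeddings have many more dimensions.
baseline = [[0.0, 0.0], [2.0, 0.0]]      # centroid is (1, 0)
production = [[3.0, 4.0], [5.0, 4.0]]    # centroid is (4, 4)
drift = embedding_drift(baseline, production)
print(drift)
```

Computing this value once per time window (for example, per day of production data against a fixed baseline) yields a drift-over-time curve of the kind shown in the plot.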

Embedding Drift Monitors

You can also set up drift monitors to track embedding drift. This allows you to automate drift tracking and receive alerts when your embeddings have drifted. See Drift Monitors for more details.

Embedding Troubleshooting with UMAP

UMAP View Settings

Users can configure their UMAP view with the following settings:
  • Dimensions: choose between a 2D or a 3D plot.
  • Projection: choose between a full UMAP plot or the quicker Spectral Initialization (read more about Spectral Initialization below).
  • nNeighbors: controls how UMAP balances local versus global structure in the data by setting how many neighboring points UMAP considers when defining a local region. Lower values put more focus on the local structure of the dataset; higher values put more focus on its global structure. Allowed values range from 5 to 100. Learn more here.
  • minDist: sets the minimum distance apart that points are allowed to be in the low-dimensional representation. Allowed values range from 0.0 to 0.99. Learn more here.
UMAP configuration panel
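The settings and their allowed ranges can be summarized in a small validation sketch. The class name `UmapViewSettings`, its field names, the projection strings, and the default values are hypothetical and chosen for illustration; only the ranges (2D or 3D, nNeighbors from 5 to 100, minDist from 0.0 to 0.99) come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UmapViewSettings:
    """Hypothetical container for the UMAP view settings described above."""

    dimensions: int = 2        # 2D or 3D plot
    projection: str = "umap"   # "umap" (full plot) or "spectral" (spectral initialization)
    n_neighbors: int = 15      # illustrative default, not necessarily Arize's
    min_dist: float = 0.1      # illustrative default, not necessarily Arize's

    def __post_init__(self) -> None:
        if self.dimensions not in (2, 3):
            raise ValueError("dimensions must be 2 or 3")
        if self.projection not in ("umap", "spectral"):
            raise ValueError("projection must be 'umap' or 'spectral'")
        if not 5 <= self.n_neighbors <= 100:
            raise ValueError("n_neighbors must be between 5 and 100")
        if not 0.0 <= self.min_dist <= 0.99:
            raise ValueError("min_dist must be between 0.0 and 0.99")


# A 3D view that emphasizes local structure and allows tightly packed points.
local_view = UmapViewSettings(dimensions=3, n_neighbors=5, min_dist=0.0)
print(local_view)
```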
Spectral Initialization is the initialization method used by the UMAP algorithm. It is relatively quick because it is computed with linear algebra operations, and it provides a good starting point for the stochastic gradient descent phase of the UMAP algorithm. The spectral initialization plot is therefore faster to obtain than a full UMAP plot, but it lacks the structural information obtained during the training phase.
In addition to the above settings, users can select between coloring options for the UMAP plot. The currently supported coloring options are:
  • By Dataset: points will be colored based on whether they belong to the baseline or primary dataset.
  • By Prediction Label: points will be colored according to the prediction label obtained from the model.
  • By Actual Label: points will be colored according to the actual (ground truth) label recorded for each prediction.
  • By Correctness: points will be colored based on whether or not the prediction was correct (i.e., does the prediction label match the actual label).
  • By Confusion Matrix: after selecting a positive class, points will be colored by their confusion matrix value (true positive, true negative, false positive, or false negative).
  • By Feature: points will be colored by the value of a feature you choose, which helps identify patterns or insights in slices of data.
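The Correctness and Confusion Matrix options in the list above reduce to simple comparisons between the prediction label, the actual label, and (for the confusion matrix) the selected positive class. The sketch below shows that logic; the function names and the example labels are hypothetical and do not reflect how Arize implements the coloring internally.

```python
def correctness(prediction: str, actual: str) -> str:
    """Color category for the 'By Correctness' option."""
    return "correct" if prediction == actual else "incorrect"


def confusion_category(prediction: str, actual: str, positive_class: str) -> str:
    """Color category for the 'By Confusion Matrix' option, treating every
    label other than positive_class as negative (one-vs-rest)."""
    predicted_positive = prediction == positive_class
    actually_positive = actual == positive_class
    if predicted_positive and actually_positive:
        return "true positive"
    if predicted_positive:
        return "false positive"
    if actually_positive:
        return "false negative"
    return "true negative"


# (prediction label, actual label) pairs for a hypothetical sentiment model.
records = [
    ("positive", "positive"),
    ("positive", "negative"),
    ("negative", "positive"),
    ("negative", "neutral"),
]
for prediction, actual in records:
    print(correctness(prediction, actual),
          "/", confusion_category(prediction, actual, positive_class="positive"))
```

Note that with more than two classes the two colorings can disagree: the last record is an incorrect prediction, yet it counts as a true negative with respect to the positive class because neither label equals it.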

Troubleshooting Workflow Example

In addition to the drift tracking plot at the top of the page, users can leverage the 2D and 3D UMAP visualizations of their data to investigate further, identify new or emerging patterns, and resolve the issue.
The UMAP plot visualizes the data for the point in time selected in the drift plot above.
To begin troubleshooting, follow these steps.
  • Select a point in time when the drift was low and select a UMAP visualization in 2D. As seen in the image below, the training and production data are superimposed, which indicates that the model is seeing data in production similar to the data it was trained on.
UMAP in 2D, with low drift
  • Select a point in time when the drift was high and select a UMAP visualization in 2D. In this instance, the training and production data are superimposed for the most part, but another cluster of production data has appeared. This indicates that the model is seeing data in production that is qualitatively different from the data it was trained on, which in this case is causing performance degradation.
New cluster seen in production during higher drift time period
  • For further inspection, select a 3D UMAP view and click Explore UMAP to expand the view. With this view, we can interact in 3D with our dataset. Users can zoom, rotate, and drag to see the areas of the dataset that are most interesting.
Generate 3D UMAP and click Explore
Click on any embedding to see the prediction and actual label / score and the raw data associated with it for additional troubleshooting.
To select more than one embedding, select the lasso feature, and draw a circle around the group of embeddings.
Choose to view the embeddings by different coloring options to better understand your data. Select between:
  • Color by Dataset
  • Color by Prediction Label
  • Color by Actual Label
  • Color by Correctness (correct vs. incorrect predictions)
  • Color by Confusion Matrix
Check out the full workflow below.
Troubleshoot and explore your embedding data
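The new production cluster described in the high-drift step of the workflow can also be approximated numerically. The sketch below flags production points whose nearest training (baseline) point is farther away than a chosen threshold. This is an illustrative heuristic, not the method Arize uses; the function name `far_from_baseline`, the threshold, and the toy coordinates are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def far_from_baseline(
    baseline: Sequence[Vector],
    production: Sequence[Vector],
    threshold: float,
) -> List[int]:
    """Return indices of production points whose nearest baseline point
    is farther than `threshold` in Euclidean distance."""
    if not baseline:
        raise ValueError("baseline must contain at least one point")
    flagged = []
    for index, point in enumerate(production):
        nearest = min(math.dist(point, reference) for reference in baseline)
        if nearest > threshold:
            flagged.append(index)
    return flagged


# Toy 2D points: the last two production points form a separate group.
baseline = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
production = [[0.5, 0.5], [0.9, 0.1], [8.0, 8.0], [8.5, 7.5]]
print(far_from_baseline(baseline, production, threshold=2.0))
```

The flagged indices are natural candidates to inspect first: in the UI, these would correspond to the points you select with the lasso to view their raw data and labels.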

Additional Resources

Check out our colabs here for a tutorial on how to send embeddings to Arize and begin troubleshooting.
Learn more about embeddings and troubleshooting with Arize:
Questions? Email us at [email protected] or Slack us in the #arize-support channel