Arize AI
7. Troubleshoot Embedding Data
Use embeddings to identify & troubleshoot issues with your unstructured data models.
Arize will automatically detect when you have sent embedding data. Embeddings will appear in the Model Overview tab, under the Model Health section.
Embedding feature on the model's Overview page, in the Model Health section
Click on the embedding Name or Drift value to start troubleshooting.

Embedding Drift Tracking

Arize uses the raw embedding vectors it receives to track drift in your unstructured input data. Once you set up a baseline, Arize tracks how far your group of vectors is from the baseline group of vectors over time. The distance is measured as the Euclidean distance between the two groups of embedding vectors.
In the example below, the plot at the top of the page shows a week in which this distance increases, signaling drift in the input data that requires further investigation. In other words, during that week the text data sent to the model differed from the text data it was trained on.
Drift in dataset
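As a rough illustration of the idea, the sketch below computes a Euclidean drift value per week and picks the week that drifted most. The text above only says that the distance is taken between groups of vectors; comparing the centroid (average vector) of each group is an assumption made here for illustration, and the function names and toy data are hypothetical rather than part of the Arize API.

```python
import math


def centroid(vectors):
    """Average a group of equal-length embedding vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def centroid_drift(baseline, production):
    """Euclidean distance between the centroids of two groups of vectors.

    Assumption: group-to-group distance is taken between centroids.
    """
    return math.dist(centroid(baseline), centroid(production))


# Toy 2-dimensional "embeddings"; real embeddings have many more dimensions.
baseline = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]  # centroid (1, 1)
weekly_production = {
    "week 1": [[1.0, 1.0], [1.0, 1.2], [1.2, 1.0]],  # close to the baseline
    "week 2": [[4.0, 5.0], [4.0, 5.0]],              # centroid (4, 5)
}

# One drift value per week, as in a drift-over-time plot.
drift_by_week = {
    week: centroid_drift(baseline, vectors)
    for week, vectors in weekly_production.items()
}
most_drifted = max(drift_by_week, key=drift_by_week.get)

for week, value in drift_by_week.items():
    print(f"{week}: drift = {value:.3f}")
print("Investigate first:", most_drifted)
```

A week whose value stands out from the others, like "week 2" here, is the kind of period worth selecting for closer inspection.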

Embedding Troubleshooting with UMAP

UMAP View Settings

Users can configure the UMAP view with the following settings:
  • Dimensions: choose between a 2D or a 3D plot.
  • Projection: choose between a full UMAP plot or the quicker Spectral Initialization (read more about Spectral Initialization below).
  • nNeighbors: controls how UMAP balances local versus global structure in the data by setting how many neighbors UMAP looks at to define a local region. Lower values put more focus on the local structure of the dataset; higher values put more focus on its global structure. Allowed values range from 5 to 100. Learn more here.
  • minDist: sets the minimum distance allowed between points in the low-dimensional representation. Allowed values range from 0.0 to 0.99. Learn more here.
UMAP configuration panel
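The settings above can be sketched as a small validated configuration object. This is a hypothetical helper, not part of Arize: the class name, field names, and projection labels are invented for illustration, and the defaults of 15 and 0.1 are those of the open-source umap-learn package, not necessarily the defaults used in the Arize UI. Only the allowed ranges come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UMAPViewSettings:
    """Hypothetical container mirroring the UMAP view settings."""

    dimensions: int = 2        # 2D or 3D plot
    projection: str = "umap"   # "umap" (full plot) or "spectral" (hypothetical labels)
    n_neighbors: int = 15      # allowed range: 5 to 100
    min_dist: float = 0.1      # allowed range: 0.0 to 0.99

    def __post_init__(self):
        if self.dimensions not in (2, 3):
            raise ValueError("dimensions must be 2 or 3")
        if self.projection not in ("umap", "spectral"):
            raise ValueError("projection must be 'umap' or 'spectral'")
        if not 5 <= self.n_neighbors <= 100:
            raise ValueError("n_neighbors must be between 5 and 100")
        if not 0.0 <= self.min_dist <= 0.99:
            raise ValueError("min_dist must be between 0.0 and 0.99")

    def to_umap_learn_kwargs(self):
        """Map the settings to the keyword names used by umap-learn's UMAP class."""
        return {
            "n_components": self.dimensions,
            "n_neighbors": self.n_neighbors,
            "min_dist": self.min_dist,
        }


# A 3D view that leans toward global structure and allows tighter clusters.
settings = UMAPViewSettings(dimensions=3, n_neighbors=50, min_dist=0.05)
print(settings.to_umap_learn_kwargs())
```

If umap-learn is installed, the resulting dictionary could be passed as `umap.UMAP(**settings.to_umap_learn_kwargs())` to reproduce a similar projection locally.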
In addition to the above settings, users can select between coloring options for the UMAP plot. The currently supported coloring options are:
  • By Dataset: points will be colored according to whether they belong to the baseline or production dataset.
  • By Prediction Label: points will be colored according to the prediction label obtained from the model.
Spectral Initialization is the initialization method used by the UMAP algorithm. It is relatively quick because it is computed with linear algebra operations, and it provides a good starting point for the stochastic gradient descent phase of the UMAP algorithm. The spectral initialization plot is faster to obtain than a full UMAP plot, although it lacks the structural information obtained during the training phase.
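To make the "linear algebra operations" concrete, here is a minimal sketch of a spectral layout using NumPy. It is a simplification under stated assumptions: it builds a plain unweighted k-nearest-neighbor graph, whereas UMAP itself uses a weighted graph, and the function name and parameters are hypothetical. The coordinates are the eigenvectors of the normalized graph Laplacian that follow the first, trivial one.

```python
import numpy as np


def spectral_layout(points, n_neighbors=5, dim=2):
    """Low-dimensional starting coordinates from a k-nearest-neighbor graph.

    Simplified sketch: unweighted, symmetrized kNN graph (an assumption).
    Requires more points than n_neighbors.
    """
    X = np.asarray(points, dtype=float)
    n = X.shape[0]

    # Pairwise squared Euclidean distances; the diagonal is set to
    # infinity so a point is never selected as its own neighbor.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq_dist, np.inf)

    # Adjacency matrix: connect each point to its n_neighbors nearest points.
    nearest = np.argsort(sq_dist, axis=1)[:, :n_neighbors]
    adjacency = np.zeros((n, n))
    rows = np.repeat(np.arange(n), n_neighbors)
    adjacency[rows, nearest.ravel()] = 1.0
    adjacency = np.maximum(adjacency, adjacency.T)  # make the graph undirected

    # Normalized graph Laplacian: L = I - D^(-1/2) A D^(-1/2).
    degree = adjacency.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(n) - d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; skip the first eigenvector
    # and keep the next `dim` as coordinates.
    _, eigenvectors = np.linalg.eigh(laplacian)
    return eigenvectors[:, 1:dim + 1]


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(30, 8))  # 30 toy embedding vectors of length 8
layout = spectral_layout(embeddings, n_neighbors=5, dim=2)
print(layout.shape)  # (30, 2)
```

No iterative optimization happens here, which is why this step is quick; a full UMAP plot would refine these coordinates further.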

Troubleshooting Workflow Example

For further investigation, in addition to the drift tracking at the top of the page, you can use the 2D and 3D UMAP visualizations of your data to identify new or emerging patterns and resolve the issue.
The UMAP plot visualizes the data for the point in time selected in the drift visualization above.
To begin troubleshooting, follow these steps.
  • Select a point in time when the drift was low and select a UMAP visualization in 2D. The training and production data are superimposed, which indicates that the model is seeing data in production similar to the data it was trained on.
UMAP in 2D, with low drift
  • Select a point in time when the drift was high and select a UMAP visualization in 2D. The training and production data are superimposed for the most part, but another cluster of production data has appeared. This indicates that the model is seeing data in production that is qualitatively different from the data it was trained on, which in this case is causing performance degradation.
New cluster seen in production during higher drift time period
  • For further inspection, select a 3D UMAP view and click Explore UMAP to expand the view. In this view you can interact with the dataset in 3D: zoom, rotate, and drag to focus on the areas of the dataset that are most interesting to you.
Troubleshooting in 3D UMAP
You can also click on any embedding to see the prediction and actual label / score and the raw data associated with it for additional troubleshooting.
Check out the full workflow below.
Example: Embedding Drift Tracking

Coming soon

The points in the UMAP plot above are colored to distinguish production data from baseline data (training data in this example). More coloring options are coming soon to help you understand and troubleshoot your dataset, including:
  • Color by actual label
  • Color by feature value
  • Color by accuracy (correct vs incorrect predictions)

Additional Resources

Check out our Colab example for a tutorial on how to send embeddings to Arize.
Learn more about embeddings and troubleshooting with Arize:
Questions? Email us at [email protected] or Slack us in the #arize-support channel