7. Troubleshoot Embedding Data
Use embeddings to identify and troubleshoot issues with your unstructured data models.
Arize will automatically detect when you have sent image or text embedding data. Embeddings will appear in the Model Overview tab, under the Model Health section.
Embedding feature on the model's Overview page, in the Model Health section
Click on the embedding Name or Drift value to start troubleshooting.
Arize uses the raw embedding vectors it receives to track the drift of your unstructured input data. Once you set up a baseline, Arize tracks how far your group of vectors is from the baseline group of vectors over time. Arize measures the Euclidean distance between groups of embedding vectors.
In the example below, the plot at the top of the page shows a week in which this distance increases, signaling drift in the input data that requires further investigation. In other words, during that week the text data sent to the model differed from the text data it was trained on.
Drift in dataset
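The drift value described above can be approximated offline. The sketch below assumes, purely for illustration, that the distance between two groups of embedding vectors is the Euclidean distance between their centroids (element-wise means). The text does not say how Arize aggregates each group, and the function names `centroid` and `euclidean_drift` are hypothetical.

```python
import math

def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def euclidean_drift(baseline, production):
    """Euclidean distance between the centroids of two embedding groups.

    Assumption: each group is summarized by its centroid before the
    distance is taken. This is one common choice, not a confirmed
    description of Arize's computation.
    """
    return math.dist(centroid(baseline), centroid(production))

# Toy 2-dimensional embeddings (real embeddings have many more dimensions).
baseline = [[0.0, 0.0], [2.0, 0.0]]   # centroid (1.0, 0.0)
similar = [[1.0, 0.0], [1.0, 0.0]]    # centroid (1.0, 0.0)
shifted = [[4.0, 4.0], [4.0, 4.0]]    # centroid (4.0, 4.0)

print(euclidean_drift(baseline, similar))  # 0.0
print(euclidean_drift(baseline, shifted))  # 5.0
```

Computing this value for each time window and comparing it with earlier windows gives a curve like the one in the drift plot: a window whose value rises well above the others is a candidate for closer inspection.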
Users can configure their UMAP view with the following settings:
- Dimensions: choose between a 2D or a 3D plot.
- Projection: choose between a full UMAP plot or the quicker Spectral Initialization (read more about Spectral Initialization below).
- nNeighbors: controls how UMAP balances local versus global structure in the data by setting how many neighbors UMAP looks at to define a local region. Lower values put more focus on the local structure of the dataset; higher values put more focus on the global structure. Allowed values range up to 100.
- Sample size: the number of points in the UMAP plot per dataset. For example, if you select 500, the plot will contain 1000 points in total: 500 from the baseline dataset and 500 from the primary dataset. Allowed values range from 300 to 2500.
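The per-dataset meaning of the sample size can be sketched as follows. This is a minimal illustration rather than Arize's implementation: `sample_for_plot` is a hypothetical helper, and capping the draw at the dataset size when a dataset has fewer points than requested is an assumption.

```python
import random

MIN_SAMPLE, MAX_SAMPLE = 300, 2500  # allowed range stated in the text

def sample_for_plot(baseline, primary, sample_size, seed=0):
    """Draw up to `sample_size` points from each dataset (hypothetical helper)."""
    if not MIN_SAMPLE <= sample_size <= MAX_SAMPLE:
        raise ValueError(
            f"sample_size must be between {MIN_SAMPLE} and {MAX_SAMPLE}"
        )
    rng = random.Random(seed)

    def take(points):
        # Assumption: a dataset smaller than sample_size is used in full.
        return rng.sample(points, min(sample_size, len(points)))

    return take(baseline) + take(primary)

# Placeholder points tagged with the dataset they came from.
baseline = [("baseline", i) for i in range(5000)]
primary = [("primary", i) for i in range(5000)]

plotted = sample_for_plot(baseline, primary, sample_size=500)
print(len(plotted))  # 1000
```

Sampling the same number of points from each dataset keeps the two groups visually comparable in the plot, even when one dataset is much larger than the other.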
In addition to the above settings, users can select between coloring options for the UMAP plot. The currently supported coloring options are:
- By Dataset: points will be colored based on whether they belong to the baseline or primary dataset.
- By Prediction Label: points will be colored according to the prediction label obtained from the model.
- By Actual Label: points will be colored according to the actual (ground-truth) label.
- By Correctness: points will be colored based on whether or not the prediction was correct (i.e., does the prediction label match the actual label).
- By Confusion Matrix: after selecting a positive class, points will be colored by their confusion matrix value (true positive, true negative, false positive, or false negative).
- By Feature: points will be colored by the value of a chosen feature, which helps identify patterns or insights in slices of data.
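The Correctness and Confusion Matrix options above both derive a category from each point's prediction label and actual label. The sketch below shows one way those categories could be computed; the function names and the `"spam"`/`"ham"` labels are hypothetical, and the code is not taken from Arize.

```python
def correctness(prediction, actual):
    """Category used when coloring by correctness."""
    return "correct" if prediction == actual else "incorrect"

def confusion_value(prediction, actual, positive_class):
    """Category used when coloring by confusion matrix for a chosen positive class."""
    predicted_positive = prediction == positive_class
    actually_positive = actual == positive_class
    if predicted_positive and actually_positive:
        return "true positive"
    if predicted_positive:
        return "false positive"
    if actually_positive:
        return "false negative"
    return "true negative"

# (prediction label, actual label) pairs for four example points.
points = [("spam", "spam"), ("spam", "ham"), ("ham", "spam"), ("ham", "ham")]

for prediction, actual in points:
    print(
        prediction,
        actual,
        correctness(prediction, actual),
        confusion_value(prediction, actual, positive_class="spam"),
    )
```

With more than two classes the two colorings can disagree: a point predicted as one negative class whose actual label is a different negative class is an incorrect prediction, yet it counts as a true negative with respect to the selected positive class.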
For further investigation, in addition to the drift tracking on top, users can also leverage the 2D and 3D UMAP visualization of their data to identify new or emerging patterns and resolve the issue.
The UMAP visualizes the point in time selected in the drift visualization above.
To begin troubleshooting, follow these steps.
- Select a point in time when the drift was low and select a UMAP visualization in 2D. As seen in the image below, the training and production data are superimposed, which indicates that the model is seeing data in production similar to the data it was trained on.
UMAP in 2D, with low drift
- Select a point in time when the drift was high and select a UMAP visualization in 2D. In this instance, the training and production data are superimposed for the most part, but another cluster of production data has appeared. This indicates that the model is seeing data in production that is qualitatively different from the data it was trained on and, in this case, is causing performance degradation.
New cluster seen in production during higher drift time period
- For further inspection, select a 3D UMAP view and click Explore UMAP to expand the view. With this view, we can interact in 3D with our dataset. Users can zoom, rotate, and drag to see the areas of the dataset that are most interesting.
Generate 3D UMAP and click Explore
Click on any embedding to see the prediction, the actual label or score, and the raw data associated with it for additional troubleshooting. You can view more details about a particular event by clicking view details.
Prediction Details view
To select more than one embedding, select the lasso feature, and draw a circle around the group of embeddings.
Choose to view the embeddings by different coloring options to better understand your data. Select between:
- Color by Dataset
- Color by Prediction Label
- Color by Actual Label
- Color by Correctness (correct vs. incorrect predictions)
- Color by Confusion Matrix
- Color by Feature
Check out the full workflow below.
Learn more about embeddings and troubleshooting with Arize: