Embeddings FAQ
Last updated
Copyright © 2023 Arize AI, Inc
There are many ways to extract embeddings, depending on your use case and your model. You can either generate your own embeddings or let Arize generate them for you.
Examples can be found here:
The UMAP visualization requires each datapoint's prediction ID to be unique. If multiple predictions are sent with the same prediction ID, the UMAP visualization cannot fetch all of the columns (features, tags, etc.) of that datapoint. Because some fields of the datapoint cannot be fetched, some color-by options will be restricted.
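One way to catch this before sending data is to check for repeated prediction IDs on the client side. The following is a minimal sketch in plain Python; the function name and the sample IDs are hypothetical and not part of any Arize SDK.

```python
from collections import Counter


def find_duplicate_prediction_ids(prediction_ids):
    """Return a dict of prediction IDs that appear more than once, with their counts."""
    counts = Counter(prediction_ids)
    return {pid: n for pid, n in counts.items() if n > 1}


# Hypothetical IDs, for illustration only.
ids = ["a1", "a2", "a2", "a3", "a1", "a1"]
duplicates = find_duplicate_prediction_ids(ids)
if duplicates:
    print("Duplicate prediction IDs found:", duplicates)
```

An empty result means every ID is unique, which is the condition the UMAP visualization requires.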
Navigate here for step-by-step instructions to enable access to view private AWS S3 image links.
Different embedding features can have vectors with different dimensionality. However, Arize currently only supports one vector dimensionality per embedding feature. We are working to support multiple dimensionalities within the same embedding feature.
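A quick local check can confirm that all vectors belonging to one embedding feature share a single dimensionality. This is a sketch under the assumption that vectors are held as Python sequences; the helper names are hypothetical.

```python
def embedding_dimensions(vectors):
    """Return the set of distinct vector lengths found for one embedding feature."""
    return {len(v) for v in vectors}


def has_single_dimensionality(vectors):
    """True if every vector in the feature has the same length (or there are none)."""
    return len(embedding_dimensions(vectors)) <= 1


# Hypothetical vectors: the third one has a different length.
text_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8]]
if not has_single_dimensionality(text_vectors):
    print("Mixed dimensionalities:", sorted(embedding_dimensions(text_vectors)))
```

Vectors of different lengths would need to go into separate embedding features.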
The vector attribute of Arize's embedding object must be an array of floats. Strings are not allowed.
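Vectors often turn into strings when a dataset is round-tripped through CSV. The sketch below converts such values back into a list of floats before they are used as an embedding vector. It assumes the string form is a JSON-style array such as "[0.1, 0.2]"; the function name is hypothetical.

```python
import json


def to_float_vector(raw):
    """Convert a vector given as a sequence or a JSON-style string into a list of floats."""
    if isinstance(raw, str):
        # Assumption: the string is a JSON array, e.g. "[0.1, 0.2, 0.3]".
        raw = json.loads(raw)
    # float() also converts ints and numeric strings such as "1".
    return [float(x) for x in raw]


print(to_float_vector("[0.1, 0.2, 0.3]"))
print(to_float_vector(["1", 2, 3.5]))
```

Values that cannot be parsed as numbers raise a ValueError, which surfaces bad rows before ingestion.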
Should embedding columns be listed in feature_column_names, or just in embedding_feature_column_names?
Regular features and embedding features are ingested into Arize in two different lists of column names. In short, embedding column names should not be included in feature_column_names. Check out our resources to learn more.
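To illustrate the separation, the sketch below splits a set of input columns into the two groups so that no embedding column ends up in feature_column_names. It uses plain Python lists only; the helper and the column names are hypothetical, and the actual SDK may expect a richer structure than a flat list for embedding columns.

```python
def split_column_names(all_input_columns, embedding_columns):
    """Split input columns into regular feature columns and embedding columns."""
    embedding_set = set(embedding_columns)
    feature_column_names = [c for c in all_input_columns if c not in embedding_set]
    embedding_feature_column_names = [c for c in all_input_columns if c in embedding_set]
    return feature_column_names, embedding_feature_column_names


# Hypothetical column names, for illustration only.
columns = ["age", "country", "text_vector", "image_vector"]
features, embeddings = split_column_names(columns, ["text_vector", "image_vector"])
print(features)
print(embeddings)
```

The two resulting lists are disjoint by construction, which matches the rule stated above.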
In our testing, Euclidean distance identified movement of embeddings across many use cases. Support for more metrics, e.g., cosine similarity, will be added as the ecosystem develops. Learn more on monitoring embedding drift here.
See our glossary page on Euclidean distance.
Inside the Arize platform, Euclidean distance is calculated using the original embeddings, not the UMAP projections. For visualization purposes, we take a sample from those embeddings and use UMAP to project them into a 2D or 3D space.
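As a rough illustration of a Euclidean distance computed on the original, full-dimensional vectors, the sketch below measures the distance between the centroids (mean vectors) of a baseline set and a production set. Summarizing drift through centroids is an assumption made for this example; the text above does not state the exact aggregation Arize uses. The function names and data are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def centroid_euclidean_distance(baseline, production):
    """Euclidean distance between the centroids of two sets of embeddings."""
    # math.dist requires Python 3.8 or later.
    return math.dist(centroid(baseline), centroid(production))


# Hypothetical 2-dimensional embeddings; real embeddings have many more dimensions.
baseline = [[0.0, 0.0], [2.0, 0.0]]
production = [[4.0, 4.0], [4.0, 4.0]]
print(centroid_euclidean_distance(baseline, production))
```

The calculation never touches a 2D or 3D projection, so it is unaffected by how the points are later sampled and laid out for visualization.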
Any use case where embeddings are available or can be extracted. A few examples are computer vision, natural language processing, deep learning, and hierarchical embedding use cases.
You can set a Euclidean distance monitor using the UI or through our Monitors API. Once you create a monitor and select the embedding feature of interest, Arize tracks and monitors your embeddings for drift.
UMAP was chosen to visualize and understand large, high-dimensional datasets because it maintains both local and global structure and scales better than other dimension reduction techniques (learn more here). See here to learn more about using UMAP in Arize.
Arize has the fortune to count Dr. Leland McInnes (one of the creators of UMAP) from the Tutte Institute for Mathematics and Computing as an advisor. He continues to help us develop capabilities in the space.