Embeddings FAQ

Ingestion

How do you generate the embeddings?

There are many ways to extract embeddings depending on your use-case and your model. You can either generate your own embeddings or let Arize generate them for you.

How do you send the embeddings to Arize?

Examples can be found here:

Image Classification
Natural Language Processing (NLP)

The UMAP view says "Duplicate Prediction IDs found in Dataset." What does that mean?

The UMAP visualization requires each datapoint's prediction ID to be unique. If multiple predictions are sent with the same prediction ID, the UMAP visualization cannot fetch all of the columns (features, tags, etc.) of that datapoint. Because certain fields of the datapoint cannot be fetched, some color-by options will be restricted.
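Because the UMAP view depends on unique prediction IDs, it can help to check for repeats before sending data. The following is a minimal sketch using only the Python standard library; the function name `find_duplicate_prediction_ids` and the sample IDs are hypothetical and not part of any Arize API.

```python
from collections import Counter


def find_duplicate_prediction_ids(prediction_ids):
    """Return a dict mapping each repeated prediction ID to how often it occurs."""
    counts = Counter(prediction_ids)
    return {pid: n for pid, n in counts.items() if n > 1}


# Hypothetical IDs, for illustration only.
ids = ["a1", "b2", "a1", "c3", "b2", "a1"]
print(find_duplicate_prediction_ids(ids))  # → {'a1': 3, 'b2': 2}
```

An empty result means every prediction ID in the batch is unique.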

How do I view my private AWS S3 images in Arize?

Navigate here for step-by-step instructions to enable access to view private AWS S3 image links.

Do all vectors from the same embedding feature need to have the same dimensionality?

Different embedding features can have vectors with different dimensionality. However, Arize currently only supports one vector dimensionality per embedding feature. We are working to support multiple dimensionalities within the same embedding feature.
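Since each embedding feature supports a single vector dimensionality, a quick pre-ingestion check can catch mixed lengths early. This is only a sketch under that constraint; `check_uniform_dimensionality` is a hypothetical helper, not an Arize function.

```python
def check_uniform_dimensionality(vectors):
    """Return the shared vector length, or raise ValueError if lengths differ.

    `vectors` holds all vectors belonging to ONE embedding feature.
    Returns None when no vectors are given.
    """
    dims = {len(v) for v in vectors}
    if len(dims) > 1:
        raise ValueError(
            f"Mixed dimensionalities in one embedding feature: {sorted(dims)}"
        )
    return dims.pop() if dims else None


# Hypothetical 3-dimensional vectors for a single embedding feature.
print(check_uniform_dimensionality([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]))  # → 3
```

Run the check once per embedding feature; different features may legitimately return different lengths.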

Should the embedding vector contain numeric values? Are strings allowed?

The vector attribute of Arize's embedding object must be an array of floats. Strings are not allowed.
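As an illustration of the float-only requirement, the sketch below converts numeric values to floats and rejects anything else, including numeric-looking strings. The helper name `to_float_vector` is hypothetical, and the decision to reject booleans is an assumption made for this example rather than a documented rule.

```python
from numbers import Real


def to_float_vector(values):
    """Return `values` as a list of floats, rejecting non-numeric elements."""
    vector = []
    for i, value in enumerate(values):
        # bool is a subclass of int, so it is rejected explicitly (assumption).
        # Strings such as "0.5" are not Real and are rejected as well.
        if isinstance(value, bool) or not isinstance(value, Real):
            raise TypeError(
                f"Element {i} is {type(value).__name__}; expected a number"
            )
        vector.append(float(value))
    return vector


print(to_float_vector([1, 2.5, -3]))  # → [1.0, 2.5, -3.0]
```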

Should the embedding column name be included in the `feature_column_names`, or just the `embedding_feature_column_names`?

Regular features and embedding features are ingested into Arize through two different lists of column names. In short, embedding column names should not be included in `feature_column_names`. Check out our resources to learn more.
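The separation between the two lists can be sanity-checked before building a schema. The sketch below uses plain Python lists rather than the Arize SDK; `overlapping_columns` and every column name shown are hypothetical, and how embedding columns are actually declared in the SDK is not covered here.

```python
def overlapping_columns(feature_column_names, embedding_column_names):
    """Return the sorted column names that appear in both lists."""
    return sorted(set(feature_column_names) & set(embedding_column_names))


# Hypothetical column names, for illustration only.
feature_column_names = ["age", "state", "text_vector"]
embedding_column_names = ["text_vector", "image_vector"]

# "text_vector" is an embedding column and should be dropped from the regular features.
print(overlapping_columns(feature_column_names, embedding_column_names))  # → ['text_vector']
```

An empty list means no embedding column has leaked into the regular feature list.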

Drift

Why Euclidean distance?

In testing, Euclidean distance has identified movements of embeddings across many use cases. Support for more metrics, e.g., cosine similarity, will be added as the ecosystem develops. Learn more on monitoring embedding drift here.

How is Euclidean distance calculated?

See our glossary page on Euclidean distance.
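For quick reference, the standard Euclidean distance between two vectors of equal length is the square root of the sum of squared coordinate differences. The sketch below shows that textbook definition only; it is not taken from the glossary page, and the function name is chosen for illustration.

```python
import math


def euclidean_distance(a, b):
    """Return the Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimensionality")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # → 5.0
```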

Is Euclidean distance calculated using the original embeddings or the UMAP projections?

Within the Arize platform, Euclidean distance is calculated using the original embeddings, not the UMAP projections. For visualization purposes, we take a sample from those embeddings and use UMAP to project them into a 2D or 3D space.
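To make the distinction concrete, the sketch below computes a distance directly on full-dimensional vectors, with no projection step. It assumes, purely for illustration, that two datasets are compared through the distance between their centroids; that comparison method is an assumption of this example and is not stated in this FAQ. The function names and sample data are hypothetical.

```python
import math


def centroid(vectors):
    """Return the element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def centroid_distance(baseline, production):
    """Euclidean distance between dataset centroids (assumed comparison method)."""
    return math.dist(centroid(baseline), centroid(production))


# Hypothetical 3-dimensional embeddings; real embeddings are much larger.
baseline = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]    # centroid: [1.0, 0.0, 0.0]
production = [[1.0, 2.0, 4.0], [1.0, 4.0, 4.0]]  # centroid: [1.0, 3.0, 4.0]
print(centroid_distance(baseline, production))  # → 5.0
```

Computing the distance in the original space avoids any distortion that a 2D or 3D projection could introduce.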

What use-cases can Euclidean distance apply to?

Any use case where embeddings are available or can be extracted. A few examples are computer vision, natural language processing, deep learning, and hierarchical embedding use cases.

How can you monitor drift in embeddings?

You can set a Euclidean distance monitor using the UI or through our Monitors API. Once you create a monitor and select the embedding feature of interest, Arize tracks and monitors your embeddings for drift.

Visualization

Why UMAP?

UMAP was chosen for visualizing and understanding large, high-dimensional datasets because it preserves both local and global structure and scales better than other dimensionality reduction techniques (learn more here). See here to learn more about using UMAP in Arize.

Arize is fortunate to count Dr. Leland McInnes (one of the creators of UMAP) from the Tutte Institute for Mathematics and Computing as an advisor. He continues to help us develop capabilities in this space.

