Is Euclidean distance calculated using the original embeddings or the UMAP projections?
Within the Arize platform, Euclidean distance is calculated on the original embeddings that were sent, not on the UMAP projections. UMAP is used only for visualization, projecting the embeddings into a 2D or 3D space.
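To make the distinction concrete, here is a minimal sketch of a Euclidean distance computed directly on full-dimensional embedding vectors, with no projection step. It assumes the comparison is made between the centroids (element-wise means) of a baseline set and a production set. That choice, along with the hypothetical helper names centroid and euclidean_drift, is an illustration only and is not a description of the platform's internals.

```python
import math

def centroid(vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean_drift(baseline, production):
    """Euclidean distance between the centroids of two embedding sets.

    Assumption for illustration: drift is summarized as the distance
    between group centroids in the original embedding space.
    """
    return math.dist(centroid(baseline), centroid(production))

# Toy 4-dimensional embeddings (real embeddings have far more dimensions).
baseline = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
production = [[4.0, 4.0, 0.0, 0.0], [4.0, 4.0, 0.0, 0.0]]

drift = euclidean_drift(baseline, production)
print(drift)  # 5.0
```

Because the distance is taken in the original space, it does not depend on how UMAP happens to lay the points out in 2D or 3D.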
Why Euclidean distance?
In testing, Euclidean distance identified movements of embeddings across many use cases. Support for additional metrics, such as cosine similarity, will be added as the ecosystem develops.
What use cases can this apply to?
Any use case where embeddings are available or can be extracted. Examples include computer vision, natural language processing, deep learning, and hierarchical embedding use cases.
UMAP was chosen to visualize and understand large, high-dimensional datasets because it preserves global structure and scales better than other dimensionality reduction techniques such as SNE and t-SNE.
Dr. Leland McInnes, the creator of UMAP and a researcher at the Tutte Institute for Mathematics and Computing, is an advisor at Arize and is helping to develop capabilities in this space.
How do you generate the embeddings?
There are many ways to extract embeddings. Some of these methods include:
Transforming unstructured data into embeddings using pre-trained embedding models such as BERT or word2vec
Using simple linear dimensionality reduction models such as SVD or PCA to create embeddings
Extracting the activations of the last few layers of a neural network (NN or DNN) model as embeddings
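As one illustration of the linear dimensionality reduction option listed above, the following sketch derives low-dimensional embeddings from a feature matrix using NumPy's SVD, which is equivalent to PCA once the data is mean-centered. The function name svd_embeddings, the random input data, and the target dimension of 4 are all hypothetical choices made for the example.

```python
import numpy as np

def svd_embeddings(features, dim):
    """Project the rows of `features` onto their top `dim` principal directions."""
    X = np.asarray(features, dtype=float)
    # Center each column so the SVD directions match the PCA components.
    X_centered = X - X.mean(axis=0)
    # Thin SVD: X_centered = U @ diag(S) @ Vt, singular values in descending order.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Coordinates of each row along the first `dim` right singular vectors.
    return X_centered @ Vt[:dim].T

# Hypothetical input: 100 samples with 16 raw features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 16))

embeddings = svd_embeddings(features, dim=4)
print(embeddings.shape)  # (100, 4)
```

Each row of embeddings is a 4-dimensional vector that could be sent alongside the corresponding prediction. Pre-trained models or neural network layer activations would replace the SVD step for unstructured data such as text or images.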
You can set a Euclidean distance monitor through the UI or through our Monitors API. Creating a monitor and selecting the embedding feature of interest lets you track and monitor your embeddings for drift.