Let Arize Generate Your Embeddings
Only available in arize>=6.0.0
Generating embeddings is likely another problem to solve, on top of ensuring your model is performing properly. With our Python SDK, you can offload that task to Arize and we will generate the embeddings for you. We use large, pre-trained models that capture information from your inputs and encode it into embedding vectors.
We extract the embeddings in the appropriate way for your use case and return them for you to include in your pandas DataFrame, which you then send to Arize.
Auto-Embeddings works end-to-end: you don't have to worry about formatting your inputs for the correct model. Simply pass your input and an embedding comes out; we take care of everything in between.
If you want to use this functionality as part of our Python SDK, you need to install it with the extra dependencies using
pip install arize[AutoEmbeddings]
You can use any model available in the Hugging Face Hub, public or private. If you are using a private model, you will need to authenticate with Hugging Face first.
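For example, one common way to authenticate (a minimal sketch, assuming the huggingface_hub package is installed and you substitute your own Hugging Face access token):
from huggingface_hub import login

# Log in so that private models can be downloaded from the Hugging Face Hub.
# Alternatively, run `huggingface-cli login` in a terminal.
login(token="hf_...")  # replace with your own access token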
If you are using arize<7.3.0, you will have a more restricted list of supported models. You can access it by running:
from arize.pandas.embeddings import EmbeddingGenerator
EmbeddingGenerator.list_pretrained_models()
There are thousands of models available in the Hugging Face Hub. If you find one where our implementation of AutoEmbeddings breaks, please reach out to us at [email protected] or in our community Slack!
We recommend using the same model to generate embeddings as the one generating predictions. However, if you don't have said model in the Hugging Face Hub, choosing a model to generate your embeddings can be a daunting task. The following is a list of models we have experimented with and recommend as a starting point:
Computer Vision
Natural Language Processing
Tabular Embeddings
Task | Family | Model |
---|---|---|
Image Classification | ViT | |
Image Classification | ViT | |
Image Classification | ViT | |
Image Classification | ViT | |
Object Detection | DETR | |
Object Detection | DETR | |
*Note: You can replace the keyword "base" with "large" to use larger models, which achieve better performance at the cost of higher compute time.
Task | Family | Model |
---|---|---|
Sequence Classification | BERT | |
Sequence Classification | BERT | |
Sequence Classification | BERT | |
Summarization | BERT | |
Summarization | BERT | |
Summarization | BERT | |
*Note: You can replace the keyword "base" with "large" to use larger models, which achieve better performance at the cost of higher compute time. You can also replace the word "uncased" with "cased" to use models that are case-sensitive.
Task | Family | Model |
---|---|---|
Tabular Embeddings | BERT | |
Tabular Embeddings | BERT | |
Tabular Embeddings | BERT | |
*Note: You can replace the keyword "base" with "large" to use larger models, which achieve better performance at the cost of higher compute time. You can also replace the word "uncased" with "cased" to use models that are case-sensitive.
Arize AutoEmbeddings comes with default models from the list above. You can find which models are set as default for each use case by running:
from arize.pandas.embeddings import EmbeddingGenerator
EmbeddingGenerator.list_default_models()
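If you omit model_name when creating a generator, it should fall back to the default model for that use case (an assumption worth verifying against your SDK version). A minimal sketch:
from arize.pandas.embeddings import EmbeddingGenerator, UseCases

# Minimal sketch (assumption: omitting `model_name` uses the default model
# reported by EmbeddingGenerator.list_default_models() for this use case).
generator = EmbeddingGenerator.from_use_case(
    use_case=UseCases.NLP.SEQUENCE_CLASSIFICATION
)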
Auto-Embeddings is designed to require minimal code from the user. We only require two steps:
1. Create the generator: instantiate the generator using EmbeddingGenerator.from_use_case(), passing information about your use case, the model to use, and other options that depend on the use case; see the examples below.
2. Let Arize generate your embeddings: obtain your embeddings column by calling generator.generate_embeddings() and passing the column containing your inputs; see the examples below.
Image Classification
Object Detection
from arize.pandas.embeddings import EmbeddingGenerator, UseCases
generator = EmbeddingGenerator.from_use_case(
use_case=UseCases.CV.IMAGE_CLASSIFICATION,
model_name="google/vit-base-patch16-224-in21k",
batch_size=100
)
df["image_vector"] = generator.generate_embeddings(
local_image_path_col=df["local_path"]
)
from arize.pandas.embeddings import EmbeddingGenerator, UseCases
generator = EmbeddingGenerator.from_use_case(
use_case=UseCases.CV.OBJECT_DETECTION,
model_name="facebook/detr-resnet-101",
batch_size=100
)
df["image_vector"] = generator.generate_embeddings(
local_image_path_col=df["local_path"]
)
Sequence Classification
Large Language Models (LLMs)
from arize.pandas.embeddings import EmbeddingGenerator, UseCases
generator = EmbeddingGenerator.from_use_case(
use_case=UseCases.NLP.SEQUENCE_CLASSIFICATION,
model_name="distilbert-base-uncased",
tokenizer_max_length=512,
batch_size=100
)
df["text_vector"] = generator.generate_embeddings(text_col=df["text"])
from arize.pandas.embeddings import EmbeddingGenerator, UseCases
generator = EmbeddingGenerator.from_use_case(
use_case=UseCases.NLP.SUMMARIZATION,
model_name="distilbert-base-uncased",
tokenizer_max_length=512,
batch_size=100
)
df["document_vector"] = generator.generate_embeddings(text_col=df["document"])
df["summary_vector"] = generator.generate_embeddings(text_col=df["summary"])
Arize can generate embeddings for your tabular data as well. This is a useful way to detect and debug multivariate drift. For more information, go to the docs here.
from arize.pandas.embeddings import EmbeddingGenerator, UseCases
# Instantiate the embedding generator
generator = EmbeddingGenerator.from_use_case(
use_case=UseCases.STRUCTURED.TABULAR_EMBEDDINGS,
model_name="distilbert-base-uncased",
tokenizer_max_length=512,
batch_size=100
)
# Select the columns from your dataframe to consider
selected_cols = [...]
# (Optional) Provide a mapping for more verbose column names
column_name_map = {...: ...}
# Generate tabular embeddings and assign them to a new column
df["tabular_embedding_vector"] = generator.generate_embeddings(
df,
selected_columns=selected_cols,
col_name_map=column_name_map # (OPTIONAL, can remove)
)
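For illustration, here are hypothetical values for the two placeholders above; your own column names will differ:
# Hypothetical example values: replace with columns from your own DataFrame.
selected_cols = ["age", "job", "balance", "housing"]
# Hypothetical mapping from raw column names to more descriptive names.
column_name_map = {"job": "job title", "balance": "account balance"}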
Arize expects the DataFrame's index to be sorted and begin at 0. If you perform operations that might affect the index prior to generating embeddings, reset the index as follows:
df = df.reset_index(drop=True)
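Once the embedding columns are in your DataFrame, you can log them to Arize alongside the rest of your data. The snippet below is a minimal sketch using the Arize Python SDK's pandas logger; the space key, API key, model ID, and column names are placeholders, and exact parameter names may vary slightly between SDK versions.
from arize.pandas.logger import Client
from arize.utils.types import EmbeddingColumnNames, Environments, ModelTypes, Schema

# Placeholders: use your own Space key and API key from the Arize UI.
client = Client(space_key="YOUR_SPACE_KEY", api_key="YOUR_API_KEY")

# Declare which columns hold the embedding vector and its raw data
# (here, the "text_vector" column generated in the NLP example above).
schema = Schema(
    prediction_id_column_name="prediction_id",
    prediction_label_column_name="prediction_label",
    embedding_feature_column_names={
        "text_embedding": EmbeddingColumnNames(
            vector_column_name="text_vector",
            data_column_name="text",
        ),
    },
)

# Log the DataFrame, including the generated embeddings, to Arize.
response = client.log(
    dataframe=df,
    schema=schema,
    model_id="my-model",
    model_version="1.0",
    model_type=ModelTypes.SCORE_CATEGORICAL,
    environment=Environments.PRODUCTION,
)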
Check out our tutorials on generating embeddings for different use cases using Arize.
Use-Case | Code |
---|---|
NLP Sentiment Classification | |
CV Image Classification | |
Large Language Models | |
Embeddings for Tabular Data |