Natural Language Processing (NLP)

How to log your model schema for text classification use cases

NLP Model Overview

Text Classification Models predict the categories a piece of text might belong to.

NLP Cases | Expected Fields | Performance Metrics

*prediction label, actual label, prediction score, actual score

Accuracy, Recall, Precision, FPR, FNR, F1, Sensitivity, Specificity

*prediction label, actual label, prediction score, actual score

Accuracy, Recall, Precision, FPR, FNR, F1, Sensitivity, Specificity

*all classification variant specifications apply to the NLP model type, with the addition of embeddings
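For readers who want to see how the listed metrics relate to one another, here is a minimal sketch using the standard textbook definitions for a single positive class. It is not taken from the platform and may not match how the platform computes them; the names `BinaryCounts`, `count_outcomes`, `safe_div`, and `classification_metrics` are hypothetical helpers invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class BinaryCounts:
    tp: int  # predicted positive, actually positive
    fp: int  # predicted positive, actually not positive
    tn: int  # predicted not positive, actually not positive
    fn: int  # predicted not positive, actually positive


def count_outcomes(predicted, actual, positive_label):
    """Tally confusion-matrix cells, treating positive_label as the positive class."""
    tp = fp = tn = fn = 0
    for pred, act in zip(predicted, actual):
        if pred == positive_label and act == positive_label:
            tp += 1
        elif pred == positive_label:
            fp += 1
        elif act == positive_label:
            fn += 1
        else:
            tn += 1
    return BinaryCounts(tp, fp, tn, fn)


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(c):
    """Compute the metrics named in the table from confusion-matrix counts."""
    precision = safe_div(c.tp, c.tp + c.fp)
    recall = safe_div(c.tp, c.tp + c.fn)
    return {
        "accuracy": safe_div(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,  # sensitivity is another name for recall
        "specificity": safe_div(c.tn, c.tn + c.fp),
        "fpr": safe_div(c.fp, c.fp + c.tn),
        "fnr": safe_div(c.fn, c.fn + c.tp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Made-up labels, for illustration only.
predicted = ["positive", "positive", "neutral", "neutral"]
actual = ["positive", "neutral", "positive", "neutral"]
metrics = classification_metrics(count_outcomes(predicted, actual, "positive"))
```

With more than two labels, the same counts can be taken once per label (one-vs-rest), which is one common way to extend these definitions.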

Code Example

The EmbeddingColumnNames class constructs your embedding objects. You can log them into the platform using a dictionary that maps the embedding feature names to the embedding objects. See our API reference for more details.

Example Row

text_vector | text | prediction_label | actual_label | prediction_score | actual_score | Timestamp
[4.0, 5.0, 6.0, 7.0] | "This is a test sentence" | positive | neutral | 0.3 | 1 | 1618590882

from arize.pandas.logger import Client, Schema
from arize.utils.types import ModelTypes, Environments, EmbeddingColumnNames, TypedColumns

API_KEY = 'ARIZE_API_KEY'
SPACE_KEY = 'YOUR SPACE KEY'
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)


# Declare which columns are the feature columns
feature_column_names=[
    "MERCHANT_TYPE", 
    "ENTRY_MODE", 
    "STATE", 
    "MEAN_AMOUNT", 
    "STD_AMOUNT", 
    "TX_AMOUNT",
]

# feature & tag columns can be optionally defined with typing:
tag_columns = TypedColumns(
    inferred=["name"],
    to_int=["zip_code", "age"]
)

# Declare embedding feature columns
embedding_feature_column_names = {
    # Dictionary keys will be the name of the embedding feature in the app
    "embedding_display_name": EmbeddingColumnNames(
        vector_column_name="text_vector",  # column name of the vectors, required
        data_column_name="text", # column name of the raw data vectors are representing, optional
    )
}

# Define the Schema, including embedding information
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="PREDICTION",
    prediction_score_column_name="PREDICTION_SCORE",
    actual_label_column_name="ACTUAL",
    actual_score_column_name="ACTUAL_SCORE",
    feature_column_names=feature_column_names,
    embedding_feature_column_names=embedding_feature_column_names,
    tag_column_names=tag_columns,
)

# Log the dataframe with the schema mapping
# (test_dataframe is a pandas DataFrame containing the columns named in the schema)
response = arize_client.log(
    model_id="sample-model-1",
    model_version="v1",
    model_type=ModelTypes.SCORE_CATEGORICAL,
    environment=Environments.PRODUCTION,
    dataframe=test_dataframe,
    schema=schema,
)
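
The logging call above passes `test_dataframe` without showing how it is built. As a rough sketch, the Example Row could be turned into a one-row pandas DataFrame as follows. The `prediction_id` value is made up (the Example Row does not list one), the renaming to the schema's column names is an assumed mapping, and the tabular feature and tag columns from the snippet are left out, so the schema would need to drop or supply them before this frame could be logged.

```python
import pandas as pd

# One row, copied from the Example Row. prediction_id is a made-up value,
# since the Example Row does not include one.
example_row = pd.DataFrame(
    {
        "prediction_id": ["example-prediction-1"],
        "text_vector": [[4.0, 5.0, 6.0, 7.0]],  # dense embedding vector
        "text": ["This is a test sentence"],     # raw text the vector represents
        "prediction_label": ["positive"],
        "actual_label": ["neutral"],
        "prediction_score": [0.3],
        "actual_score": [1.0],
        "Timestamp": [1618590882],
    }
)

# Rename to the column names the Schema in the snippet points at.
# This mapping is an assumption made for illustration.
test_dataframe = example_row.rename(
    columns={
        "prediction_label": "PREDICTION",
        "actual_label": "ACTUAL",
        "prediction_score": "PREDICTION_SCORE",
        "actual_score": "ACTUAL_SCORE",
        "Timestamp": "prediction_ts",
    }
)
```

Alternatively, the DataFrame can keep the Example Row's column names and the Schema arguments can be changed to reference them; either way the names in the Schema and the DataFrame have to agree.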

NLP Embedding Features

Arize supports logging both the embedding vectors for the text the model acts on and the text itself, using the EmbeddingColumnNames object.

  • The vector_column_name should be the name of the column where the embedding vectors are stored. The embedding vector is the dense vector representation of the unstructured input. ⚠️ Note: embedding features are not sparse vectors.

  • The data_column_name should be the name of the column where the raw text associated with the vector is stored. It is the field typically chosen for NLP use cases. The column can contain either strings (full sentences) or lists of strings (token arrays).

{ 
    "embedding_display_name": EmbeddingColumnNames(
        vector_column_name="text_vector", 
        data_column_name="text" 
    ) 
}
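
Before logging, it can help to confirm that the two columns match the descriptions above. The following is a minimal sketch of such a check, assuming the data lives in a pandas DataFrame; `check_embedding_columns` is a hypothetical helper written for this illustration and is not part of the Arize SDK.

```python
import numpy as np
import pandas as pd


def check_embedding_columns(df, vector_column_name, data_column_name=None):
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    # Each vector should be a dense, one-dimensional sequence of numbers.
    for idx, vec in df[vector_column_name].items():
        try:
            arr = np.asarray(vec, dtype=float)
        except (TypeError, ValueError):
            problems.append(f"row {idx}: vector is not a dense numeric sequence")
            continue
        if arr.ndim != 1 or arr.size == 0:
            problems.append(f"row {idx}: vector must be one-dimensional and non-empty")

    # The raw data should be a string or a list of strings (token array).
    if data_column_name is not None:
        for idx, text in df[data_column_name].items():
            is_sentence = isinstance(text, str)
            is_tokens = isinstance(text, (list, tuple)) and all(
                isinstance(token, str) for token in text
            )
            if not (is_sentence or is_tokens):
                problems.append(f"row {idx}: text is neither a string nor a list of strings")

    return problems


# Made-up rows: a full sentence and a token array, both with dense vectors.
good = pd.DataFrame(
    {
        "text_vector": [[4.0, 5.0, 6.0, 7.0], [1.0, 2.0, 3.0, 4.0]],
        "text": ["This is a test sentence", ["This", "is", "tokenized"]],
    }
)
# A sparse-style mapping and a non-text value, both of which should be flagged.
bad = pd.DataFrame({"text_vector": [{0: 1.0, 7: 2.5}], "text": [42]})

good_problems = check_embedding_columns(good, "text_vector", "text")
bad_problems = check_embedding_columns(bad, "text_vector", "text")
```

The helper collects messages rather than raising, so every offending row can be reported in one pass.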

See here for more information on embeddings and options for generating them.

Copyright © 2023 Arize AI, Inc