Arize AI

arize.pandas (Batch)

Batch Logging - Designed for sending batches of data to Arize

Overview

The Pandas API is designed for proof-of-concept (POC) or production environments where data is processed in batches. These environments may be either a Jupyter Notebook or a Python server that batch processes large volumes of backend data.
Import and initialize the Arize client from arize.pandas.logger, then call Client.log() with a pandas.DataFrame containing inference data.
Client.log() returns a requests.models.Response object. You can check its HTTP status code to ensure successful delivery of records.
This API uses fast serialization to the file system from Python, followed by a fast client-to-server upload. It requires storage on the file system for the file being uploaded.
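The status check described above can be sketched as follows. This is a minimal illustration, not part of the Arize SDK: the helper name delivered_ok is hypothetical, it assumes that HTTP 200 indicates successful delivery, and the SimpleNamespace objects are stand-ins for the requests.models.Response that Client.log() returns.

```python
from types import SimpleNamespace


def delivered_ok(response) -> bool:
    """Return True when the response reports HTTP 200 (assumed success code)."""
    if response.status_code == 200:
        return True
    # status_code and text are standard attributes of requests.models.Response
    print(f"Logging failed with status {response.status_code}: {response.text}")
    return False


# Stand-in objects so the sketch runs without sending any data.
# In real use, pass the value returned by arize_client.log(...).
ok_response = SimpleNamespace(status_code=200, text="")
bad_response = SimpleNamespace(status_code=400, text="invalid schema")

ok_result = delivered_ok(ok_response)
bad_result = delivered_ok(bad_response)
```

Checking the response right after each log call makes a failed upload visible immediately instead of surfacing later as missing data in the platform.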

Initialize Arize Client

from arize.pandas.logger import Client, Schema
API_KEY = 'ARIZE_API_KEY'
SPACE_KEY = 'YOUR SPACE KEY'
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)
If using version < 4.0.0, replace space_key=SPACE_KEY with organization_key=SPACE_KEY

Parameters & Returns

response = arize_client.log(
    dataframe,
    path,
    model_id,
    model_version,
    metrics_validation,
    batch_id,
    model_type,
    environment,
    sync,
    surrogate_explainability,
    schema=Schema(
        prediction_id_column_name,
        feature_column_names,
        embedding_feature_column_names,
        tag_column_names,
        timestamp_column_name,
        prediction_label_column_name,
        prediction_score_column_name,
        actual_label_column_name,
        shap_values_column_names
    ))
type(response) == requests.models.Response
| Parameter | Data Type | Description | Required |
| --- | --- | --- | --- |
| dataframe | pandas.DataFrame | The dataframe with your predictions | Required |
| model_id | string | The unique identifier for your model | Required |
| model_version | string | Used to group together a subset of predictions and actuals for a given model_id | Required for logging predictions. Optional for logging actuals or shap values. |
| batch_id | string | Used to distinguish different batches of data under the same model_id and model_version. | Optional. Only applicable and required for validation environment. |
| model_type | arize.utils.types.ModelTypes | Declares what model type this prediction is for. | Required |
| metrics_validation | arize.utils.types.Metrics | A list of desired metric groups; defaults to None. When populated, and if validate=True, the presence of schema columns is validated against the desired metrics. | Optional |
| environment | arize.utils.types.Environments | The environment that this dataframe is for (Production, Training, Validation) | Required |
| schema | arize.pandas.logger.Schema | A Schema instance that specifies the column names for corresponding data in the dataframe. More details below | Required |
| path | string | Temporary directory/file to store the serialized data in binary before sending to Arize | Optional |
| sync | boolean | When sync is set to True, the log call will block, or wait, until the data has been successfully ingested by the platform and immediately return the status of the log. | Optional |
| surrogate_explainability | boolean | Computes feature importance values using the surrogate explainability method. This requires that the arize module is installed with the [MimicExplainer] option. If feature importance values are already specified by the shap_values_column_names attribute in the Schema, this module will not run. | Optional |

Schema Attributes

| Attribute | Data Type | Description | Required |
| --- | --- | --- | --- |
| prediction_id_column_name | str | Column name for prediction_id. The content of this column must be string. | Required |
| feature_column_names | List[str] | List of column names for features. The content of these columns can be int, float, string. | Optional |
| embedding_feature_column_names | Dict[str, EmbeddingColumnNames] | Dictionary mapping embedding display names to EmbeddingColumnNames objects | Optional |
| timestamp_column_name | str | Column name for timestamps. The content of this column must be integer Unix timestamps in seconds. | Optional |
| prediction_label_column_name | str | Column name for prediction label. The content of this column must be string. | Optional |
| prediction_score_column_name | str | Column name for prediction scores. The content of this column must be int/float. | Optional |
| actual_label_column_name | str | Column name for actual label. The content of this column must be string. | Optional |
| actual_score_column_name | str | Column name for actual scores, or relevance scores in ranking models. The content of this column must be int/float. | Optional |
| tag_column_names | List[str] | List of column names for tags. The content of these columns can be int, float, string. | Optional |
| shap_values_column_names | Dict[str, str] | Dictionary of key-value pairs where each key is a feature column name and each value is the corresponding SHAP value column name. For example, if your dataframe contains feature columns feat1, feat2, feat3, ... and corresponding SHAP value columns feat1_shap, feat2_shap, feat3_shap, ..., set shap_values_column_names = {"feat1": "feat1_shap", "feat2": "feat2_shap", "feat3": "feat3_shap"} | Optional |
| prediction_group_id_column_name | str | Column name for ranking groups or lists in ranking models. The content of this column must be string. | Required for ranking model |
| rank_column_name | str | Column name for the rank of each element within its group or list. The content of this column must be an integer between 1 and 100. | Required for ranking model |
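
The type constraints in the Schema Attributes table (string prediction IDs, integer Unix timestamps in seconds, a feature-to-SHAP column mapping) can be met with a few pandas operations before calling Client.log(). The following is a minimal sketch under stated assumptions: the column names (prediction_time, feat1, feat1_shap, ...) are illustrative, and naive datetimes are treated as UTC.

```python
import pandas as pd

# Illustrative dataframe; all column names here are hypothetical.
df = pd.DataFrame({
    "prediction_id": [101, 102, 103],
    "prediction_time": pd.to_datetime(
        ["2023-01-01 00:00:00", "2023-01-01 00:00:30", "2023-01-02 00:00:00"]
    ),
    "feat1": [0.1, 0.2, 0.3],
    "feat2": [1, 0, 1],
    "feat1_shap": [0.01, -0.02, 0.03],
    "feat2_shap": [0.10, 0.00, -0.10],
})

# The prediction ID column must contain strings.
df["prediction_id"] = df["prediction_id"].astype(str)

# The timestamp column must contain integer Unix seconds.
# Subtracting the epoch and floor-dividing by one second gives whole seconds
# (naive datetimes are assumed to be UTC).
epoch = pd.Timestamp("1970-01-01")
df["prediction_ts"] = (df["prediction_time"] - epoch) // pd.Timedelta(seconds=1)

# Map each feature column to its SHAP value column.
feature_cols = ["feat1", "feat2"]
shap_map = {name: f"{name}_shap" for name in feature_cols}

# Confirm every mapped SHAP column actually exists in the dataframe.
missing = [col for col in shap_map.values() if col not in df.columns]
```

Here shap_map is the value you would pass as shap_values_column_names, and prediction_ts is the column you would name in timestamp_column_name.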

Embedding Column Names

Arize's Embedding object is formed by 3 pieces of information: the vector (required), the data (optional) and the link to data (optional). When creating a batched job, we need to map up to 3 columns in a table to a single embedding feature, as opposed to the 1:1 relationship that exists with regular features. For this purpose, Arize provides the EmbeddingColumnNames object.
| Attribute | Data Type | Description | Required |
| --- | --- | --- | --- |
| vector_column_name | str | Column name for the vector of a given embedding feature. The contents of this column must be List[float] or a numpy.ndarray of floats. | Required |
| data_column_name | str | Column name for the data of a given embedding feature, typically the raw text associated with the embedding vector. The contents of this column must be str or List[str]. | Optional |
| link_to_data_column_name | str | Column name for the link to data of a given embedding feature, typically a link to the data file (image, audio, ...) associated with the embedding vector. The contents of this column must be str. | Optional |
NOTE: Currently, Arize only supports links to image files.
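
Because the vector column must hold lists or arrays of floats, it can help to check it before logging. The sketch below is not part of the Arize SDK: the helper is_float_vector and the column names are hypothetical, and the check is deliberately strict (a list of Python ints is flagged), which is an assumption based on the List[float] requirement above.

```python
import numpy as np
import pandas as pd


def is_float_vector(value) -> bool:
    """Return True for a 1-D float ndarray or a list made only of floats."""
    if isinstance(value, np.ndarray):
        return value.ndim == 1 and np.issubdtype(value.dtype, np.floating)
    if isinstance(value, list):
        return all(isinstance(x, float) for x in value)
    return False


# Illustrative dataframe: rows 0 and 1 are valid, rows 2 and 3 are not.
df = pd.DataFrame({
    "text_vector": [
        [0.1, 0.2, 0.3],             # list of floats
        np.array([0.4, 0.5, 0.6]),   # float ndarray
        [1, 2, 3],                   # list of ints, flagged by this strict check
        "not a vector",              # wrong type entirely
    ],
    "text": ["first doc", "second doc", "third doc", "fourth doc"],
})

# Index labels of rows whose vector fails the check.
valid = df["text_vector"].map(is_float_vector)
bad_rows = df.index[~valid].tolist()
print(bad_rows)
```

Rows reported in bad_rows can be converted (for example with a float cast) or dropped before the dataframe is passed to Client.log().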

Model Types

Specifying model_type is optional, but we recommend setting it the first time you log to a new model. When a prediction is logged for the first time for a new model, the Arize platform classifies the model automatically based on the data type of the prediction.
from arize.utils.types import ModelTypes
| Model Type Use Case | SDK ModelType |
| --- | --- |
| Regression | ModelTypes.REGRESSION |
| Binary Classification | ModelTypes.BINARY_CLASSIFICATION |
| Multi Class | ModelTypes.SCORE_CATEGORICAL |
| Ranking | ModelTypes.RANKING |
| Natural Language Processing (NLP) | ModelTypes.SCORE_CATEGORICAL |
| Computer Vision (CV) | ModelTypes.SCORE_CATEGORICAL |

Code Example

from arize.utils.types import Environments, Metrics, ModelTypes

# your_sample_df, feature_cols, tag_cols and shap_cols are assumed to be
# defined already; shap_cols must be in the same order as feature_cols.
response = arize_client.log(
    dataframe=your_sample_df,
    path="inferences.bin",
    model_id="fraud-model",
    model_version="1.0",
    batch_id=None,
    model_type=ModelTypes.REGRESSION,
    metrics_validation=[Metrics.REGRESSION],
    environment=Environments.PRODUCTION,
    schema=Schema(
        prediction_id_column_name="prediction_id",
        timestamp_column_name="prediction_ts",
        prediction_label_column_name="prediction_label",
        actual_label_column_name="actual_label",
        feature_column_names=feature_cols,
        tag_column_names=tag_cols,
        # Pair each feature column with its SHAP value column
        shap_values_column_names=dict(zip(feature_cols, shap_cols)),
    ),
)

Benchmark tests of the Arize Python SDK

Ingesting data with low latency is important. The benchmarking Colab below demonstrates how efficiently Arize uploads data from a Python environment:
Sending 10 Million Inferences to Arize in 90 Seconds

Convert mixed type columns to Float

Data ingestion rejects datasets with mixed type columns. These columns should be converted to Float before sending. Below is an example of a mixed type column in Pandas and how to convert it.
import pandas as pd
# Example Series with mixed types
mixed = pd.Series([1, "", 2]) # it has numbers and strings
mixed.dtype # dtype('O')
# It should be converted to float
# Replace "" with NaN, then cast explicitly so the result does not
# depend on the pandas version's automatic downcasting in replace()
mixed = mixed.replace("", float("NaN")).astype(float)
mixed.dtype # dtype('float64')
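
To locate such columns across a whole dataframe before logging, a check along these lines may help. This is a minimal sketch, not an Arize utility: the function name find_mixed_type_columns and the sample columns are hypothetical, and it only inspects object-dtype columns, where mixed Python types can occur.

```python
import pandas as pd


def find_mixed_type_columns(df: pd.DataFrame) -> list:
    """Return names of object-dtype columns whose non-null values
    have more than one Python type."""
    mixed_cols = []
    for name in df.columns:
        col = df[name]
        if col.dtype != object:
            continue  # numeric and other typed columns cannot mix types
        value_types = {type(value) for value in col.dropna()}
        if len(value_types) > 1:
            mixed_cols.append(name)
    return mixed_cols


# Illustrative dataframe: only "amount" mixes numbers and strings.
df = pd.DataFrame({
    "amount": [1, "", 2],
    "state": ["CA", "NY", "TX"],
    "age": [31, 42, 27],
})

mixed_cols = find_mixed_type_columns(df)
print(mixed_cols)  # ['amount']
```

Each column reported this way can then be converted with the replace-and-cast approach shown above.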
Questions? Email us at [email protected] or Slack us in the #arize-support channel