arize.pandas (Batch)
Batch Logging - Designed for sending batches of data to Arize
The Pandas API is designed for proof-of-concept (POC) or production environments where data is processed in batches. These environments may be a Jupyter Notebook or a Python server that batch-processes large volumes of backend data.
Import and initialize the Arize client from arize.pandas.logger, then call Client.log() with a pandas.DataFrame containing inference data. Client.log() returns a requests.models.Response object; check its HTTP status code to confirm that the records were delivered.
This API uses fast serialization to the file system from Python, followed by a fast client-to-server upload. It therefore requires storage in the file system for the file being uploaded.
from arize.pandas.logger import Client, Schema
API_KEY = 'ARIZE_API_KEY'
SPACE_KEY = 'YOUR SPACE KEY'
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)
If using version < 4.0.0, replace space_key=SPACE_KEY with organization_key=SPACE_KEY.
response = arize_client.log(
    dataframe,
    path,
    model_id,
    model_version,
    metrics_validation,
    batch_id,
    model_type,
    environment,
    sync,
    surrogate_explainability,
    schema=Schema(
        prediction_id_column_name,
        feature_column_names,
        embedding_feature_column_names,
        tag_column_names,
        timestamp_column_name,
        prediction_label_column_name,
        prediction_score_column_name,
        actual_label_column_name,
        shap_values_column_names
    ))
type(response) == requests.models.Response
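Checking the returned response can be sketched as below. This is a minimal illustration, not part of the Arize SDK: the helper names `delivery_succeeded` and `describe_response` are hypothetical, treating status 200 as success is an assumption, and `SimpleNamespace` objects stand in for a real `requests.models.Response` so the snippet runs without a network call.

```python
from types import SimpleNamespace


def delivery_succeeded(response) -> bool:
    """Return True when the response carries HTTP status 200 (assumed success code)."""
    return response.status_code == 200


def describe_response(response) -> str:
    """Build a short, human-readable summary of a log() response."""
    if delivery_succeeded(response):
        return "logged successfully"
    return f"logging failed with status {response.status_code}: {response.text}"


# Stand-ins for requests.models.Response; only status_code and text are used.
ok = SimpleNamespace(status_code=200, text="")
bad = SimpleNamespace(status_code=400, text="invalid schema")

print(describe_response(ok))
print(describe_response(bad))
```

In real use, the object returned by `arize_client.log(...)` would be passed to the helper in place of the stand-ins.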
Parameter | Data Type | Description | Required |
---|---|---|---|
dataframe | pandas.DataFrame | The dataframe with your predictions | Required |
model_id | string | The unique identifier for your model | Required |
model_version | string | Used to group together a subset of predictions and actuals for a given model_id | Required for logging predictions. Optional for logging actuals or shap values. |
batch_id | string | Used to distinguish different batches of data under the same model_id and model_version. | Optional. Only applicable and required for validation environment. |
model_type | arize.utils.types.ModelTypes | Declares what model type this prediction is for. | Required |
metrics_validation | arize.utils.types.Metrics | A list of desired metric groups; defaults to None. When populated, and if validate=True, the presence of schema columns is validated against the desired metrics. | Optional |
environment | arize.utils.types.Environments | The environment that this dataframe is for (Production, Training, Validation) | Required |
schema | arize.pandas.logger.Schema | A Schema instance that specifies the column names for corresponding data in the dataframe. More details below | Required |
path | string | Temporary directory/file to store the serialized data in binary before sending to Arize | Optional |
sync | boolean | When sync is set to True, the log call will block, or wait, until the data has been successfully ingested by the platform and immediately return the status of the log. | Optional |
surrogate_explainability | boolean | Computes feature importance values using the surrogate explainability method. This requires that the arize module is installed with the [MimicExplainer] option. If feature importance values are already specified by the shap_values_column_names attribute in the Schema, this module will not run. | Optional |
Attribute | Data Type | Description | Required |
---|---|---|---|
prediction_id_column_name | str | Column name for prediction_id. The content of this column must be string. | Required |
feature_column_names | List[str] | List of column names for features. The content of this column can be int, float, string. | Optional |
embedding_feature_column_names | Dict[str, EmbeddingColumnNames] | Dictionary mapping each embedding feature name to its EmbeddingColumnNames object. | Optional |
timestamp_column_name | str | Column name for timestamps. The content of this column must be int Unix timestamps in seconds. | Optional |
prediction_label_column_name | str | Column name for prediction label. The content of this column must be string. | Optional |
prediction_score_column_name | str | Column name for prediction scores. The content of this column must be int/float. | Optional |
actual_label_column_name | str | Column name for actual label. The content of this column must be string. | Optional |
actual_score_column_name | str | Column name for actual scores, or relevance scores in ranking model. The content of this column must be int/float. | Optional |
tag_column_names | List[str] | List of column names for tags. The content of this column can be int, float, string. | Optional |
shap_values_column_names | Dict[str, str] | Dictionary of key-value pairs where each key is a feature column name and each value is the corresponding SHAP value column name. For example, if your dataframe contains feature columns feat1, feat2, feat3, ... and corresponding SHAP value columns feat1_shap, feat2_shap, feat3_shap, ..., set shap_values_column_names = {"feat1": "feat1_shap", "feat2": "feat2_shap", "feat3": "feat3_shap"} | Optional |
prediction_group_id_column_name | str | Column name for ranking groups or lists in ranking models. The content of this column must be string. | Required for ranking model |
rank_column_name | str | Column name for the rank of each element within its group or list. The content of this column must be an integer between 1 and 100. | Required for ranking model |
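Because the timestamp column must hold integer Unix timestamps in seconds, datetime values need converting before logging. The following is a minimal sketch using only the standard library; the helper `to_unix_seconds` is hypothetical, and treating naive datetimes as UTC is an assumption you should adapt to your own data.

```python
from datetime import datetime, timezone


def to_unix_seconds(value: datetime) -> int:
    """Convert a datetime to integer Unix seconds.

    Naive datetimes are treated as UTC (an assumption for this sketch).
    """
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)
    return int(value.timestamp())


prediction_times = [
    datetime(2023, 1, 1, 0, 0, 0),                        # naive, assumed UTC
    datetime(2023, 1, 1, 12, 30, 0, tzinfo=timezone.utc),  # timezone-aware
]

# Values suitable for the column named by timestamp_column_name.
prediction_ts = [to_unix_seconds(t) for t in prediction_times]
print(prediction_ts)  # [1672531200, 1672576200]
```

Note that the result is in seconds, not milliseconds, which is what the Schema table above asks for.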
Arize's Embedding object is formed by three pieces of information: the vector (required), the data (optional), and the link to data (optional). When creating a batched job, we need to map up to three columns in a table to a single embedding feature, as opposed to the 1:1 relationship that exists with regular features. For this purpose, Arize provides the EmbeddingColumnNames object.
Attribute | Data Type | Description | Required |
---|---|---|---|
vector_column_name | str | Column name for the vector of a given embedding feature. The contents of this column must be List[float] or np.ndarray[float]. | Required |
data_column_name | str | Column name for the data of a given embedding feature, typically the raw text associated with the embedding vector. The contents of this column must be str or List[str] . | Optional |
link_to_data_column_name | str | Column name for the link to data of a given embedding feature, typically a link to the data file (image, audio, ...) associated with the embedding vector. The contents of this column must be str . | Optional |
NOTE: Currently, Arize only supports links to image files.
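The mapping from one embedding feature to up to three columns can be sketched with plain dictionaries. Everything here is illustrative: the feature and column names are made up, `validate_embedding_columns` is a hypothetical helper rather than an SDK function, and the closing comment assumes that EmbeddingColumnNames accepts the three attributes from the table above as keyword arguments.

```python
# The three attributes listed in the EmbeddingColumnNames table.
ALLOWED_KEYS = {
    "vector_column_name",
    "data_column_name",
    "link_to_data_column_name",
}


def validate_embedding_columns(embedding_columns: dict) -> None:
    """Check that each embedding feature names a vector column and uses only known keys."""
    for feature_name, columns in embedding_columns.items():
        unknown = set(columns) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"{feature_name}: unknown keys {sorted(unknown)}")
        if "vector_column_name" not in columns:
            raise ValueError(f"{feature_name}: vector_column_name is required")


# Hypothetical dataframe columns for two embedding features.
embedding_columns = {
    "review_text_embedding": {
        "vector_column_name": "review_vector",
        "data_column_name": "review_text",        # raw text behind the vector
    },
    "product_image_embedding": {
        "vector_column_name": "image_vector",
        "link_to_data_column_name": "image_url",  # link to an image file
    },
}

validate_embedding_columns(embedding_columns)

# With the SDK installed, each inner dict could (by assumption) be unpacked as
# EmbeddingColumnNames(**columns) to build embedding_feature_column_names.
```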
Specifying model_type is optional, but we recommend doing so the first time you log predictions for a model. When logging a prediction for the first time for a new model, we classify the model in the Arize platform automatically based on the data type of the prediction.
from arize.utils.types import ModelTypes
Model Type Use Case | SDK ModelType |
---|---|
Regression | ModelTypes.REGRESSION |
Binary Classification | ModelTypes.BINARY_CLASSIFICATION |
Multi Class | ModelTypes.SCORE_CATEGORICAL |
Ranking | ModelTypes.RANKING |
Natural Language Processing (NLP) | ModelTypes.SCORE_CATEGORICAL |
Computer Vision (CV) | ModelTypes.SCORE_CATEGORICAL |
from arize.utils.types import Environments, Metrics

# your_sample_df, feature_cols, tag_cols, and shap_cols must be defined beforehand
response = arize_client.log(
    dataframe=your_sample_df,
    path="inferences.bin",
    model_id="fraud-model",
    model_version="1.0",
    batch_id=None,
    model_type=ModelTypes.REGRESSION,
    metrics_validation=[Metrics.REGRESSION],
    environment=Environments.PRODUCTION,
    schema=Schema(
        prediction_id_column_name="prediction_id",
        timestamp_column_name="prediction_ts",
        prediction_label_column_name="prediction_label",
        actual_label_column_name="actual_label",
        feature_column_names=feature_cols,
        tag_column_names=tag_cols,
        # Map each feature column to its SHAP value column
        shap_values_column_names=dict(zip(feature_cols, shap_cols))
    )
)
The ability to ingest data with low latency is important. The following benchmarking Colab demonstrates how efficiently Arize uploads data from a Python environment:
Sending 10 Million Inferences to Arize in 90 Seconds
Data ingestion rejects datasets with mixed-type columns. These columns should be converted to float before sending. Below is an example of a mixed-type column in pandas and how to convert it.
import pandas as pd

# Example Series with mixed types: numbers and strings
mixed = pd.Series([1, "", 2])
mixed.dtype  # dtype('O')

# Replace "" with NaN, then cast explicitly to float
# (the explicit cast does not rely on replace() inferring the dtype)
mixed = mixed.replace("", float("NaN")).astype(float)
mixed.dtype  # dtype('float64')
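To find such columns before logging, a dataframe can be scanned for columns whose values have more than one Python type. This is a minimal sketch, assuming pandas is installed; `find_mixed_type_columns` is a hypothetical helper and the sample data is made up.

```python
import pandas as pd


def find_mixed_type_columns(df: pd.DataFrame) -> list:
    """Return names of columns whose non-null values have more than one Python type."""
    mixed_columns = []
    for name in df.columns:
        distinct_types = df[name].dropna().map(type).nunique()
        if distinct_types > 1:
            mixed_columns.append(name)
    return mixed_columns


df = pd.DataFrame(
    {
        "amount": [1, "", 2],     # numbers and an empty string: mixed
        "city": ["a", "b", "c"],  # strings only
        "score": [0.1, 0.2, 0.3], # floats only
    }
)

mixed_cols = find_mixed_type_columns(df)
print(mixed_cols)  # ['amount']

# Convert the offending column: empty strings become NaN, then cast to float.
cleaned = df["amount"].replace("", float("nan")).astype(float)
print(cleaned.dtype)  # float64
```

Whether an empty string should become NaN or some other value depends on your data; the replacement used here simply mirrors the example above.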