Sending Data FAQ
When sending delayed actuals, specify the `model_id` in your schema to match your actuals to the correct model. Delayed actuals are mapped back to predictions via a `model_id` and `prediction_id`, regardless of version. This means that if you have the same `prediction_id` in multiple model versions, the actual will be joined to each row with the matching `prediction_id`.
If you send actuals to Arize to log delayed actuals (when preexisting predictions already exist in Arize), Arize joins the delayed actuals with the correlating prediction IDs in the platform at 5 AM UTC daily.
However, if you have never logged predictions for your model, you must upload prediction values corresponding to your actuals (using the same prediction ID) to view your model in Arize.
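For example, here is a minimal sketch of logging delayed actuals with the Arize pandas SDK. The credentials, column names, and model ID are illustrative placeholders, and the exact client arguments can vary by SDK version:

```python
import pandas as pd
from arize.pandas.logger import Client
from arize.utils.types import Environments, ModelTypes, Schema

# Hypothetical credentials and dataframe; replace with your own
client = Client(space_key="YOUR_SPACE_KEY", api_key="YOUR_API_KEY")

actuals_df = pd.DataFrame(
    {
        "prediction_id": ["1fcd50f4689", "2ab341c7790"],
        "actual_label": ["No Claims", "Claims"],
    }
)

# Log actuals only: Arize joins them to existing predictions
# with the same prediction_id during the daily join.
response = client.log(
    dataframe=actuals_df,
    model_id="your-model-id",  # must match the model of the original predictions
    model_type=ModelTypes.SCORE_CATEGORICAL,
    environment=Environments.PRODUCTION,
    schema=Schema(
        prediction_id_column_name="prediction_id",
        actual_label_column_name="actual_label",
    ),
)
```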
The Arize engineering support team can extend your join windows. Reach out to [email protected] for help with this.
The configurations are applied at the space level. All models in the space will receive the same join window configuration.
Arize only calculates performance metrics on predictions that have actuals. Null rows are ignored when calculating performance metrics. If actuals have not been received yet (delayed actuals), refer to our default actual support information on how Arize handles nulls for your use case.
The data ingestion date is the timestamp of when Arize received the data. The Data Ingestion tab in the Arize UI shows real-time ingestion stats, broken out daily.
The prediction timestamp represents when your model's prediction was made (time of inference). This timestamp is a column of data sent to Arize with the model inference data. Arize time series charts are based on the prediction timestamp.
Your timestamp can be a Unix timestamp in seconds or an RFC3339 timestamp. If you do not specify a timestamp, it will default to the time of file ingestion.
Arize supports sending in historical data with prediction timestamps up to 2 years before the current timestamp. The data sent to Arize will be retained and visible in Arize for up to 2 years. Reach out to [email protected] to extend this window.
The current cap is 10 million rows; we recommend splitting the data into multiple files for larger uploads.
What is the advised file size to optimize upload performance?
While our system has the capacity to accommodate files up to 1GB, we find that files around the 50MB mark usually provide the best balance between volume of data and system performance. This size typically represents less than a million rows of data.
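To stay near that size, you can split a large dataframe into multiple Parquet files before uploading. A minimal sketch; the file names and chunk size are illustrative and should be tuned so each output file lands near the ~50MB sweet spot:

```python
import pandas as pd

df = pd.read_parquet("large_file.parquet")  # hypothetical large dataset

# Write one Parquet file per chunk of ~500k rows
chunk_size = 500_000
for i, start in enumerate(range(0, len(df), chunk_size)):
    df.iloc[start : start + chunk_size].to_parquet(f"upload_part_{i}.parquet")
```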
If the prediction timestamp column isn't correctly set, import jobs may result in parsing errors and fail. To make sure this doesn't happen, ensure that:
- the timestamp format is in seconds (not something more granular) or RFC3339
- the timestamp is within a year of today's date (either past or future)
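If your timestamps are datetime strings, a quick pandas conversion to Unix epoch seconds (the column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"prediction_ts": ["2023-05-01 12:30:00", "2023-05-01 13:45:00"]})

# Parse the strings to datetimes, then convert nanoseconds since epoch to seconds
df["prediction_ts"] = pd.to_datetime(df["prediction_ts"]).astype("int64") // 10**9
```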
Training and validation records must include both `prediction` and `actual` columns; otherwise, the upload will result in a data validation error.
If the expected data type is numeric but the values come in as strings:
- Ensure there are no string values in numeric columns
- If `None` or `Null` values are used to represent empty numerics, represent them instead as `NaN`
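A quick way to clean up such a column with pandas, assuming a hypothetical `prediction_score` column: `pd.to_numeric` with `errors="coerce"` turns unparseable strings and `None` into `NaN`:

```python
import pandas as pd

df = pd.DataFrame({"prediction_score": ["0.12", None, "nan", "0.87"]})

# Coerce strings to floats; None and unparseable values become NaN
df["prediction_score"] = pd.to_numeric(df["prediction_score"], errors="coerce")
```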
Arize connectors fall into two categories, Data Warehouses/Lakes and Object Stores, which serve distinct data management and analytical needs.
- Data Warehouses/Lakes (Snowflake, GBQ, Databricks): Used for real-time analysis of high-volume datasets. They are the prime choice for quick, continual, and comprehensive data exploration.
- Object Stores (GCS, AWS S3, Azure): Designed for reliable and cost-effective storage of substantial data quantities, these connectors are ideal for backup and archival purposes. They are the go-to option when long-term data preservation and accessibility are important.
If your embeddings or ranking data are part of a CSV file, convert the data to Parquet before uploading to Arize. To convert your file:
```python
import math

import pandas as pd


def parse_vector(v, sep=",", delim="[]"):
    """
    Convert a string representation of an embeddings vector to a list of floats.
    Modify the separator and delimiters to match your vector representation.
    Example of a string vector representation: '[float1, float2, float3]'
    """
    # Pass through non-string missing values (e.g., float NaN)
    if not isinstance(v, str) and math.isnan(v):
        return v
    # Treat the literal string "nan" as a missing value
    if isinstance(v, str) and v.lower() == "nan":
        return float("nan")
    v_list = v.strip(delim).replace("\n", "").split(sep)
    return [float(k.strip(" ")) for k in v_list]


df = pd.read_csv("<file name>.csv")
df["<vector column>"] = df["<vector column>"].apply(parse_vector)
df.to_parquet("<file name>.parquet")
```
The contents of a file can vary based on the model type, with the exception of required fields (i.e., `prediction_id`). The tables below list the supported input data fields and data types for each supported format: CSV, Parquet, Avro, Apache Arrow, and BigQuery.
CSV
Input Data Field | Data Type |
---|---|
prediction_id | string |
prediction_label / actual_label | string / int / float |
prediction_score / actual_score | int / float |
timestamp | int (Unix seconds) / RFC3339 string |
features | string / int / float / bool |
tags | string / int / float / bool |
shap_values | int / float |
Parquet

Input Data Field | Parquet Data Type |
---|---|
prediction_id | |
prediction_label / actual_label | |
prediction_score / actual_score | |
timestamp | |
features / tags | |
shap_values | |
embedding_feature:vector | list of {int8, int16, int32, int64, float32, float64} |
embedding_feature:raw_data | |
embedding_feature:link_to_data | string |
ranking:prediction_group_id | |
ranking:rank | int8, int16, int32, int64 |
ranking:category | array of strings |
ranking:relevance_score | |
Avro
Use the Avro schema embedded in the header of the Avro Object Container File (OCF) to decode fields and match them to those specified in the Arize file importer schema for data ingestion. Field names in the OCF file must:
- start with [A-Za-z_]
- subsequently contain only [A-Za-z0-9_]
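A quick way to inspect the embedded schema and check field names, sketched here with the third-party fastavro library (the file name is a placeholder):

```python
import re

from fastavro import reader

# Read the writer's schema from the OCF header
with open("predictions.avro", "rb") as f:
    schema = reader(f).writer_schema

# Check each field name against the naming rules above
name_pattern = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
for field in schema["fields"]:
    if not name_pattern.match(field["name"]):
        print(f"invalid field name: {field['name']}")
```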
Input Data Field | Avro Data Type |
---|---|
prediction_id | long, int, string |
prediction_label / actual_label | |
prediction_score / actual_score | int, long, float, double |
timestamp | |
features / tags | |
shap_values | int, long, float, double |
embedding_feature:vector | array of {int, long, float, double} |
embedding_feature:raw_data | |
embedding_feature:link_to_data | string |
ranking:prediction_group_id | long, int, string |
ranking:rank | int, long |
ranking:category | array of strings |
ranking:relevance_score | int, long, float, double |
Apache Arrow
This example shows what an Arrow file's columns and the corresponding schema would look like. The "*" wildcard can be used to add features to a file without changing the schema.

Column Name in File | Arize Schema |
---|---|
my-prediction-ts | prediction_timestamp |
my-prediction-id-customer | prediction_id |
my-prediction-score | prediction_score |
my-prediction-label | prediction_label |
my-feature.addr_state | features |
my-feature.revenue | features |
my-environment | environment |
my-actual-label | actual_label |

Note that the "my-feature" prefix maps to multiple feature columns.

```yaml
ModelSchema:
  prediction_timestamp: "my-prediction-ts"
  prediction_id: "my-prediction-id-customer"
  prediction_score: "my-prediction-score"
  prediction_label: "my-prediction-label"
  features: "my-feature.*"  # the wildcard matches the feature columns above, "addr_state" and "revenue"
```
BigQuery

Input Data Field | GBQ Data Type |
---|---|
prediction_id | INT64, STRING |
change_timestamp | TIMESTAMP |
prediction_label / actual_label | |
prediction_score / actual_score | INT64, NUMERIC, FLOAT64 |
timestamp | |
features / tags | |
shap_values | INT64, FLOAT64, NUMERIC |
embedding_feature:vector | ARRAY of {INT64, FLOAT64, NUMERIC} |
embedding_feature:raw_data | |
embedding_feature:link_to_data | STRING |
ranking:prediction_group_id | INT64, STRING |
ranking:rank | INT64 |
ranking:category | ARRAY of STRING |
ranking:relevance_score | INT64, FLOAT64, NUMERIC |
I get this error when I try `pip install arize[MimicExplainer]`: `no matches found: arize[MimicExplainer]`
Some shells (such as zsh) interpret square brackets specially. To get around this, escape the brackets with a backslash:
```bash
pip install arize\[MimicExplainer\]
```
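Alternatively, quoting the package specifier also stops zsh from expanding the brackets: `pip install 'arize[MimicExplainer]'`.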
I get this error when pip is installing LightGBM (a dependency of the MimicExplainer):
```
INFO:LightGBM:Starting to compile with CMake.
...
FileNotFoundError: [Errno 2] No such file or directory: 'cmake'
```
or
```
subprocess.CalledProcessError: Command '['make', '_lightgbm', ...]' returned non-zero exit status 2.
```
This is because pip is attempting to compile `lib_lightgbm`, the C library for LightGBM. The compile process needs `cmake` as well as OpenMP. To install these on macOS, try this first:
```bash
brew install cmake libomp
```
(with explicit feature and tag column prefixes)

prediction_id | prediction_ts | user_id | tag/zone | feature/metropolitan_area | industry | prediction_score | actual_score | prediction_label | actual_label |
---|---|---|---|---|---|---|---|---|---|
1fcd50f4689 | 1637538845 | 82256 | us-east-1 | 1PA | engineering | 0.07773696 | 0 | No Claims | No Claims |

(without explicit feature column prefixes for implicit ingestion)

prediction_id | prediction_ts | user_id | tag/zone | metropolitan_area | industry | prediction_score | actual_score | prediction_label | actual_label |
---|---|---|---|---|---|---|---|---|---|
1fcd50f4689 | 1637538845 | 82256 | us-east-1 | 1PA | engineering | 0.07773696 | 0 | No Claims | No Claims |

```python
{
    "prediction_id": "prediction_id",
    "timestamp": "prediction_ts",
    "features": "feature/",  # omit this row and the feature/ column prefix for implicit ingestion (must pick explicit or implicit)
    "prediction_score": "prediction_score",
    "prediction_label": "prediction_label",
    "actual_score": "actual_score",
    "actual_label": "actual_label",
    "tags": "tag/",  # requires explicit column declaration
    "shap_values": "shap/",  # requires explicit column declaration
    "exclude": ["user_id"]
}
```
When you log prediction data into Arize using the Python SDK, there are two ways to check the status, depending on which mode is chosen.
Batch: The pandas logger, `arize.pandas`, returns a response, so you can check the status with `response.status_code`.
Real-time: The `arize.log` method returns a future. See the example:

```python
import concurrent.futures as cf

def arize_responses_helper(responses):
    """
    responses: a list of responses from Arize
    returns: None
    """
    for response in cf.as_completed(responses):
        res = response.result()
        if res.status_code != 200:
            raise ValueError(f'failed with code {res.status_code}, {res.text}')

# Logging to Arize returns a list of responses
responses = arize.log(...)  # your log call

# Check responses!
arize_responses_helper(responses)
```
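For batch mode, a minimal sketch of checking the response, assuming the `client`, dataframe, and `schema` objects from the delayed-actuals sketch earlier in this FAQ:

```python
# Assumes `client`, `df`, `schema`, ModelTypes, and Environments
# are defined as in the delayed-actuals sketch above
response = client.log(
    dataframe=df,
    model_id="your-model-id",
    model_type=ModelTypes.SCORE_CATEGORICAL,
    environment=Environments.PRODUCTION,
    schema=schema,
)

if response.status_code != 200:
    print(f"logging failed with code {response.status_code}: {response.text}")
```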
After receiving a `200` response code, head over to your model's Data Ingestion tab to confirm that Arize has received your data. The model's inferences are indexed by the received timestamp, NOT the timestamp of the inferences.

They are treated as separate observations. This means that two predictions sent with the same prediction ID would count as two predictions in the platform. If an `actual` was sent for both predictions, they would show up as two separate predictions, each with a corresponding matching actual.
We currently support the following data types for the corresponding columns.
Column Type | Supported Data Types |
---|---|
Features | int, float, str, bool |
Prediction ID | int, str |
Prediction Timestamps | int, float, date, datetime |
Prediction Score | int, float |
Actual Score | int, float |
SHAP values | int, float |
Supported data types for Prediction and Actual labels and scores depend on the model types.
Column Type | Score Categorical | Numeric |
---|---|---|
Prediction Label | str | int, float |
Actual Label | str | int, float |
Prediction Score | int, float | NA |
Actual Score | int, float | NA |
Actual Numeric Sequence | List[int, float] | NA |
Arize requires at least one prediction, actual, or feature importance column; if all are missing, Arize will reject the dataset.
Arize generally accepts null values within prediction and actual columns. In order to successfully ingest null values while still constructing valid records in Arize, each row must contain at least one non-null prediction, actual, or feature importance value.
For example, in the case of a regression model, it's common for actuals (ground truth) to miss values. In this case, the actuals column will include null values and non-null values. For the rows where an actual is null, a prediction value or feature importance value is required to construct a valid record.
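For instance, a dataframe like this hypothetical one is valid because every row has a non-null prediction, even where the actual is still missing:

```python
import pandas as pd

# Every row has a non-null prediction, so rows with missing
# actuals (NaN) still form valid records.
df = pd.DataFrame(
    {
        "prediction_id": ["a1", "a2", "a3"],
        "prediction_score": [0.42, 0.87, 0.13],
        "actual_score": [0.40, float("nan"), float("nan")],
    }
)
```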
They are accepted and treated as empty.
If you include a `prediction_id` column with null values within the column, Arize will reject the column.