
Supported File Types

Supported file formats and the required fields
Arize supports various file formats for ingesting model inferences. Each record in your files must capture fields that represent a model inference in the Arize platform.
CSV
Parquet
Avro
Apache Arrow

Logging Predictions, Actuals, Tags, and Shap Values together

Example File

Below is an example CSV file that contains all the necessary data to log inferences to Arize. Notice that the column headers for the SHAP values and tags carry the shap/ and tag/ prefixes so those parts of the model record are easy to identify. Any column not already reserved in the schema declaration is automatically discovered as a feature. Columns can also be excluded from automatic feature discovery by listing them in the schema's exclusion list.
In this example, each record contains both the prediction and the actual.
Example CSV Prediction + Actual

Example Schema

The above CSV has both predictions and actuals in the same file.
The following columns will be automatically discovered as features: metropolitan_area, industry, state, established_business, experienced_business, safe_business, owner_gender, payroll_bracket, avg_payroll_bracket, risk_b_2, risk_bracket.
The column user_id will not be ingested into the Arize system because it is explicitly excluded.
The following schema supports the above example:
{
  "prediction_id": "prediction_id",
  "timestamp": "prediction_ts",
  "prediction_score": "prediction_score",
  "prediction_label": "prediction_label",
  "actual_score": "actual_score",
  "actual_label": "actual_label",
  "tags": "tag/",
  "shap_values": "shap/",
  "exclude": ["user_id"]
}
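
To make the layout concrete, here is a minimal sketch of assembling a file with this shape using pandas. The values, the specific tag/ and shap/ column names, and the output file name are illustrative assumptions, not the exact example file.

# A minimal sketch: build the example layout with pandas and write it as CSV.
import pandas as pd

df = pd.DataFrame({
    "prediction_id": ["pred_001", "pred_002"],
    "prediction_ts": [1650124500, 1650124560],   # seconds since the unix epoch
    "prediction_score": [0.82, 0.31],
    "prediction_label": ["high_risk", "low_risk"],
    "actual_score": [1.0, 0.0],
    "actual_label": ["high_risk", "low_risk"],
    "tag/region": ["us-west", "us-east"],        # matched by "tags": "tag/"
    "shap/state": [0.12, -0.04],                 # matched by "shap_values": "shap/"
    "state": ["CA", "NY"],                       # auto-discovered feature
    "industry": ["retail", "services"],          # auto-discovered feature
    "user_id": ["u_123", "u_456"],               # dropped via the "exclude" list
})
df.to_csv("predictions_and_actuals.csv", index=False)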

Importing predictions and actuals separately

Example Files

The Arize platform allows for uploading different files for predictions versus actuals. The prediction ID links the files together.
Predictions Only
The above table contains the feature and prediction data. The actuals arrive in a separate file that can be processed later.
Actuals Only
The above file can be delivered days or weeks after the prediction file is processed. The prediction ID will join the data together.
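
As a rough sketch (column names and values assumed for illustration), the two files could be produced like this, sharing the same prediction_id values so Arize can join them later:

import pandas as pd

# Predictions file, sent first.
predictions = pd.DataFrame({
    "prediction_id": ["pred_001", "pred_002"],
    "prediction_ts": [1650124500, 1650124560],
    "prediction_label": ["high_risk", "low_risk"],
    "state": ["CA", "NY"],                       # auto-discovered feature
})
predictions.to_csv("predictions.csv", index=False)

# Actuals file, delivered days or weeks later; joined on prediction_id.
actuals = pd.DataFrame({
    "prediction_id": ["pred_001", "pred_002"],
    "actual_label": ["high_risk", "high_risk"],
})
actuals.to_csv("actuals.csv", index=False)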

Explicit Declaration of Features

Example File

Below is an example CSV file similar to the prior two examples, but it uses special notation to explicitly mark columns as features rather than relying on automatic feature discovery. This follows the same convention as the shap and tag prefixes.
In this example, each record contains both the prediction and the actual.
Example Predictions + Actuals with Labeled Features

Example Schema

The above CSV has both predictions and actuals in the same file.
The column user_id will not be ingested into the Arize system because it is not prefixed with "feature/".
The following schema supports the above example:
{
  "prediction_id": "prediction_id",
  "timestamp": "prediction_ts",
  "features": "feature/",
  "prediction_score": "prediction_score",
  "prediction_label": "prediction_label",
  "actual_score": "actual_score",
  "actual_label": "actual_label",
  "tags": "tag/",
  "shap_values": "shap/"
}
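
A minimal sketch of the same idea with pandas, where only columns carrying the feature/ prefix are ingested as features (the values and non-reserved column names are assumptions):

import pandas as pd

df = pd.DataFrame({
    "prediction_id": ["pred_001"],
    "prediction_ts": [1650124500],
    "prediction_label": ["high_risk"],
    "actual_label": ["high_risk"],
    "feature/state": ["CA"],          # ingested as feature "state"
    "feature/industry": ["retail"],   # ingested as feature "industry"
    "tag/region": ["us-west"],        # matched by "tags": "tag/"
    "shap/state": [0.12],             # matched by "shap_values": "shap/"
    "user_id": ["u_123"],             # no feature/ prefix, so not ingested
})
df.to_csv("labeled_features.csv", index=False)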

Supported Parquet Data Types

Input Data Field
Parquet Data Type
prediction_id
  • string
  • int8, int16, int32, int64
prediction_label/actual_label
  • string
  • boolean
  • int8, int16, int32, int64
  • float32, float64
prediction_score/actual_score
  • int8, int16, int32, int64
  • float32, float64
timestamp
  • int64, float64 (number of seconds since unix epoch)
  • string in RFC3339 format (e.g. 2022-04-16T16:15:00Z)
  • timestamp
  • date32, date64
features/tags
  • string
  • boolean (converted to string)
  • int8, int16, int32, int64
  • float32, float64
  • decimal128 (rounded to 5 decimal places)
  • date32, date64, timestamp (converted to integer)
shap_values
  • int32, int64
  • float32, float64
embedding_feature:vector
list of {int8|int16|int32|int64|float32|float64}
embedding_feature:raw_data
  • string
  • list of strings
embedding_feature:link_to_data
string
ranking:prediction_group_id
  • string
  • int8, int16, int32, int64
ranking:rank
int8, int16, int32, int64
ranking:category
array of strings
ranking:relevance_score
  • int8, int16, int32, int64
  • float32, float64
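
As a sketch of how these types can be declared in practice, one way is a pyarrow schema that stays within the types listed above; the non-reserved column names here are assumptions for illustration.

import pyarrow as pa

schema = pa.schema([
    ("prediction_id", pa.string()),
    ("prediction_ts", pa.timestamp("s")),          # or int64/float64 epoch seconds
    ("prediction_score", pa.float64()),
    ("prediction_label", pa.string()),
    ("tag/region", pa.string()),
    ("shap/state", pa.float64()),
    ("state", pa.string()),                        # feature (string)
    ("revenue", pa.decimal128(18, 5)),             # feature (decimal128)
    ("embedding_vector", pa.list_(pa.float32())),  # embedding_feature:vector
])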

Logging Predictions, Shap Values, and Actuals together

Example File

Below is an example Parquet file (shown as CSV for readability) that contains all the necessary data to log inferences to Arize. Notice that the column headers for the SHAP values and tags carry the shap/ and tag/ prefixes so those parts of the model record are easy to identify. Any column not already reserved in the schema declaration is automatically discovered as a feature. Columns can also be excluded from automatic feature discovery by listing them in the schema's exclusion list.
In this example, each record contains both the prediction and the actual.
Example Parquet Prediction + Actual

Example Schema

The above is a Parquet file with both predictions and actuals in the same file. The schema supported for the above example is below:
{
  "prediction_id": "prediction_id",
  "timestamp": "prediction_ts",
  "prediction_score": "prediction_score",
  "prediction_label": "prediction_label",
  "actual_score": "actual_score",
  "actual_label": "actual_label",
  "tags": "tag/",
  "shap_values": "shap/",
  "exclude": ["user_id"]
}
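
A minimal sketch of writing an equivalent Parquet file with pandas (which uses pyarrow or fastparquet under the hood); the values and non-reserved column names are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({
    "prediction_id": ["pred_001", "pred_002"],
    "prediction_ts": pd.to_datetime(["2022-04-16T16:15:00Z", "2022-04-16T16:16:00Z"]),
    "prediction_score": [0.82, 0.31],
    "prediction_label": ["high_risk", "low_risk"],
    "actual_score": [1.0, 0.0],
    "actual_label": ["high_risk", "low_risk"],
    "tag/region": ["us-west", "us-east"],
    "shap/state": [0.12, -0.04],
    "state": ["CA", "NY"],            # auto-discovered feature
    "user_id": ["u_123", "u_456"],    # dropped via the "exclude" list
})
df.to_parquet("predictions_and_actuals.parquet", index=False)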

Importing predictions and actuals separately

Example Files

The Arize platform allows for uploading different files for predictions versus actuals. The prediction ID links the files together.
Predictions Only
The above table contains the feature and prediction data. The actuals arrive in a separate file that can be processed later.
Actuals Only
The above file can be delivered days or weeks after the prediction file is processed. The prediction ID will join the data together.

Explicit Declaration of Features

Example File

Below is an example Parquet file similar to the prior two examples, but it uses special notation to explicitly mark columns as features rather than relying on automatic feature discovery. This follows the same convention as the shap and tag prefixes.
In this example, each record contains both the prediction and the actual.
Example Predictions + Actuals with Labeled Features

Example Schema

The above Parquet file has both predictions and actuals in the same file.
The column user_id will not be ingested into the Arize system because it is not prefixed with "feature/".
The following schema supports the above example:
{
  "prediction_id": "prediction_id",
  "timestamp": "prediction_ts",
  "features": "feature/",
  "prediction_score": "prediction_score",
  "prediction_label": "prediction_label",
  "actual_score": "actual_score",
  "actual_label": "actual_label",
  "tags": "tag/",
  "shap_values": "shap/"
}

Supported Avro Data Types

Input Data Field
Avro Data Type
prediction_id
long, int, string
prediction_label/actual_label
  • string
  • boolean, int, long, float, double, enum (will be converted to string)
prediction_score/actual_score
int, long, float, double
timestamp
  • long, double (number of seconds since unix epoch)
  • string in RFC3339 format (e.g. 2022-04-16T16:15:00Z)
  • timestamp logical type
  • date logical type
features/tags
  • string
  • enum, boolean (will be converted to string)
  • int, long
  • float, double
shap_values
int, long, float, double
embedding_feature:vector
array of {int|long|float|double}
embedding_feature:raw_data
  • string
  • array of strings
embedding_feature:link_to_data
string
ranking:prediction_group_id
long, int, string
ranking:rank
int, long
ranking:category
array of strings
ranking:relevance_score
int, long, float, double

Sample Schema

The schema definition for creating an import job for Avro files is identical to the one for CSV/Parquet files; please refer to the previous tabs for reference.
Please note that we rely on the Avro schema embedded in the header of the Avro Object Container File (OCF) to decode the data and match it to the fields specified in the Arize file importer schema. Each field name in the OCF file must
  • start with [A-Za-z_]
  • subsequently contain only [A-Za-z0-9_]
Below is an example of an Avro schema and the corresponding Arize file importer schema.
// Avro File Schema
{
  "type": "record",
  "name": "sample_avro_record",
  "fields": [
    { "name": "prediction_id", "type": "string" },
    { "name": "features_temperature", "type": "int" },
    { "name": "features_zip", "type": "string" },
    { "name": "tags_location", "type": "string" },
    { "name": "tags_day", "type": "int" },
    { "name": "predicted_weather", "type": "string" },
    { "name": "confidence", "type": "double" },
    { "name": "actual_weather", "type": "string" },
    { "name": "features_indoor_sensor", "type": "boolean" },
    { "name": "prediction_ts", "type": "long", "logicalType": "timestamp-micros" },
    { "name": "shaps_zip", "type": "double" },
    { "name": "actual_severity", "type": "long" }
  ]
}
// File Importer Schema
{
  "prediction_id": "prediction_id",
  "timestamp": "prediction_ts",
  "features": "features_.*",
  "prediction_score": "confidence",
  "prediction_label": "predicted_weather",
  "actual_score": "actual_severity",
  "actual_label": "actual_weather",
  "tags": "tags_.*",
  "shap_values": "shaps_.*"
}
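
A minimal sketch of producing a matching Avro Object Container File with the fastavro library; the record values are illustrative assumptions.

from fastavro import parse_schema, writer

avro_schema = parse_schema({
    "type": "record",
    "name": "sample_avro_record",
    "fields": [
        {"name": "prediction_id", "type": "string"},
        {"name": "features_temperature", "type": "int"},
        {"name": "features_zip", "type": "string"},
        {"name": "tags_location", "type": "string"},
        {"name": "tags_day", "type": "int"},
        {"name": "predicted_weather", "type": "string"},
        {"name": "confidence", "type": "double"},
        {"name": "actual_weather", "type": "string"},
        {"name": "features_indoor_sensor", "type": "boolean"},
        {"name": "prediction_ts", "type": "long", "logicalType": "timestamp-micros"},
        {"name": "shaps_zip", "type": "double"},
        {"name": "actual_severity", "type": "long"},
    ],
})

# One example record; values are made up for illustration.
records = [{
    "prediction_id": "pred_001",
    "features_temperature": 72,
    "features_zip": "94105",
    "tags_location": "sf",
    "tags_day": 3,
    "predicted_weather": "sunny",
    "confidence": 0.91,
    "actual_weather": "sunny",
    "features_indoor_sensor": False,
    "prediction_ts": 1650124500000000,   # microseconds since the unix epoch
    "shaps_zip": 0.07,
    "actual_severity": 1,
}]

with open("sample.avro", "wb") as out:
    writer(out, avro_schema, records)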
This file format is currently only supported for our design partners. For early access, please contact [email protected]
This example shows what an Arrow file's columns and schema file would look like.
The "*" can be used to add features to a file without changing the schema.
Column Name in File -> Arize Schema
my-prediction-ts -> prediction_timestamp
my-prediction-id-customer -> prediction_id
my-prediction-score -> prediction_score
my-prediction-label -> prediction_label
my-feature.addr_state -> features
my-feature.revenue -> features
my-environment -> environment
my-actual-label -> actual_label
The above table is an example of columns in a file and how they map to the Arize internal dimension types.
Note that the "my-feature" prefix covers multiple feature columns.
model.json
ModelSchema:
  prediction_timestamp: "my-prediction-ts"
  prediction_id: "my-prediction-id-customer"
  prediction_score: "my-prediction-score"
  prediction_label: "my-prediction-label"
  features: "feature.*" # describes the path to the feature columns above, containing "addr_state" and "revenue"
Questions? Email us at [email protected] or Slack us in the #arize-support channel