Cloud Storage File Imports
Automatically sync model inferences to your cloud storage provider
Arize syncs with cloud storage providers to continuously track and ingest model inference data as Arize model records.
Create a file import job to load model inferences from your cloud storage bucket to Arize. File import jobs can be set up based on file location, file schema, and basic model configurations.
Arize checks for new files regularly. If changes are detected, your new files are automatically ingested into Arize.

Arize supports CSV, Parquet, Avro, and Apache Arrow file formats for ingesting model inferences. Your model schema varies based on your model type. Learn more about model types here.
A single model can map to multiple import jobs depending on how its data is laid out. To easily import model inferences, configure each job with a distinct bucket location and the correct model and environment.
There are multiple ways to structure model data. To easily ingest model inference data from storage, adopt a standardized directory structure across all models.
The example below stores model predictions and actuals in separate files, which helps in cases of delayed actuals. Learn more here.
Capture parameters such as `version`, `batch_id`, and features inside of your files to automatically translate this information into your model history. If your model version is expected to change often, we recommend sending the version data with the model data in the file.

```
s3://click-thru-rate/
├── prediction-folder/
│   ├── 12-1-2022.parquet   # this file can contain multiple versions
│   ├── 12-2-2022.parquet
│   └── 12-3-2022.parquet
└── actuals-folder/
    ├── 12-1-2022.parquet
    ├── 12-2-2022.parquet
    └── 12-3-2022.parquet
```
In the next example, predictions and actuals are stored in the same file, so they are ingested simultaneously. This example also separates model environments (production and training):
```
s3://click-thru-rate/
├── production-folder/
│   ├── 12-1-2022.parquet
│   ├── 12-2-2022.parquet
│   └── 12-3-2022.parquet
└── training-folder/
    ├── 12-1-2022.parquet
    ├── 12-2-2022.parquet
    └── 12-3-2022.parquet
```
If your model receives delayed actuals, upload separate files for predictions and actuals using the same prediction ID, which links your files together in the Arize platform. Arize checks the same bucket prefix for both predictions and actuals and ingests them separately as they become available.
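A minimal sketch of this pattern, assuming pandas with pyarrow installed (file and column names are illustrative and mirror the directory examples above):

```python
import pandas as pd

# Predictions are written as soon as they are made.
predictions = pd.DataFrame({
    "prediction_id": ["1fcd50f4689", "2ab341c7790"],
    "prediction_ts": [1637538845, 1637538901],
    "feature/metropolitan_area": ["1PA", "2SF"],
    "prediction_label": ["No Claims", "Claims"],
})
predictions.to_parquet("12-1-2022.parquet")
# upload to s3://click-thru-rate/prediction-folder/

# Actuals arrive later in their own file; the shared prediction_id
# lets Arize link each actual back to its prediction.
actuals = pd.DataFrame({
    "prediction_id": ["1fcd50f4689", "2ab341c7790"],
    "actual_label": ["No Claims", "Claims"],
})
actuals.to_parquet("12-1-2022-actuals.parquet")
# upload to s3://click-thru-rate/actuals-folder/
```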
Model schema parameters organize how your model inference data is ingested into Arize. When configuring your schema, be sure to match your file column headers with the model schema. You can specify the column mapping either through a form or with a simple JSON-based schema.
With explicit feature and tag column prefixes:

| prediction_id | prediction_ts | user_id | tag/zone | feature/metropolitan_area | feature/industry | prediction_score | actual_score | prediction_label | actual_label |
|---|---|---|---|---|---|---|---|---|---|
| 1fcd50f4689 | 1637538845 | 82256 | us-east-1 | 1PA | engineering | 0.07773696 | 0 | No Claims | No Claims |

Without explicit feature column prefixes, for implicit ingestion:

| prediction_id | prediction_ts | user_id | tag/zone | metropolitan_area | industry | prediction_score | actual_score | prediction_label | actual_label |
|---|---|---|---|---|---|---|---|---|---|
| 1fcd50f4689 | 1637538845 | 82256 | us-east-1 | 1PA | engineering | 0.07773696 | 0 | No Claims | No Claims |
```
{
  "prediction_id": "prediction_id",
  "timestamp": "prediction_ts",
  "features": "feature/",  # omit this row and the feature/ column prefix for implicit ingestion (pick explicit or implicit)
  "prediction_score": "prediction_score",
  "prediction_label": "prediction_label",
  "actual_score": "actual_score",
  "actual_label": "actual_label",
  "tags": "tag/",  # requires explicit column declaration
  "shap_values": "shap/",  # requires explicit column declaration
  "exclude": ["user_id"]
}
```
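For reference, a file matching the explicit-prefix schema above could be produced like this (a minimal sketch, assuming pandas with pyarrow installed; the values mirror the example row above):

```python
import pandas as pd

# Column headers match the explicit-prefix schema, so the JSON mapping
# above applies without changes; user_id is listed under "exclude".
df = pd.DataFrame({
    "prediction_id": ["1fcd50f4689"],
    "prediction_ts": [1637538845],
    "user_id": [82256],
    "tag/zone": ["us-east-1"],
    "feature/metropolitan_area": ["1PA"],
    "feature/industry": ["engineering"],
    "prediction_score": [0.07773696],
    "prediction_label": ["No Claims"],
    "actual_score": [0],
    "actual_label": ["No Claims"],
})
df.to_parquet("12-1-2022.parquet")  # then upload to your bucket prefix
```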
| Property | Description | Required |
|---|---|---|
| prediction_id | The unique identifier of a specific prediction | Required |
| timestamp | The timestamp of the prediction in seconds or an RFC3339 timestamp | Optional; defaults to the current timestamp at file ingestion time |
| prediction_label | Column name for the prediction value | Required |
| prediction_score | Column name for the predicted score | Optional |
| actual_label | Column name for the actual or ground truth value | Optional for production records |
| relevance_label | Column name for the ranking actual or ground truth value | Optional |
| actual_score | Column name for the ground truth score | Optional |
| relevance_score | Column name for the ranking ground truth score | Optional |
| features | A string prefix (e.g. `feature/`) that identifies feature columns. Features must be sent in the same file as predictions | Optional; Arize automatically infers unprefixed columns as features. Choose between explicit feature prefixing OR inferred features |
| tags | A string prefix (e.g. `tag/`) that identifies tag columns. Tags must be sent in the same file as predictions and features | Optional |
| shap_values | A string prefix (e.g. `shap/`) that identifies SHAP value columns. SHAP values must be sent in the same file as predictions or with a matching prediction_id | Optional |
| version | A column to specify the model version; assigns a version to the corresponding data within a column, or configure your version within the UI | Optional; defaults to `no_version` |
| batch_id | Distinguishes different batches of data under the same model_id and model_version. Must be specified as a constant during job setup or in the schema | Optional; for validation records only |
| exclude | A list of columns to exclude if the features property is not included in the ingestion schema | Optional |
| embedding_features | A list of embedding columns: a required vector column, an optional raw data column, and an optional link-to-data column. Learn more here | Optional |
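As a concrete illustration, a schema that also pins a version column and a constant batch ID might look like the following (a hedged sketch; the `model_version` column name and batch value are assumptions, not a verified example):

```python
# Hypothetical schema mapping using the optional version and batch_id
# properties from the table above; column names are illustrative.
schema = {
    "prediction_id": "prediction_id",
    "timestamp": "prediction_ts",
    "prediction_label": "prediction_label",
    "actual_label": "actual_label",
    "features": "feature/",
    "version": "model_version",      # column holding the version for each row
    "batch_id": "batch-2022-12-01",  # constant batch ID for validation records
}
```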
Jobs are created and managed on a per-space basis, so jobs are localized within a workspace. This enables administrators to control model inference access as needed.
For enterprise users, RBAC (role-based access control) governs write access (file import job access) to models within a workspace, so users with write access can create new jobs and ingest new models into Arize.
Our public-facing GraphQL API provides a direct path between your cloud storage and the Arize platform, making it easy to set up jobs and automate job creation. The guide below explains how to use the File Importer API for programmatic job creation.
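For example, a job-creation request might look like the following (a hedged sketch using Python's requests library; `createFileImportJob` and its input fields are illustrative assumptions here, so consult the File Importer API guide for the exact mutation shape):

```python
import requests

API_KEY = "YOUR_DEVELOPER_API_KEY"  # hypothetical placeholder

# Illustrative mutation; field names are assumptions, see the API guide.
MUTATION = """
mutation {
  createFileImportJob(input: {
    spaceId: "YOUR_SPACE_ID",
    modelName: "click-thru-rate",
    blobStore: S3,
    bucketName: "click-thru-rate",
    prefix: "prediction-folder/"
  }) {
    fileImportJob { jobId }
  }
}
"""

response = requests.post(
    "https://app.arize.com/graphql",
    headers={"x-api-key": API_KEY},
    json={"query": MUTATION},
)
print(response.json())
```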