
Cloud Storage File Imports

Automatically sync model inferences to your cloud storage provider

Cloud Storage Overview

Arize syncs with cloud storage providers to continuously track and transcribe model inference data as Arize model records.

File Import Jobs

Create a file import job to load model inferences from your cloud storage bucket to Arize. File import jobs can be set up based on file location, file schema, and basic model configurations.
Arize checks for new files regularly. If changes are detected, your new files are automatically ingested into Arize.

Supported File Types

Arize supports CSV, Parquet, Avro, and Apache Arrow files for ingesting model inferences. Your model schema varies based on your model type. Learn more about model types here.
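As a quick illustration, the sketch below writes the same inference records out as CSV and Parquet with pandas. The column names are placeholders and should be adjusted to match your own model schema.

```python
import pandas as pd

# Illustrative inference records; column names should match your model schema.
df = pd.DataFrame({
    "prediction_id": ["1fcd50f4689"],
    "prediction_ts": [1637538845],  # epoch seconds
    "prediction_label": ["No Claims"],
    "prediction_score": [0.07773696],
})

df.to_csv("12-1-2022.csv", index=False)          # CSV
df.to_parquet("12-1-2022.parquet", index=False)  # Parquet (requires pyarrow)
```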

Dataset Structure

A single model can map to multiple import jobs, for example one per environment or bucket location. To easily import model inferences, configure each job with a distinct bucket location and set the correct model and environment.

Directory Structure

There are multiple ways to structure model data. To easily ingest model inference data from storage, adopt a standardized directory structure across all models.
Find an example schema import and schema definitions here.

Option 1: Schema Parameters Captured Within Your File

This example stores model predictions and actuals in separate files, which helps in cases of delayed actuals. Learn more here.
Capture parameters such as version, batch_id, and features inside your files so this information is automatically reflected in your model history. If your model version is expected to change often, we recommend sending the version data with the model data in the file.
s3://click-thru-rate/
├── prediction-folder/
│ ├── 12-1-2022.parquet #this file can contain multiple versions
│ ├── 12-2-2022.parquet
│ ├── 12-3-2022.parquet
├── actuals-folder/
│ ├── 12-1-2022.parquet
│ ├── 12-2-2022.parquet
│ └── 12-3-2022.parquet
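A minimal sketch of Option 1, assuming pandas and local folders as stand-ins for the bucket prefixes: predictions and actuals are written to separate files that share prediction_id values, and the version travels as an in-file column.

```python
import os
import pandas as pd

os.makedirs("prediction-folder", exist_ok=True)
os.makedirs("actuals-folder", exist_ok=True)

# Prediction file: the version column travels with the data, so one file
# can contain multiple model versions.
predictions = pd.DataFrame({
    "prediction_id": ["a1", "a2"],
    "prediction_ts": [1669852800, 1669852860],
    "version": ["v1.0", "v1.1"],
    "prediction_label": ["No Claims", "Claims"],
})
predictions.to_parquet("prediction-folder/12-1-2022.parquet", index=False)

# Actuals file: the shared prediction_id values link these rows back to
# the predictions above.
actuals = pd.DataFrame({
    "prediction_id": ["a1", "a2"],
    "actual_label": ["No Claims", "No Claims"],
})
actuals.to_parquet("actuals-folder/12-1-2022.parquet", index=False)
```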

Option 2: Schema Parameters Captured In Directory

In this example, the predictions and actuals are stored in the same file, so they are ingested simultaneously. This example also separates model environments (production and training).
s3://click-thru-rate/
├── production-folder/
│ ├── 12-1-2022.parquet
│ ├── 12-2-2022.parquet
│ ├── 12-3-2022.parquet
├── training-folder/
│ ├── 12-1-2022.parquet
│ ├── 12-2-2022.parquet
│ └── 12-3-2022.parquet
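A similar sketch for Option 2: each row carries both the prediction and its actual, and the environment is encoded by the directory the file lands in (paths and values are illustrative).

```python
import os
import pandas as pd

# One file per environment; each row holds both the prediction and its actual.
combined = pd.DataFrame({
    "prediction_id": ["b1"],
    "prediction_ts": [1669852800],
    "prediction_label": ["No Claims"],
    "actual_label": ["No Claims"],
})

os.makedirs("production-folder", exist_ok=True)
combined.to_parquet("production-folder/12-1-2022.parquet", index=False)
```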

Delayed Actuals

If your model receives delayed actuals, upload different files for predictions versus actuals using the same prediction ID, which links your files together in the Arize platform. Arize will check in the same bucket prefix for both predictions and actuals and ingest them separately as they become available.
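For example, an actuals-only file written days after the original predictions needs nothing more than the prediction IDs and the actual columns (a hedged sketch; the path mirrors the Option 1 layout above):

```python
import pandas as pd

# Late-arriving ground truth: only prediction_id (matching the earlier
# prediction files) and the actual columns are needed.
late_actuals = pd.DataFrame({
    "prediction_id": ["a1", "a2"],
    "actual_label": ["Claims", "No Claims"],
})
late_actuals.to_parquet("actuals-folder/12-5-2022.parquet", index=False)
```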

File Import Model Schema

Model schema parameters are a way of organizing model inference data for ingestion into Arize. When configuring your schema, match your file column headers to the model schema; you can specify the column mapping either through a form or a simple JSON-based schema.

Example Row

(with explicit feature and tag column prefixes)

| prediction_id | prediction_ts | user_id | tags/zone | feature/metropolitan_area | industry | prediction_score | actual_score | prediction_label | actual_label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1fcd50f4689 | 1637538845 | 82256 | us-east-1 | 1PA | engineering | 0.07773696 | 0 | No Claims | No Claims |
(without explicit feature column prefixes for implicit ingestion)

| prediction_id | prediction_ts | user_id | tags/zone | metropolitan_area | industry | prediction_score | actual_score | prediction_label | actual_label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1fcd50f4689 | 1637538845 | 82256 | us-east-1 | 1PA | engineering | 0.07773696 | 0 | No Claims | No Claims |

Example Schema

{
  "prediction_id": "prediction_id",
  "timestamp": "prediction_ts",
  "features": "feature/",   # omit this row and the feature/ column prefix for implicit ingestion (pick explicit or implicit, not both)
  "prediction_score": "prediction_score",
  "prediction_label": "prediction_label",
  "actual_score": "actual_score",
  "actual_label": "actual_label",
  "tags": "tag/",           # requires explicit column declaration
  "shap_values": "shap/",   # requires explicit column declaration
  "exclude": ["user_id"]
}
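As a pre-flight sanity check (not an Arize API, just a local sketch mirroring the schema above), you can verify that a file's column headers cover the mapped columns before creating the job:

```python
import pandas as pd

# Column mapping copied from the example schema above (explicit prefixes).
schema = {
    "prediction_id": "prediction_id",
    "timestamp": "prediction_ts",
    "prediction_score": "prediction_score",
    "prediction_label": "prediction_label",
    "actual_score": "actual_score",
    "actual_label": "actual_label",
}

df = pd.read_parquet("your-file.parquet")  # placeholder path

# Every mapped column must exist in the file's headers.
missing = [col for col in schema.values() if col not in df.columns]
if missing:
    raise ValueError(f"File is missing mapped columns: {missing}")

# With explicit prefixing, feature columns are identified by the feature/ prefix.
feature_cols = [c for c in df.columns if c.startswith("feature/")]
print(f"{len(feature_cols)} feature column(s) detected")
```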
Learn how to configure an embedding in the UI here.

Schema Parameters

| Property | Description | Required |
| --- | --- | --- |
| prediction_id | The unique identifier of a specific prediction | Required |
| timestamp | The timestamp of the prediction, in seconds or as an RFC3339 timestamp | Optional; defaults to the current timestamp at file ingestion time |
| prediction_label | Column name for the prediction value | Required |
| prediction_score | Column name for the predicted score | Required based on model type |
| actual_label | Column name for the actual or ground truth value | Optional for production records |
| relevance_label | Column name for the ranking actual or ground truth value | Required for ranking models |
| actual_score | Column name for the ground truth score | Required based on model type |
| relevance_score | Column name for the ranking ground truth score | Required for ranking models |
| features | A string prefix (feature/) that marks a column as a feature. Features must be sent in the same file as predictions | Optional; Arize automatically infers unprefixed columns as features. Choose between feature prefixing OR inferred features |
| tags | A string prefix (tag/) that marks a column as a tag. Tags must be sent in the same file as predictions and features | Optional |
| shap_values | A string prefix (shap/) that marks a column as SHAP values. SHAP values must be sent in the same file as predictions or with a matching prediction_id | Optional |
| version | A column that specifies the model version. A version/ prefix assigns a version to the corresponding data within a column; alternatively, configure the version in the UI | Optional; defaults to 'no_version' |
| batch_id | Distinguishes different batches of data under the same model_id and model_version. Must be specified as a constant during job setup or in the schema | Optional; for validation records only |
| exclude | A list of columns to exclude if the features property is not included in the ingestion schema | Optional |
| embedding_features | A list of embedding columns: a required vector column, an optional raw data column, and an optional link-to-data column. Learn more here | Optional |

Storage Authorization

Jobs are created and managed on a per-space basis, so each job is localized within a workspace. This enables administrators to control model inference access as needed.
For enterprise users, RBAC (role-based access control) governs write access (file import job access) to models within a workspace: only users with write access can create new jobs and ingest new models into Arize.
Learn how to set up jobs for each cloud storage provider here.

Programmatic Job Creation

Our public-facing GraphQL API provides a direct path between your cloud storage and the Arize platform to easily set up jobs and automate job creation. The guide below will explain how to use the File Importer API for programmatic job creation.
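As a hedged sketch of what a programmatic call might look like using Python and requests: the endpoint, mutation name, and input fields below are assumptions for illustration only, so confirm them against the File Importer API guide before use.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; generate per the API guide

# Assumed mutation shape for illustration; check the File Importer API
# reference for the exact mutation name and input fields.
mutation = """
mutation ($input: CreateFileImportJobInput!) {
  createFileImportJob(input: $input) {
    fileImportJob { id }
  }
}
"""

variables = {
    "input": {
        "spaceId": "YOUR_SPACE_ID",        # placeholder
        "modelName": "click-thru-rate",
        "modelEnvironmentName": "production",
        "blobStore": "S3",                 # assumed enum value
        "bucketName": "click-thru-rate",
        "prefix": "prediction-folder/",
    }
}

resp = requests.post(
    "https://app.arize.com/graphql",
    json={"query": mutation, "variables": variables},
    headers={"x-api-key": API_KEY},
)
resp.raise_for_status()
print(resp.json())
```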
Questions? Email us at [email protected] or Slack us in the #arize-support channel.