
AWS S3

Set up an import job to ingest data into Arize from AWS S3
If you prefer to use Terraform, jump to Applying Bucket Policy & Tag via Terraform
Set up an import job to log inference files to Arize. Arize checks for file updates every 10 seconds. Files of a few hundred thousand to a million rows each generally work best, and the maximum file size is 1GB.

Select Amazon S3

Navigate to the 'Upload Data' page on the left navigation bar in the Arize platform. From there, select the 'AWS S3' card to begin a new file import job.
Step 1: Select Amazon S3

Enable Access To Individual or Multiple Buckets

There are two ways to set up access permissions with Arize:
Configure An Individual Bucket Policy
Configure Multiple Buckets Via Role Based Permissions

Add File Path

Fill in the file path where you would like Arize to pull your model's inferences. Arize will automatically infer your bucket name and prefix.
Example File Path In Arize UI
Create the path to your bucket and folder to pull your model's inferences. In this example, you might have an AWS bucket and folder named s3://example-demo-bucket/click-thru-rate/production/v1/ that contains parquet files of your model inferences. Your bucket name is example-demo-bucket and your prefix is click-thru-rate/production/v1/.
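The split between bucket name and prefix described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Arize's own parsing code; the helper name split_s3_uri is hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The parsed path starts with "/", which is not part of the key prefix.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://example-demo-bucket/click-thru-rate/production/v1/")
print(bucket)  # example-demo-bucket
print(prefix)  # click-thru-rate/production/v1/
```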
The file structure can reflect your model environments (training, production, etc.) and where ground truth is stored.
Example 1: Predictions & Actuals Stored in Separate Folders (different prefixes)
This example stores model predictions and actuals in separate files, which helps when actuals are delayed. Learn more here.
s3://example-demo-bucket/click-thru-rate/production/
├── prediction-folder/
│   ├── 12-1-2022.parquet  # this file can contain multiple versions
│   ├── 12-2-2022.parquet
│   └── 12-3-2022.parquet
└── actuals-folder/
    ├── 12-1-2022.parquet
    ├── 12-2-2022.parquet
    └── 12-3-2022.parquet
Example 2: Production & Training Stored in Separate Folders
This example separates model environments (production and training).
s3://bucket-1/click-thru-rate/v1/
├── production-folder/
│   ├── 12-1-2022.parquet
│   ├── 12-2-2022.parquet
│   └── 12-3-2022.parquet
└── training-folder/
    ├── 12-1-2022.parquet
    ├── 12-2-2022.parquet
    └── 12-3-2022.parquet
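If the jobs that write these files are scripted, a small helper can keep the folder layout consistent across predictions, actuals, and environments. The sketch below assumes the month-day-year file naming used in the examples above; the function name daily_key is hypothetical, and Arize does not require this naming scheme.

```python
from datetime import date


def daily_key(prefix: str, folder: str, day: date) -> str:
    """Build an object key such as <prefix>/<folder>/12-1-2022.parquet."""
    filename = f"{day.month}-{day.day}-{day.year}.parquet"
    return f"{prefix.strip('/')}/{folder.strip('/')}/{filename}"


day = date(2022, 12, 1)
print(daily_key("click-thru-rate/production", "prediction-folder", day))
print(daily_key("click-thru-rate/production", "actuals-folder", day))
```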

Configure Ingestion Key

Tag your bucket with the key arize-ingestion-key and the provided tag value (i.e. AWS Object Tags).
In Arize UI: Copy arize-ingestion-key value
Example Bucket Tag In Arize UI
In AWS Console: Navigate to your S3 bucket -> Properties -> Edit Bucket Tags
Navigate to your bucket properties tab
In AWS Console: Set the tag Key to arize-ingestion-key and the Value to the value copied from the Arize UI in the previous step
Add arize-ingestion-key as a bucket tag
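The same tag can be applied from a script instead of the console. The sketch below assumes boto3 is installed and configured with credentials that are allowed to read and write bucket tags; the function names are hypothetical. Because S3's PutBucketTagging call replaces the whole tag set, the helper merges the new tag into any existing tags first.

```python
def with_ingestion_tag(existing_tags: list[dict], value: str) -> dict:
    """Return a Tagging payload that keeps existing tags and sets arize-ingestion-key."""
    key = "arize-ingestion-key"
    kept = [tag for tag in existing_tags if tag["Key"] != key]
    return {"TagSet": kept + [{"Key": key, "Value": value}]}


def tag_bucket(bucket: str, value: str) -> None:
    """Apply the tag with boto3 (assumed to be installed and configured)."""
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    try:
        existing = s3.get_bucket_tagging(Bucket=bucket)["TagSet"]
    except ClientError as err:
        # S3 reports NoSuchTagSet when the bucket has no tags yet.
        if err.response["Error"]["Code"] != "NoSuchTagSet":
            raise
        existing = []
    s3.put_bucket_tagging(Bucket=bucket, Tagging=with_ingestion_tag(existing, value))
```

Calling tag_bucket("example-demo-bucket", "<value copied from the Arize UI>") would then add or update the tag while leaving other bucket tags in place.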

Enable Bucket Policy Permissions

In Arize UI: Copy the policy supplied by Arize in the file importer job setup
Copy AWS Policy In Arize UI
In the AWS console: Navigate to your S3 bucket -> Permissions -> Edit Bucket Policy
Add/Edit bucket policy
In the AWS console: Paste the above AWS policy from Arize UI into the bucket policy
Add policy to your bucket
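Before saving, it can help to sanity-check the pasted policy. The sketch below assumes the policy grants the same three read actions that appear in the role-based permissions statement and the Terraform example in this guide; the policy supplied in the Arize UI remains the authoritative version. The check is deliberately crude: it does not expand wildcards or pair each action with its resource, and the sample principal is a placeholder.

```python
import json

REQUIRED_ACTIONS = {"s3:GetBucketTagging", "s3:GetObject", "s3:ListBucket"}


def _as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def missing_grants(policy_json: str, bucket: str) -> set[str]:
    """Return required actions and resources not found in any Allow statement."""
    required = REQUIRED_ACTIONS | {f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"}
    found = set()
    for statement in _as_list(json.loads(policy_json)["Statement"]):
        if statement.get("Effect") != "Allow":
            continue
        found.update(_as_list(statement.get("Action", [])))
        found.update(_as_list(statement.get("Resource", [])))
    return required - found


sample_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:role/placeholder"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::example-demo-bucket", "arn:aws:s3:::example-demo-bucket/*"],
    }],
})
print(missing_grants(sample_policy, "example-demo-bucket"))  # {'s3:GetBucketTagging'}
```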

Configure Role Based Permissions

Ask Arize: To give Arize access to multiple buckets, ask Arize to provide you with an External ID. Reach out to [email protected] for assistance.
Provide Arize: Provide Arize with the ARN of the role that can access each AWS bucket you want to connect.
In AWS Console: For each role that Arize will assume, add the following statement to the role's Trust Policy:
{
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::756106863523:root"
  },
  "Action": "sts:AssumeRole",
  "Condition": {
    "StringEquals": {
      "sts:ExternalId": "<EXTERNAL ID PROVIDED BY ARIZE>"
    }
  }
}
In AWS Console: Add the following statement to the role's Permissions Policies
{
  "Effect": "Allow",
  "Action": [
    "s3:GetBucketTagging",
    "s3:GetObject",
    "s3:ListBucket"
  ],
  "Resource": [
    "arn:aws:s3:::<YOUR BUCKET NAME>",
    "arn:aws:s3:::<YOUR BUCKET NAME>/*"
  ]
}
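The two statements above are fragments: a complete IAM policy document wraps them in a "Version" and a "Statement" list. As a rough sketch, the Python below builds both full documents for a brand-new role, with one permissions statement covering several buckets. The function names are hypothetical, and for a role that already has policies you would append the statement to its existing Statement list instead.

```python
import json

# Principal taken from the trust statement above.
ARIZE_PRINCIPAL = "arn:aws:iam::756106863523:root"


def trust_policy(external_id: str) -> dict:
    """Full trust policy letting the Arize principal assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": ARIZE_PRINCIPAL},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }


def permissions_policy(buckets: list[str]) -> dict:
    """Full permissions policy granting read access to each listed bucket."""
    resources = []
    for bucket in buckets:
        # ListBucket and GetBucketTagging act on the bucket ARN, GetObject on its objects.
        resources += [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetBucketTagging", "s3:GetObject", "s3:ListBucket"],
            "Resource": resources,
        }],
    }


print(json.dumps(trust_policy("<EXTERNAL ID PROVIDED BY ARIZE>"), indent=2))
print(json.dumps(permissions_policy(["<YOUR BUCKET NAME>"]), indent=2))
```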

Configure Ingestion Key

Tag your bucket with the key arize-ingestion-key and the provided tag value (i.e. AWS Object Tags).
In Arize UI: Copy arize-ingestion-key value
In AWS Console: Navigate to your S3 bucket -> Properties -> Edit Bucket Tags
In AWS Console: Set the tag Key to arize-ingestion-key and the Value to the value copied from the Arize UI in the previous step

Define Your Model Schema

Model schema parameters are a way of organizing model inference data to ingest to Arize. When configuring your schema, be sure to match your data column headers with the model schema.
You can either use a form or a simple JSON-based schema to specify the column mapping.
Arize supports CSV, Parquet, Avro, and Apache Arrow. Refer here for a list of the expected data types by input type.
File Schema Form Inputs
Form Schema JSON Inputs
| Property | Description | Required |
|---|---|---|
| prediction_ID | The unique identifier of a specific prediction. Limited to 128 characters. | Required |
| timestamp | The timestamp of the prediction, in seconds or as an RFC3339 timestamp | Optional, defaults to the current timestamp at file ingestion time |
| prediction_label | Column name for the prediction value | Required based on model type |
| prediction_score | Column name for the predicted score | Required based on model type |
| actual_label | Column name for the actual or ground truth value | Optional for production records |
| actual_score | Column name for the ground truth score | Required based on model type |
| prediction_group_id | Column name for ranking groups or lists in ranking models | Required for ranking models |
| rank | Column name for the rank of each element within its group or list | Required for ranking models |
| relevance_label | Column name for the ranking actual or ground truth value | Required for ranking models |
| relevance_score | Column name for the ranking ground truth score | Required for ranking models |
| features | A string prefix that identifies feature columns, e.g. feature/. Features must be sent in the same file as predictions | Arize automatically infers columns as features. Choose between feature prefixing OR inferred features. |
| tags | A string prefix that identifies tag columns, e.g. tag/. Tags must be sent in the same file as predictions and features | Optional |
| shap_values | A string prefix that identifies SHAP columns, e.g. shap/. SHAP values must be sent in the same file as predictions or with a matching prediction_id | Optional |
| version | A column that specifies the model version. version/ assigns a version to the corresponding data within a column; alternatively, configure your version within the UI | Optional, defaults to 'no_version' |
| batch_id | Distinguishes different batches of data under the same model_id and model_version. Must be specified as a constant during job setup or in the schema | Optional for validation records only |
| exclude | A list of columns to exclude if the features property is not included in the ingestion schema | Optional |
| embedding_features | A list of embedding columns, each with a required vector column, an optional raw data column, and an optional link-to-data column. Learn more here | Optional |
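To illustrate matching column headers to the schema, the sketch below checks a CSV header against a column mapping before upload. The mapping keys are property names from the table above, but the exact JSON layout Arize expects is not shown in this guide, so the dictionary shape, the column names, and the check_header helper are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical mapping: property name from the table -> column name (or prefix) in the file.
schema = {
    "prediction_ID": "prediction_id",
    "timestamp": "prediction_ts",
    "prediction_label": "predicted_click",
    "actual_label": "actual_click",
    "features": "feature/",
    "tags": "tag/",
}
# These properties name a column prefix rather than a single column.
PREFIX_PROPERTIES = {"features", "tags", "shap_values"}


def check_header(header: list[str], schema: dict[str, str]) -> list[str]:
    """Return a list of problems found when matching the header to the schema."""
    problems = []
    for prop, column in schema.items():
        if prop in PREFIX_PROPERTIES:
            if not any(name.startswith(column) for name in header):
                problems.append(f"{prop}: no column starts with {column!r}")
        elif column not in header:
            problems.append(f"{prop}: column {column!r} not found")
    return problems


sample = io.StringIO(
    "prediction_id,prediction_ts,predicted_click,feature/age,tag/region\n"
    "abc-1,1670000000,click,34,us-west\n"
)
header = next(csv.reader(sample))
print(check_header(header, schema))
```

Here the sample file has no actual_click column, so the check reports it. Since actual_label is optional for production records, you would drop that entry from the mapping for files that carry predictions only.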

Validate Your Model Schema

Once you fill in your applicable predictions, actuals, and model inputs, click 'Validate Schema' to visualize your model schema in the Arize UI. Check that your column names and corresponding data match for a successful import job.
Once finished, your import job will be created and will start polling your bucket for files.
If your model receives delayed actuals, connect your predictions and actuals using the same prediction ID, which links your data together in the Arize platform. Arize regularly checks your data source for both predictions and actuals, and ingests them separately as they become available. Learn more here.
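Conceptually, linking by prediction ID works like a keyed join. The toy example below is not how Arize implements it; it only shows why predictions and actuals that arrive in separate files must carry the same ID. The column names and the link_by_prediction_id helper are hypothetical.

```python
predictions = [
    {"prediction_id": "a1", "prediction_label": "click"},
    {"prediction_id": "a2", "prediction_label": "no_click"},
]
# Actuals arrive later and only for some predictions.
actuals = [
    {"prediction_id": "a1", "actual_label": "no_click"},
]


def link_by_prediction_id(predictions, actuals):
    """Attach each actual to the prediction that shares its prediction_id."""
    actual_by_id = {row["prediction_id"]: row for row in actuals}
    linked = []
    for pred in predictions:
        actual = actual_by_id.get(pred["prediction_id"], {})
        # Predictions without an actual yet keep actual_label as None.
        linked.append({**pred, "actual_label": actual.get("actual_label")})
    return linked


for row in link_by_prediction_id(predictions, actuals):
    print(row)
```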

Check Your File Import Job

Arize will attempt a dry run to validate your job for any access, schema, or record-level errors. If the dry run is successful, you may then create the import job.
After creating a job following a successful dry run, you will be taken to the 'Job Status' tab where you can see the status of your import jobs. A created job will regularly sync new data from your data source with Arize. You can view the job details and import progress by clicking on the job ID, which uncovers more information about the job.

Troubleshoot Import Job

An import job may run into a few problems. Use the dry run and job details UI to troubleshoot and quickly resolve data ingestion issues.

Validation Errors

If there is an error validating a file against the model schema, Arize will surface an actionable error message. From there, click on the 'Fix Schema' button to adjust your model schema.

Dry Run File/Table Passes But The Job Fails

If your dry run is successful but your job fails, click on the job ID to view the job details. These include the file path or query ID, the last import job, potential errors, and error locations.
Once you've identified the job failure point, fix the file errors and reupload the file to Arize with a new name.
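If corrected files are re-uploaded by a script, a small helper can derive the new name from the original key. The "-fixN" suffix below is an arbitrary convention chosen for this sketch, not something Arize requires, and renamed_key is a hypothetical helper.

```python
from pathlib import PurePosixPath


def renamed_key(key: str, attempt: int) -> str:
    """Return the same key with a -fixN suffix inserted before the extension."""
    path = PurePosixPath(key)
    return str(path.with_name(f"{path.stem}-fix{attempt}{path.suffix}"))


print(renamed_key("click-thru-rate/production/v1/12-1-2022.parquet", 1))
# click-thru-rate/production/v1/12-1-2022-fix1.parquet
```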

Applying Bucket Policy & Tag via Terraform

resource "aws_s3_bucket" "arize-example-bucket" {
  bucket = "my-arize-example-bucket"
  tags = {
    arize-ingestion-key = "value_from_arize_ui"
  }
}

resource "aws_s3_bucket_policy" "grant_arize_read_only_access" {
  bucket = aws_s3_bucket.arize-example-bucket.id
  policy = data.aws_iam_policy_document.grant_arize_read_only_access.json
}

data "aws_iam_policy_document" "grant_arize_read_only_access" {
  statement {
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::<REDACTED>:role/arize-importer"]
    }
    actions = [
      "s3:GetBucketTagging",
      "s3:GetObject",
      "s3:ListBucket",
    ]
    resources = [
      aws_s3_bucket.arize-example-bucket.arn,
      "${aws_s3_bucket.arize-example-bucket.arn}/*",
    ]
  }
}