AWS S3

Set up an import job to ingest data into Arize from AWS S3


If you prefer to use Terraform, jump to Applying Bucket Policy & Tag via Terraform

Set up an import job to log inference files to Arize. Arize checks for updates to your files every 10 seconds. Files generally perform best at a few hundred thousand to a million rows each, with a total file size limit of 1 GB.

Select Amazon S3

Navigate to the 'Upload Data' page on the left navigation bar in the Arize platform. From there, select the 'AWS S3' card to begin a new file import job.

Enable Access To Individual or Multiple Buckets

There are two ways to set up access permissions with Arize: grant Arize access to an individual bucket with a bucket policy, or assign Arize a role that can access multiple buckets using an external ID. Both approaches are covered below.

Add File Path

Fill in the file path that Arize should pull your model's inferences from. Arize will automatically infer your bucket name and prefix.

For example, you might have an AWS bucket and folder named s3://example-demo-bucket/click-thru-rate/production/v1/ that contains parquet files of your model inferences. In that case, your bucket name is example-demo-bucket and your prefix is click-thru-rate/production/v1/.
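If you want to sanity-check the prefix before creating the import job, a minimal sketch like the following (assuming you have boto3 installed and AWS credentials with read access to the bucket) lists the files Arize would see under that prefix:

import boto3

# List the files under the prefix you plan to point Arize at
# (first page of results only; large prefixes may require pagination)
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="example-demo-bucket",
    Prefix="click-thru-rate/production/v1/",
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])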

The file structure can take into account various model environments (training, production, etc.) and locations of ground truth. In addition, S3 bucket import is recursive: all nested subdirectories within the specified bucket prefix are included, regardless of their number or depth.

File Directory Example

There are multiple ways to structure your file directory. If actuals and predictions can be sent together, simply store them in the same file and import them through a single file importer job.

s3://bucket1/click-thru-rate/production/prediction/
├── 11-19-2022.parquet
├── 11-20-2022.parquet
└── 11-21-2022.parquet

s3://bucket1/click-thru-rate/production/actuals/
├── 12-1-2022.parquet # same prediction id column, model, and space as the corresponding prediction
├── 12-2-2022.parquet
└── 12-3-2022.parquet

Configure Ingestion Key

In Arize UI: Copy the arize-ingestion-key value

In AWS Console: Navigate to your S3 bucket -> Properties -> Tags -> Edit bucket tags

In AWS Console: Add a tag with Key = arize-ingestion-key and Value set to the value copied from the Arize UI in the previous step
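If you manage your bucket programmatically rather than through the console, the same tag can be applied with boto3. This is a sketch with placeholder values; note that put_bucket_tagging replaces the bucket's entire tag set, so include any existing tags you want to keep.

import boto3

s3 = boto3.client("s3")
# Replaces the bucket's whole tag set -- include any existing tags alongside the Arize key
s3.put_bucket_tagging(
    Bucket="example-demo-bucket",
    Tagging={
        "TagSet": [
            {"Key": "arize-ingestion-key", "Value": "<VALUE COPIED FROM ARIZE UI>"},
        ]
    },
)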

Enable Bucket Policy Permissions

In Arize UI: Copy the policy supplied by Arize in the file importer job setup

In the AWS console: Navigate to your S3 bucket -> Permissions -> Edit Bucket Policy

In the AWS console: Paste the policy copied from the Arize UI into the bucket policy editor and save
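As a programmatic alternative, the policy copied from the Arize UI can also be applied with boto3; the file name below is a placeholder for wherever you saved the Arize-supplied policy.

import boto3

# Read the policy JSON copied from the Arize UI and attach it to the bucket
with open("arize-bucket-policy.json") as f:
    policy_json = f.read()

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket="example-demo-bucket", Policy=policy_json)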

Configure Role Based Permissions

If your S3 images are encrypted using an AWS managed KMS encryption key, and individual bucket access is therefore not an option, please reach out to support@arize.com for assistance.

Provide Arize: Give Arize the ARN of the role it should assume to access each AWS bucket you want to connect.

In AWS Console: For each role Arize will assume, add the following statement to the role's Trust Policy:

{
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::756106863523:root"
    },
    "Action": "sts:AssumeRole",
    "Condition": {
        "StringEquals": {
            "sts:ExternalId": "<EXTERNAL ID PROVIDED BY ARIZE>"
        }
    }
}

In AWS Console: Add the following statement to the role's Permissions Policies

{
    "Effect": "Allow",
    "Action": [
        "s3:GetBucketTagging",
        "s3:GetObject",
        "s3:ListBucket"
    ],
    "Resource": [
        "arn:aws:s3:::<YOUR BUCKET NAME>",
        "arn:aws:s3:::<YOUR BUCKET NAME>/*"
    ]
}
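If you manage IAM through scripts rather than the console, a minimal boto3 sketch of the same setup might look like the following. The role name, policy name, and file names are placeholders, and the two JSON files are assumed to hold the statements above wrapped in complete policy documents.

import boto3

# Assumes trust-policy.json and permissions-policy.json contain the statements above,
# each wrapped in a full policy document: {"Version": "2012-10-17", "Statement": [ ... ]}
with open("trust-policy.json") as f:
    trust_policy = f.read()
with open("permissions-policy.json") as f:
    permissions_policy = f.read()

iam = boto3.client("iam")

# Create the role Arize will assume
iam.create_role(
    RoleName="arize-importer",              # placeholder role name
    AssumeRolePolicyDocument=trust_policy,
)

# Attach the S3 read permissions as an inline policy on that role
iam.put_role_policy(
    RoleName="arize-importer",
    PolicyName="arize-s3-read-access",      # placeholder policy name
    PolicyDocument=permissions_policy,
)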

Configure Ingestion Key

For role-based access, each bucket Arize will read from must also be tagged with the ingestion key.

In Arize UI: Copy the arize-ingestion-key value

In AWS Console: Navigate to your S3 bucket -> Properties -> Tags -> Edit bucket tags

In AWS Console: Add a tag with Key = arize-ingestion-key and Value set to the value copied from the Arize UI in the previous step

Define Your Model Schema

Model schema parameters are a way of organizing model inference data for ingestion into Arize. When configuring your schema, make sure your data column headers match the model schema.

You can either use a form or a simple JSON-based schema to specify the column mapping.

Validate Your Model Schema

Once you fill in your applicable predictions, actuals, and model inputs, click 'Validate Schema' to visualize your model schema in the Arize UI. Check that your column names and corresponding data match for a successful import job.

Learn more about Schema fields here.

Once finished, your import job will be created and will start polling your bucket for files.

Check Job Status

Arize will attempt a dry run to validate your job for any access, schema, or record-level errors. If the dry run is successful, you can proceed to create the import job. From there, you will be taken to the 'Job Status' tab.

All active jobs will regularly sync new data from your data source with Arize. You can view the job details by clicking on the job ID, which reveals more information about the job.

To pause, delete, or edit your file schema, click on 'Job Options'.

  • Delete a job if it is no longer needed or if you made an error connecting to the wrong bucket. This will set your job status as 'deleted' in Arize.

  • Pause a job if you update your data on a set cadence. This way, you can start the job again when you know there will be new data, reducing query costs. This will set your job status as 'inactive' in Arize.

  • Edit a file schema if you have added, renamed, or missed a column in the original schema declaration.

Troubleshoot Import Job

An import job may run into a few problems. Use the dry run and job details UI to troubleshoot and quickly resolve data ingestion issues.

If there is an error validating a file against the model schema, Arize will surface an actionable error message. From there, click on the 'Fix Schema' button to adjust your model schema.

Dry Run File/Table Passes But The Job Fails

If your dry run is successful, but your job fails, click on the job ID to view the job details. This uncovers job details such as information about the file path or query id, the last import job, potential errors, and error locations.

Once you've identified the job failure point, fix the file errors and reupload the file to Arize with a new name.
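For example, after fixing the file locally you might re-upload it under a new name so the importer treats it as a new file; the file name below is a placeholder.

import boto3

s3 = boto3.client("s3")
# Upload the corrected file under a new key so the import job picks it up as a new file
s3.upload_file(
    Filename="11-19-2022-fixed.parquet",
    Bucket="example-demo-bucket",
    Key="click-thru-rate/production/v1/11-19-2022-fixed.parquet",
)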

Applying Bucket Policy & Tag via Terraform

resource "aws_s3_bucket" "arize-example-bucket" {
  bucket = "my-arize-example-bucket"
  tags = {
    arize-ingestion-key = "value_from_arize_ui"
  }
}

resource "aws_s3_bucket_policy" "grant_arize_read_only_access" {
  bucket = aws_s3_bucket.arize-example-bucket.id
  policy = data.aws_iam_policy_document.grant_arize_read_only_access.json
}

data "aws_iam_policy_document" "grant_arize_read_only_access" {
  statement {
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::<REDACTED>:role/arize-importer"]
    }

    actions = [
      "s3:GetBucketTagging",
      "s3:GetObject",
      "s3:ListBucket",
    ]

    resources = [
      aws_s3_bucket.arize-example-bucket.arn,
      "${aws_s3_bucket.arize-example-bucket.arn}/*",
    ]
  }
}

Additional Notes

In the case of delayed actuals, we recommend separating your predictions and actuals into different folders and loading the data through two separate file importer jobs.

If your model receives delayed actuals, connect your predictions and actuals using the same prediction ID, which links your data together in the Arize platform. Arize regularly checks your data source for both predictions and actuals and ingests them separately as they become available.

Tag your bucket with the key arize-ingestion-key and the tag value provided in the Arize UI.

If your data contains image embeddings with S3 links to the images, use the "Individual Bucket Access" setup and do not use the role-based access described here. See the documentation on configuring S3 image access for more information.

To give Arize access to multiple buckets, ask Arize to provide you with an External ID. Reach out to support@arize.com for assistance.

Arize supports CSV, Parquet, Avro, and Apache Arrow files. Refer to the documentation for a list of the expected data types by input type.

