Sending Data FAQ

Delayed Actuals

What happens when you have two different models with the same set of prediction IDs?

When sending delayed actuals, specify the model_id in your schema to match your actuals to the correct model.

Does the Arize Platform look at specific model versions?

Delayed actuals are mapped back to predictions via a model_id and prediction_id, regardless of version. This means that if you have the same prediction_id in multiple model versions, the actual will be joined to each row with the matching prediction_id.
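
The join behavior described above can be pictured with a small pandas sketch. This is only an illustration of the matching rule, not Arize's implementation; the model name, versions, and labels are made-up sample data.

```python
import pandas as pd

# Hypothetical predictions: the same prediction_id "a1" exists in two versions.
predictions = pd.DataFrame({
    "model_id": ["fraud", "fraud", "fraud"],
    "model_version": ["v1", "v2", "v2"],
    "prediction_id": ["a1", "a1", "b2"],
    "prediction_label": ["fraud", "not_fraud", "fraud"],
})

# A delayed actual carries model_id and prediction_id, but no version.
actuals = pd.DataFrame({
    "model_id": ["fraud"],
    "prediction_id": ["a1"],
    "actual_label": ["fraud"],
})

# Joining on (model_id, prediction_id) attaches the actual to every
# matching prediction row, whichever version it belongs to.
joined = predictions.merge(actuals, on=["model_id", "prediction_id"], how="left")
print(joined)
```

Both "a1" rows (v1 and v2) receive the actual, while "b2" has no actual yet.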

What happens after I send in actual data?

When you send delayed actuals for predictions that already exist in Arize, Arize joins the actuals to the predictions with matching prediction IDs at 5 AM UTC daily.

However, if you have never logged predictions for your model, you must upload prediction values corresponding to your actuals (using the same prediction ID) to view your model in Arize.

What if I have actuals after the 14-day join window?

The Arize engineering support team can extend your join window. Reach out to Arize support for help with this.

The configurations are applied at the space level. All models in the space will receive the same join window configuration.

What happens in the case of partial actuals and null rows, and how does that impact my performance metric?

Arize only calculates performance metrics on predictions that have actuals. Null rows are ignored when calculating performance metrics. If actuals have not been received yet (delayed actuals), refer to our support information on how Arize handles nulls for your use case.
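
To see the effect on a metric, here is a minimal sketch that computes accuracy only over rows that have an actual. It mirrors the rule stated above on made-up data and is not Arize's metric code.

```python
import pandas as pd

# Hypothetical data: two of the four predictions have not received actuals yet.
df = pd.DataFrame({
    "prediction_label": ["a", "b", "a", "b"],
    "actual_label": ["a", None, "b", None],
})

# Rows with a null actual are left out of the metric entirely.
with_actuals = df.dropna(subset=["actual_label"])
accuracy = (with_actuals["prediction_label"] == with_actuals["actual_label"]).mean()

# Share of predictions that have an actual, useful context for the metric.
coverage = len(with_actuals) / len(df)
print(f"accuracy={accuracy:.2f} over {coverage:.0%} of predictions")
```

Reporting the share of predictions that have actuals alongside the metric helps show how much data the metric is based on.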

Timestamps Q&A

What does the data ingestion date mean?

The data ingestion date is the timestamp for when the data was received by Arize. The data ingestion tab in the Arize UI shows the real-time data ingestion stats daily.

Why should I use a prediction timestamp?

The prediction timestamp represents when your model's prediction was made (time of inference). This timestamp is a column of data sent to Arize with the model inference data. Arize time series charts are based on the prediction timestamp.

How should I format my prediction timestamp?

Your timestamp can be in seconds or an RFC3339 timestamp. If you do not specify a timestamp, your timestamp will default to the time of file ingestion.
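
As a quick illustration, both accepted formats can be produced from the same Python datetime. The inference time below is an arbitrary example value.

```python
from datetime import datetime, timezone

# Arbitrary example inference time, expressed in UTC.
inference_time = datetime(2022, 4, 16, 16, 15, 0, tzinfo=timezone.utc)

# Option 1: whole seconds since the Unix epoch.
epoch_seconds = int(inference_time.timestamp())

# Option 2: an RFC3339 string with the UTC "Z" suffix.
rfc3339 = inference_time.strftime("%Y-%m-%dT%H:%M:%SZ")

print(epoch_seconds)
print(rfc3339)
```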

Can I extend the prediction timestamp window?

Arize supports sending in historical data with prediction timestamps up to 2 years before the current timestamp. The data sent to Arize will be retained and visible in Arize for up to 2 years. Reach out to support@arize.com to extend this window.

Data Connector Q&A

What is the row limit for a single upload?

The current cap is 40 million rows; we recommend splitting the data into multiple files for larger uploads.

What is the advised file size to optimize upload performance?

While our system has the capacity to accommodate files up to 1GB, we find that files around the 50MB mark usually provide the best balance between volume of data and system performance. This size typically represents less than a million rows of data.
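
One way to stay within these limits is to split a large DataFrame into fixed-size chunks before writing files. The sketch below is a hypothetical helper, not part of the Arize SDK; pick rows_per_file so that each resulting file lands near your target size.

```python
import pandas as pd


def split_into_chunks(df, rows_per_file):
    """Yield consecutive slices of df with at most rows_per_file rows each."""
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    for start in range(0, len(df), rows_per_file):
        yield df.iloc[start:start + rows_per_file]


# Small stand-in for a large dataset.
df = pd.DataFrame({"prediction_id": [str(i) for i in range(10)]})
chunks = list(split_into_chunks(df, rows_per_file=4))

for i, chunk in enumerate(chunks):
    # Each chunk could be written out here, e.g. chunk.to_parquet(f"part-{i}.parquet")
    print(i, len(chunk))
```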

How do I resolve timestamp issues?

If the prediction timestamp column isn't correctly set, import jobs may result in parsing errors and fail. To make sure this doesn't happen, ensure that:

the timestamp format is in seconds (not something more granular) or RFC3339
the timestamp is within a year of today's date (either past or future)
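
The two checks above can be sketched as a small pre-upload validation. The function name, the millisecond threshold, and the window_days default are assumptions made for illustration; set window_days to the limit that applies to your account.

```python
from datetime import datetime, timedelta, timezone

# Assumption: values above this are milliseconds or finer, not seconds.
# 1e11 seconds is past the year 5000, far beyond any realistic prediction time.
MAX_PLAUSIBLE_SECONDS = 1e11


def check_timestamp(ts_seconds, now=None, window_days=365):
    """Return a list of problems found with a Unix timestamp given in seconds."""
    now = now or datetime.now(timezone.utc)
    if ts_seconds > MAX_PLAUSIBLE_SECONDS:
        return ["value looks like milliseconds or finer; expected seconds"]
    ts = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
    if abs(now - ts) > timedelta(days=window_days):
        return [f"{ts.isoformat()} is more than {window_days} days from {now.date()}"]
    return []


# A fixed reference date keeps the example reproducible.
reference = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(check_timestamp(1700000000, now=reference))     # seconds, November 2023
print(check_timestamp(1700000000000, now=reference))  # milliseconds
print(check_timestamp(1637538845, now=reference))     # seconds, November 2021
```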

How do I resolve file schema issues?

Ensure that Training and Validation records include both prediction and actual columns. Otherwise, the import will fail with a data validation error.

How do I resolve data type issues?

If the expected data type is numeric but the data comes in as a string:

Ensure there are no string values in numeric columns
If None or Null values are used to represent empty numerics, represent them instead as NaN
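
With pandas, one possible way to apply both fixes is pd.to_numeric with errors="coerce". The column name and values below are made up. Note that coercion also silently turns any other non-numeric string into NaN, so inspect the column first if that would hide real data problems.

```python
import pandas as pd

# Hypothetical column where empty numerics arrived as "None", None, and "null".
df = pd.DataFrame({"revenue": ["10.5", "None", None, "null", "7"]})

# Parse numeric strings; anything that cannot be parsed becomes NaN.
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

print(df["revenue"].dtype)
print(df)
```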

Which data connector is right for me?

Arize connectors are classified into two key categories, Data Warehouses/Lakes and Object Stores, which fulfill distinct data management and analytical needs.

Data Warehouses/Lakes (Snowflake, GBQ, Databricks): Used for real-time analysis of high-volume datasets. They are the prime choice for quick, continual, and comprehensive data exploration.
Object Stores (GCS, AWS S3, Azure): Designed for reliable and cost-effective storage of substantial data quantities, these connectors are ideal for backup and archival purposes. They are the go-to option when long-term data preservation and accessibility are important.

File Types Q&A

Can I upload embeddings, list_of_string, or ranking data in a CSV file?

If your embeddings, list_of_string, or ranking data are part of a CSV file, convert the data to Parquet before uploading your data to Arize. To convert your file:

import math

import pandas as pd


def parse_vector(v, sep=',', delim='[]'):
    """Convert a string representation of an embedding vector to a list of floats.

    Modify the separator and delimiters to match your vector representation.
    Example of a string vector representation: '[float1, float2, float3]'
    """
    # Missing values (float NaN) pass through unchanged
    if not isinstance(v, str) and math.isnan(v):
        return v
    # The literal string "nan" is treated as a missing value
    if isinstance(v, str) and v.lower() == "nan":
        return float('nan')

    v_list = v.strip(delim).replace('\n', '').split(sep)
    return [float(k.strip(" ")) for k in v_list]


file_name = "<file name>"          # replace with your file name, without extension
vector_column = "<vector column>"  # replace with the name of your vector column

df = pd.read_csv(f"{file_name}.csv")
df[vector_column] = df[vector_column].apply(parse_vector)
df.to_parquet(f"{file_name}.parquet")

What should a CSV file look like?

The contents of a CSV file vary based on model type, with the exception of required fields (i.e., prediction_id).

When configuring a model schema, what are the expected input types based on my data format?

| Input Data Field | Data Type |
| --- | --- |
| prediction_id | string |
| prediction_label/actual_label | string / int / float |
| prediction_score/actual_score | int / float |
| timestamp | int / float (number of seconds since Unix epoch); string in RFC3339 format (e.g. 2022-04-16T16:15:00Z) |
| features | string; int; float |
| tags | string |
| shap_values | int; float |

| Input Data Field | Parquet Data Type |
| --- | --- |
| prediction_id | string; int8, int16, int32, int64 |
| prediction_label/actual_label | string; boolean; int8, int16, int32, int64; float32, float64 |
| prediction_score/actual_score | int8, int16, int32, int64; float32, float64 |
| timestamp | int64, float64 (number of seconds since Unix epoch); string in RFC3339 format (e.g. 2022-04-16T16:15:00Z); timestamp; date32, date64 |
| features/tags | string; boolean (converted to string); int8, int16, int32, int64; float32, float64; decimal128 (rounded to 5 decimal places); date32, date64, timestamp (converted to integer); list of strings |
| shap_values | int32, int64; float32, float64 |
| embedding_feature:vector | list of int8, int16, int32, int64, float32, or float64 |
| embedding_feature:raw_data | string; list of strings |
| embedding_feature:link_to_data | string |
| ranking:prediction_group_id | string; int8, int16, int32, int64 |
| ranking:rank | int8, int16, int32, int64 |
| ranking:category | array of strings |
| ranking:relevance_score | int8, int16, int32, int64; float32, float64 |

Arize uses the Avro schema embedded in the header of the Avro Object Container File (OCF) to decode the data and match its fields to those specified in the Arize file importer schema for data ingestion. Field names in the OCF file must:

start with [A-Za-z_]
subsequently contain only [A-Za-z0-9_]
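
The two naming rules above translate directly into a regular expression. The helper name below is hypothetical; it could be used to check column names before writing an Avro file.

```python
import re

# First character: letter or underscore; remaining characters: letters, digits, underscore.
AVRO_FIELD_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_field_name(name):
    """Return True if name satisfies both field-name rules."""
    return AVRO_FIELD_NAME.fullmatch(name) is not None


for candidate in ["prediction_id", "_score2", "1st_feature", "my-feature"]:
    print(candidate, is_valid_field_name(candidate))
```

Names that start with a digit or contain a hyphen fail the check and would need to be renamed.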

| Input Data Field | Avro Data Type |
| --- | --- |
| prediction_id | long, int, string |
| prediction_label/actual_label | string; boolean, int, long, float, double, enum (will be converted to string) |
| prediction_score/actual_score | int, long, float, double |
| timestamp | long, double (number of seconds since Unix epoch); string in RFC3339 format (e.g. 2022-04-16T16:15:00Z); timestamp logical type; date logical type |
| features/tags | string; enum, boolean (will be converted to string); int, long; float, double; array of strings (will be converted to list of strings) |
| shap_values | int, long, float, double |
| embedding_feature:vector | array of int, long, float, or double |
| embedding_feature:raw_data | string; array of strings |
| embedding_feature:link_to_data | string |
| ranking:prediction_group_id | long, int, string |
| ranking:rank | int, long |
| ranking:category | array of strings |
| ranking:relevance_score | int, long, float, double |

This example shows what the columns of an Arrow file and the corresponding schema would look like.

The "*" can be used to add features to a file without changing the schema.

| Column Name in File | Arize Schema |
| --- | --- |
| my-prediction-ts | prediction_timestamp |
| my-prediction-id-customer | prediction_id |
| my-prediction-score | prediction_score |
| my-prediction-label | prediction_label |
| my-feature.addr_state | features |
| my-feature.revenue | features |
| my-environment | environment |
| my-actual-label | actual_label |

Note the name "my-feature" has multiple feature values.

ModelSchema:
   prediction_timestamp: "my-prediction-ts"
   prediction_id: "my-prediction-id-customer"
   prediction_score: "my-prediction-score"
   prediction_label: "my-prediction-label"
   features: "my-feature.*" # describes the path to the "features" object above, containing "addr_state" and "revenue"

| Input Data Field | GBQ Data Type |
| --- | --- |
| prediction_id | INT64, STRING |
| change_timestamp | TIMESTAMP |
| prediction_label/actual_label | STRING; BOOL, INT64, NUMERIC, FLOAT64 |
| prediction_score/actual_score | INT64, NUMERIC, FLOAT64 |
| timestamp | INT64, FLOAT64 (number of seconds since Unix epoch); STRING in RFC3339 format (e.g. 2022-04-16T16:15:00Z); TIMESTAMP |
| features/tags | STRING; BOOL (will be converted to string); INT64; FLOAT64, NUMERIC |
| shap_values | INT64, FLOAT64, NUMERIC |
| embedding_feature:vector | ARRAY of INT64, FLOAT64, or NUMERIC |
| embedding_feature:raw_data | STRING; ARRAY of STRING |
| embedding_feature:link_to_data | STRING |
| ranking:prediction_group_id | INT64, STRING |
| ranking:rank | INT64 |
| ranking:category | ARRAY of STRING |
| ranking:relevance_score | INT64, FLOAT64, NUMERIC |

Installing the arize[MimicExplainer]

I get "no matches found" error

I get this error when I try pip install arize[MimicExplainer]:

no matches found: arize[MimicExplainer]

Some shells (such as zsh) interpret the brackets as a glob pattern. To get around this, escape the brackets with a backslash: pip install arize\[MimicExplainer\].

I get a verbose error when trying to install arize[MimicExplainer]

I get this error when pip is installing LightGBM (a dependency of the MimicExplainer):

INFO:LightGBM:Starting to compile with CMake.
...
FileNotFoundError: [Errno 2] No such file or directory: 'cmake'

subprocess.CalledProcessError: Command '['make', '_lightgbm', ...]' returned non-zero exit status 2."

This is because pip is attempting to compile lib_lightgbm, the C library for LightGBM. The compile process needs cmake as well as OpenMP. To install these on macOS, try this first:

brew install cmake libomp

Data Sources Q&A

What do example rows look like in a file?

Example Row

(with explicit feature and tag column prefixes)

prediction_id

prediction_ts

user_id

tags/zone

feature/metropolitan_area

industry

prediction_score

actual_score

prediction_label

actual_label

1fcd50f4689

1637538845

82256

us-east-1

1PA

engineering

0.07773696

No Claims

(without explicit feature column prefixes for implicit ingestion)

prediction_id

prediction_ts

user_id

tags/zone

metropolitan_area

industry

prediction_score

actual_score

prediction_label

actual_label

1fcd50f4689

1637538845

82256

us-east-1

1PA

engineering

0.07773696

No Claims

Example Schema

{
  "prediction_id": "prediction_id",
  "timestamp": "prediction_ts",
  "features": "feature/", # omit this row and feature/ column label prefix for implicit ingestion (must pick explicit or implicit) 
  "prediction_score": "prediction_score",
  "prediction_label": "prediction_label",
  "actual_score": "actual_score",
  "actual_label": "actual_label",
  "tags": "tag/", # requires explicit column declaration
  "shap_values": "shap/", # requires explicit column declaration
  "exclude": ["user_id"]
}

Python SDK Q&A

Did Arize receive my prediction/actual record?

When you log prediction data into Arize using the Python SDK, there are two ways to check the status depending on which mode is chosen.

Batch: The pandas logger named arize.pandas returns a response, so you can check the status with response.status_code.

Real-time: The logger named arize.log returns a future. See the example:

import concurrent.futures as cf

def arize_responses_helper(responses):
    """
    responses: a list of responses from Arize
    returns: None
    """
    for response in cf.as_completed(responses):
        res = response.result()
        if res.status_code != 200:
            raise ValueError(f'failed with code {res.status_code}, {res.text}')

# Logging to Arize, returns a list of responses
responses = arize.log(...) # your log call
# Check responses!          
arize_responses_helper(responses)

After receiving a 200 response code, head over to your model's Data Ingestion Tab to confirm that Arize has received your data.

The model's inferences are indexed by the received timestamp, NOT the timestamp of the inferences.

What happens if I upload the same data with the same prediction ID twice? Does Arize treat that as one prediction/observation or as two?

They are treated as separate observations. This would mean that 2 predictions sent with the same prediction ID would count as 2 predictions in the platform. If there was an actual sent for both predictions, it would show up as 2 separate predictions with both having a corresponding matching actual.
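
Because repeated uploads are counted as separate observations, it can help to check for repeated prediction IDs before sending data. The sketch below uses plain pandas on made-up data; whether to keep the first or last copy of a duplicate is a choice that depends on your pipeline.

```python
import pandas as pd

# Hypothetical batch in which prediction "a1" appears twice.
df = pd.DataFrame({
    "prediction_id": ["a1", "b2", "a1", "c3"],
    "prediction_score": [0.10, 0.75, 0.12, 0.40],
})

# Show every row whose prediction_id occurs more than once.
dupes = df[df.duplicated(subset="prediction_id", keep=False)]
print(dupes)

# One option: keep only the most recent row per prediction_id.
deduped = df.drop_duplicates(subset="prediction_id", keep="last")
print(deduped)
```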

What are the Supported Data Types for the Python SDK?

We currently support the following data types for the corresponding columns.

| Column Type | Supported Data Types |
| --- | --- |
| Features | int, float, str, bool |
| Prediction ID | int, str |
| Prediction Timestamps | int, float, date, datetime |
| Prediction Score | int, float |
| Actual Score | int, float |
| SHAP values | int, float |

Supported data types for Prediction and Actual labels and scores depend on the model types.

Column Type

Score Categorical

Numeric

Prediction Label

str

int, float

Actual Label

str

int, float

Prediction Score

int, float

Actual Score

int, float

Actual Numeric Sequence

List[int, float]

How can I ensure that consistent types are logged across multiple SDK calls?

Arize's Python SDK offers optional type configurations for features and tags in both dataframes and single value records.

Single-Record Logging Example:

# TypedValue: 
#   An object that can be passed into the features or tags dict.
#   Specifies the type that the value should be cast to. 

# Cast the 'distance' and 'purchased' values, and leave the 'age' value alone.
features = {
    'age': 49,
    'distance': TypedValue(value=8, type=ArizeTypes.FLOAT), 
    'purchased': TypedValue(value=bool(1), type=ArizeTypes.INT),  
}
response = arize.log(
    model_id='sample-model-1', 
    model_version='v1', 
    ...
    tags=tags,
)

Batch Logging Example:

# TypedColumns:
#   An object that can be passed into the Schema in place of 
#   a list of feature or tag column names.
#   Columns in the 'inferred' list are ingested as-is, and
#   the other columns are ingested as nullable columns of the specified type. 
# Note: Requires pandas version 1.0.0 or higher,
#   and uses pd.StringDtype, still considered experimental.

schema = Schema(
    prediction_id_column_name='prediction_id', 
    ...
    tag_column_names=['location', 'month', 'fruit', 'count'], # columns ingested as-is
    feature_column_names=TypedColumns(
        inferred=['age'],         # columns ingested as-is
        to_float=['distance'],    # columns cast to specified type
        to_str=['purchased', 'country'],
    ),
)

Null Values Q&A

How does Arize handle missing predictions or actual columns?

Arize requires at least one prediction, actual, or feature importance column. If all are missing, Arize will reject the dataset.

How does Arize handle null values in a prediction or actual column?

Arize generally accepts null values within prediction and actual columns. To successfully ingest null values while still constructing valid records in Arize, each row must contain at least one non-null prediction, actual, or feature importance value.

For example, in the case of a regression model, it's common for actuals (ground truth) to miss values. In this case, the actuals column will include null values and non-null values. For the rows where an actual is null, a prediction value or feature importance value is required to construct a valid record.
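
A pre-upload check for this rule could look like the following sketch. The column names are hypothetical; list whichever prediction, actual, and feature importance columns your dataset contains.

```python
import pandas as pd

# Hypothetical regression data with delayed actuals.
df = pd.DataFrame({
    "prediction_score": [0.2, None, None],
    "actual_score": [None, 0.9, None],
    "shap_age": [None, None, None],
})

# Columns that can make a row valid: predictions, actuals, feature importances.
value_columns = ["prediction_score", "actual_score", "shap_age"]

# A row is invalid when every one of those columns is null.
invalid_rows = df[df[value_columns].isna().all(axis=1)]
print(invalid_rows)
```

Here only the third row is flagged, because it has no prediction, actual, or feature importance value.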

What if my features or tags have None/NaN/Inf values?

They are accepted and treated as empty.

What if I have null values in my Prediction ID column?

If you include a prediction_id column with null values within the column, Arize will reject the column.

