
Sending Data FAQ

Delayed Actuals Q&A

What if my ground truth/actuals are delayed after I send my predictions to Arize?

Depending on your model use case, you may experience a delayed feedback loop when collecting ground truth data (actuals). If your actuals are delayed, Arize supports ingesting delayed actual data by automatically connecting actuals to predictions via the same prediction ID.
To ingest delayed actuals, prediction_id and the actual values are required fields. Example file import schema with delayed actuals:
// logging predictions
{
  "prediction_id": "prediction_id",
  ...
  "prediction_score": "prediction_score", // scores or labels vary based on the model type
  "prediction_label": "prediction_label"
}

// logging delayed actuals
{
  "prediction_id": "prediction_id", // must match previously ingested prediction
  ...
  "actual_score": "actual_score", // scores or labels vary based on the model type
  "actual_label": "actual_label"
}
Refer here for an example in the Python SDK.
Arize connects actuals to predictions up to a 14-day window from when the prediction was received. Read here for more information.
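As an illustration, a minimal sketch of logging delayed actuals with the Python pandas SDK might look like the following (space/API keys, the model ID, and column names are placeholders; see the linked SDK example for the authoritative version):
import pandas as pd
from arize.pandas.logger import Client
from arize.utils.types import Environments, ModelTypes, Schema

# Placeholder credentials and model ID -- replace with your own values
arize_client = Client(space_key="YOUR_SPACE_KEY", api_key="YOUR_API_KEY")

# Delayed actuals: prediction_id must match the previously logged predictions
actuals_df = pd.DataFrame(
    {
        "prediction_id": ["pred_1", "pred_2"],
        "actual_label": ["No Claims", "Claims"],
        "actual_score": [0, 1],
    }
)

response = arize_client.log(
    dataframe=actuals_df,
    model_id="your-model-id",
    model_type=ModelTypes.SCORE_CATEGORICAL,
    environment=Environments.PRODUCTION,
    schema=Schema(
        prediction_id_column_name="prediction_id",
        actual_label_column_name="actual_label",
        actual_score_column_name="actual_score",
    ),
)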

Can I log predictions and actuals in different files using the file importer?

Yes, Arize automatically connects files using the same prediction ID.

What happens when you have two different models with the same set of prediction IDs?

When sending delayed actuals, specify the model_id in your schema to match your actuals to the correct model.

Does the Arize Platform look at specific model versions?

Delayed actuals are mapped back to predictions via a model_id and prediction_id, regardless of version. This means that if you have the same prediction_id in multiple model versions, the actual will be joined to each row with the matching prediction_id.

How long does it take to connect delayed actuals in the platform?

Arize connects delayed actuals with prediction IDs in the platform at 5 AM UTC daily.

What happens after I send in actual data?

If you send actuals to Arize to log delayed actuals (i.e., the corresponding predictions already exist in Arize), Arize joins the delayed actuals with the matching prediction IDs in the platform at 5 AM UTC daily.
However, if you have never logged predictions for your model, you must upload prediction values corresponding to your actuals (using the same prediction ID) to view your model in Arize.

How does Arize handle tags with delayed actuals?

Tags can be updated via delayed actuals. If tags are sent with actuals, the tags are joined based on prediction_id. However, if an actual was already sent and is then re-sent with an updated tag value, the tag value will not be updated. Tags sent with predictions (and not with actuals) remain unchanged.
For example, if a user sends Arize a prediction with tags:
"location": "New York"
"month": "January"
And actual with tags:
"location": "Chicago"
"fruit": "apple"
The resulting tags available will be:
"location": "New York"
"month": "January"
"fruit": "apple"

If I upload actuals twice, will it overwrite the existing actuals?

Arize checks whether an actual value already exists for a given prediction_id and drops the new actual if one does. Reach out to [email protected] if you have questions.

What if I have actuals after the 14-day join window?

The Arize engineering support team can extend your connection windows. Reach out to [email protected] for help with this.
The configurations are applied at the space level. All models in the space will receive the same join window configuration.

What happens in the case of partial actuals and null rows, and how does that impact my performance metric?

Arize only calculates performance metrics on predictions that have actuals. Null rows are ignored when calculating performance metrics. If actuals have not been received yet (delayed actuals), refer to our default actual support information on how Arize handles nulls for your use case.

Timestamps Q&A

What does the data ingestion date mean?

The data ingestion date is the timestamp for when the data was received by Arize. The data ingestion tab in the Arize UI shows the real-time data ingestion stats daily.

Why should I use a prediction timestamp?

The prediction timestamp represents when your model's prediction was made (time of inference). This timestamp is a column of data sent to Arize with the model inference data. Arize time series charts are based on the prediction timestamp.

How should I format my prediction timestamp?

Your timestamp can be in seconds since the Unix epoch or an RFC3339-formatted string. If you do not specify a timestamp, it defaults to the time of file ingestion.
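For instance, both accepted formats can be produced in Python like this (variable names are illustrative):
import datetime

# Seconds since the Unix epoch (int or float)
ts_seconds = int(datetime.datetime.now(datetime.timezone.utc).timestamp())

# RFC3339-formatted string, e.g. "2022-04-16T16:15:00Z"
ts_rfc3339 = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")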

Can I extend the prediction timestamp window?

Arize supports sending in historical data with prediction timestamps up to 1 year before your current timestamp. Data sent to Arize is retained and visible in Arize for up to 2 years. Reach out to [email protected] to extend this window.

Data Connector Q&A

Resolving Common Issues

How do I resolve timestamp issues?

If the prediction timestamp column isn't correctly set, import jobs may result in parsing errors and fail. To make sure this doesn't happen, ensure that:
  • the timestamp format is in seconds (not something more granular) or RFC3339
  • the timestamp is within a year of today's date (either past or future)
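For example, if your timestamps arrived in milliseconds, one way to convert and sanity-check them with pandas is sketched below (file and column names are illustrative):
import pandas as pd

df = pd.read_parquet("inferences.parquet")  # illustrative file name

# Convert millisecond epoch timestamps to seconds
df["prediction_ts"] = df["prediction_ts"] / 1000

# Check that all timestamps fall within a year of today (past or future)
now = pd.Timestamp.utcnow().timestamp()
one_year = 365 * 24 * 60 * 60
assert df["prediction_ts"].between(now - one_year, now + one_year).all()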

How do I resolve file schema issues?

Training and validation records must include both prediction and actual columns; otherwise, the import job will result in a data validation error.

How do I resolve data type issues?

If the expected data type is numeric but the data arrives as a string:
  • Ensure there are no string values in numeric columns
  • If None or Null values are used to represent empty numerics, represent them as NaN instead
Learn more about data types here.
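As a non-authoritative sketch, such a column can be coerced with pandas (column name and values are illustrative):
import pandas as pd

# Illustrative column mixing numbers, numeric strings, and null markers
df = pd.DataFrame({"prediction_score": ["0.42", "None", None, 0.9]})

# Values that cannot be parsed as numbers (including the string "None") become NaN
df["prediction_score"] = pd.to_numeric(df["prediction_score"], errors="coerce")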

File Types Q&A

Can I upload embeddings or ranking data in a CSV file?

If your embeddings or ranking data are part of a CSV file, convert the data to Parquet before uploading your data to Arize. To convert your file:
import pandas as pd
import math

"""
parse_vector is a helper function that converts a string representation
of an embedding vector to a list of floats. Modify the separator and
delimiters to match your vector representation.

Example of a string vector representation:
'[float1, float2, float3]'
"""
def parse_vector(v, sep=',', delim='[]'):
    # Pass missing values (float NaN) through unchanged
    if not isinstance(v, str) and math.isnan(v):
        return v
    # Treat the literal string "nan" as a missing value
    if isinstance(v, str) and v.lower() == "nan":
        return float('nan')
    v_list = v.strip(delim).replace('\n', '').split(sep)
    return [float(k.strip(" ")) for k in v_list]

df = pd.read_csv("<file name>.csv")

df["<vector column>"] = df.apply(lambda x: parse_vector(x["<vector column>"]), axis=1)

df.to_parquet("<file name>.parquet")

What should a CSV file look like?

The contents of a CSV file can vary based on model type, with the exception of required fields (i.e. prediction_id).
Predictions Only
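For instance, a predictions-only CSV (actual columns arrive later as delayed actuals) might look like the illustrative rows below, which reuse the example values shown in the Data Sources Q&A further down this page:
prediction_id,prediction_ts,metropolitan_area,industry,prediction_score,prediction_label
1fcd50f4689,1637538845,1PA,engineering,0.07773696,No Claims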

When configuring a model schema, what are the expected input types based on my data format?

CSV

  • prediction_id: string
  • prediction_label/actual_label: string / int / float
  • prediction_score/actual_score: int / float
  • timestamp: int / float (number of seconds since unix epoch); string in RFC3339 format (e.g. 2022-04-16T16:15:00Z)
  • features: string, int, float
  • tags: string
  • shap_values: int, float
Parquet

  • prediction_id: string; int8, int16, int32, int64
  • prediction_label/actual_label: string; boolean; int8, int16, int32, int64; float32, float64
  • prediction_score/actual_score: int8, int16, int32, int64; float32, float64
  • timestamp: int64, float64 (number of seconds since unix epoch); string in RFC3339 format (e.g. 2022-04-16T16:15:00Z); timestamp; date32, date64
  • features/tags: string; boolean (converted to string); int8, int16, int32, int64; float32, float64; decimal128 (rounded to 5 decimal places); date32, date64, timestamp (converted to integer)
  • shap_values: int32, int64; float32, float64
  • embedding_feature:vector: list of {int8|int16|int32|int64|float32|float64}
  • embedding_feature:raw_data: string; list of strings
  • embedding_feature:link_to_data: string
  • ranking:prediction_group_id: string; int8, int16, int32, int64
  • ranking:rank: int8, int16, int32, int64
  • ranking:category: array of strings
  • ranking:relevance_score: int8, int16, int32, int64; float32, float64
Avro

Use the Avro schema embedded in the header of the Avro Object Container File (OCF) to decode fields and match them to those specified in the Arize file importer schema for data ingestion. Field names in the OCF file must
  • start with [A-Za-z_]
  • subsequently contain only [A-Za-z0-9_]

  • prediction_id: long, int, string
  • prediction_label/actual_label: string; boolean, int, long, float, double, enum (will be converted to string)
  • prediction_score/actual_score: int, long, float, double
  • timestamp: long, double (number of seconds since unix epoch); string in RFC3339 format (e.g. 2022-04-16T16:15:00Z); timestamp logical type; date logical type
  • features/tags: string; enum, boolean (will be converted to string); int, long; float, double
  • shap_values: int, long, float, double
  • embedding_feature:vector: array of {int|long|float|double}
  • embedding_feature:raw_data: string; array of strings
  • embedding_feature:link_to_data: string
  • ranking:prediction_group_id: long, int, string
  • ranking:rank: int, long
  • ranking:category: array of strings
  • ranking:relevance_score: int, long, float, double
Apache Arrow

This example shows what an Arrow file's columns and schema would look like. The "*" wildcard can be used to add features to a file without changing the schema. Column name in file mapped to Arize schema field:
  • my-prediction-ts: prediction_timestamp
  • my-prediction-id-customer: prediction_id
  • my-prediction-score: prediction_score
  • my-prediction-label: prediction_label
  • my-feature.addr_state: features
  • my-feature.revenue: features
  • my-environment: environment
  • my-actual-label: actual_label
Note that the prefix "my-feature" has multiple feature values.
ModelSchema:
  prediction_timestamp: "my-prediction-ts"
  prediction_id: "my-prediction-id-customer"
  prediction_score: "my-prediction-score"
  prediction_label: "my-prediction-label"
  features: "my-feature.*" # describes the path to the feature columns above, containing "addr_state" and "revenue"
BigQuery

See here for the definitions of BigQuery types.
  • prediction_id: INT64, STRING
  • change_timestamp: TIMESTAMP
  • prediction_label/actual_label: STRING; BOOL, INT64, NUMERIC, FLOAT64
  • prediction_score/actual_score: INT64, NUMERIC, FLOAT64
  • timestamp: INT64, FLOAT64 (number of seconds since unix epoch); STRING in RFC3339 format (e.g. 2022-04-16T16:15:00Z); TIMESTAMP
  • features/tags: STRING; BOOL (will be converted to string); INT64; FLOAT64, NUMERIC
  • shap_values: INT64, FLOAT64, NUMERIC
  • embedding_feature:vector: ARRAY of {INT64, FLOAT64, NUMERIC}
  • embedding_feature:raw_data: STRING; ARRAY of STRING
  • embedding_feature:link_to_data: STRING
  • ranking:prediction_group_id: INT64, STRING
  • ranking:rank: INT64
  • ranking:category: ARRAY of STRING
  • ranking:relevance_score: INT64, FLOAT64, NUMERIC

Installing arize[MimicExplainer]

I get "no matches found" error

I get this error when I try pip install arize[MimicExplainer]:
no matches found: arize[MimicExplainer]
Some shells (such as zsh) interpret the brackets in a special way. To get around this, you may need to escape the brackets with a backslash: pip install arize\[MimicExplainer\].

I get a verbose error when trying to install arize[MimicExplainer]

I get this error when pip is installing LightGBM (a dependency of the MimicExplainer):
INFO:LightGBM:Starting to compile with CMake.
...
FileNotFoundError: [Errno 2] No such file or directory: 'cmake'
or
subprocess.CalledProcessError: Command '['make', '_lightgbm', ...]' returned non-zero exit status 2.
This is because pip is attempting to compile lib_lightgbm, the C library for LightGBM. The compile process needs cmake as well as OpenMP. To install these on macOS, try this first:
brew install cmake libomp

Data Sources Q&A

What do example rows look like in a file?

Example Row

(with explicit feature and tag column prefixes)
  • prediction_id: 1fcd50f4689
  • prediction_ts: 1637538845
  • user_id: 82256
  • tags/zone: us-east-1
  • feature/metropolitan_area: 1PA
  • industry: engineering
  • prediction_score: 0.07773696
  • actual_score: 0
  • prediction_label: No Claims
  • actual_label: No Claims
(without explicit feature column prefixes for implicit ingestion)
  • prediction_id: 1fcd50f4689
  • prediction_ts: 1637538845
  • user_id: 82256
  • tags/zone: us-east-1
  • metropolitan_area: 1PA
  • industry: engineering
  • prediction_score: 0.07773696
  • actual_score: 0
  • prediction_label: No Claims
  • actual_label: No Claims

Example Schema

{
  "prediction_id": "prediction_id",
  "timestamp": "prediction_ts",
  "features": "feature/", # omit this row and the feature/ column prefix for implicit ingestion (must pick explicit or implicit)
  "prediction_score": "prediction_score",
  "prediction_label": "prediction_label",
  "actual_score": "actual_score",
  "actual_label": "actual_label",
  "tags": "tags/", # requires explicit column declaration
  "shap_values": "shap/", # requires explicit column declaration
  "exclude": ["user_id"]
}