Import Your Data

How to create Phoenix datasets and schemas for common data formats
This guide shows you how to define a Phoenix dataset using your own data.
  • For a conceptual overview of the Phoenix API, including a high-level introduction to the notion of datasets and schemas, see Phoenix Basics.
  • For a comprehensive description of phoenix.Dataset and phoenix.Schema, see the API reference.
The examples in this guide assume that Phoenix has been imported as px (import phoenix as px). Once you have a pandas dataframe df containing your data and a schema object describing the format of your dataframe, you can define your Phoenix dataset either by running
ds = px.Dataset(df, schema)
or, optionally, by also providing a name for your dataset that will appear in the UI:
ds = px.Dataset(df, schema, name="training")
As you can see, instantiating your dataset is the easy part. Before you run the code above, you must first wrangle your data into a pandas dataframe and then create a Phoenix schema to describe the format of your dataframe. The rest of this guide shows you how to match your schema to your dataframe with concrete examples.

Predictions and Actuals

Let's first see how to define a schema with predictions and actuals (Phoenix's nomenclature for ground truth). The example dataframe below contains inference data from a binary classification model trained to predict whether a user will click on an advertisement. The timestamps are datetime.datetime objects that represent the time at which each inference was made in production.

Dataframe

| timestamp | prediction_score | prediction | target |
| --- | --- | --- | --- |
| 2023-03-01 02:02:19 | 0.91 | click | click |
| 2023-02-17 23:45:48 | 0.37 | no_click | no_click |
| 2023-01-30 15:30:03 | 0.54 | click | no_click |
| 2023-02-03 19:56:09 | 0.74 | click | click |
| 2023-02-24 04:23:43 | 0.37 | no_click | click |

Schema

schema = px.Schema(
    timestamp_column_name="timestamp",
    prediction_score_column_name="prediction_score",
    prediction_label_column_name="prediction",
    actual_label_column_name="target",
)
This schema defines prediction scores, predicted labels, and actual labels, but you can run Phoenix with any subset of those fields, e.g., with only predicted labels.
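As a minimal sketch, the example dataframe above could be built with pandas as follows. The timestamp strings are parsed into datetime values, and the column names that the schema refers to are checked against the dataframe before anything is handed to Phoenix. The schema_kwargs dictionary and the missing_columns check are illustrative helpers, not part of the Phoenix API; the dictionary keys are the px.Schema arguments shown above.

```python
import pandas as pd

# Rows copied from the example dataframe above.
df = pd.DataFrame(
    {
        "timestamp": [
            "2023-03-01 02:02:19",
            "2023-02-17 23:45:48",
            "2023-01-30 15:30:03",
            "2023-02-03 19:56:09",
            "2023-02-24 04:23:43",
        ],
        "prediction_score": [0.91, 0.37, 0.54, 0.74, 0.37],
        "prediction": ["click", "no_click", "click", "click", "no_click"],
        "target": ["click", "no_click", "no_click", "click", "click"],
    }
)

# Parse the timestamp strings into datetime values.
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Keyword arguments for px.Schema, kept in a plain dict (an illustrative
# choice) so that they can be checked without importing Phoenix.
schema_kwargs = {
    "timestamp_column_name": "timestamp",
    "prediction_score_column_name": "prediction_score",
    "prediction_label_column_name": "prediction",
    "actual_label_column_name": "target",
}

# Every column named in the schema should exist in the dataframe.
missing_columns = [name for name in schema_kwargs.values() if name not in df.columns]
print(missing_columns)
```

If missing_columns is empty, the same keyword arguments can be passed along as px.Schema(**schema_kwargs).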

Features and Tags

Phoenix accepts not only predictions and ground truth but also input features of your model and tags that describe your data. In the example below, features such as FICO score and merchant ID are used to predict whether a credit card transaction is legitimate or fraudulent. In contrast, tags such as age and gender are not model inputs, but are used to filter your data and analyze meaningful cohorts in the app.

Dataframe

| fico_score | merchant_id | loan_amount | annual_income | home_ownership | num_credit_lines | inquests_in_last_6_months | months_since_last_delinquency | age | gender | predicted | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 578 | Scammeds | 4300 | 62966 | RENT | 110 | 0 | 0 | 25 | male | not_fraud | fraud |
| 507 | Schiller Ltd | 21000 | 52335 | RENT | 129 | 0 | 23 | 78 | female | not_fraud | not_fraud |
| 656 | Kirlin and Sons | 18000 | 94995 | MORTGAGE | 31 | 0 | 0 | 54 | female | uncertain | uncertain |
| 414 | Scammeds | 18000 | 32034 | LEASE | 81 | 2 | 0 | 34 | male | fraud | not_fraud |
| 512 | Champlin and Sons | 20000 | 46005 | OWN | 148 | 1 | 0 | 49 | male | uncertain | uncertain |

Schema

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    feature_column_names=[
        "fico_score",
        "merchant_id",
        "loan_amount",
        "annual_income",
        "home_ownership",
        "num_credit_lines",
        "inquests_in_last_6_months",
        "months_since_last_delinquency",
    ],
    tag_column_names=[
        "age",
        "gender",
    ],
)
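
To illustrate what a cohort is, the sketch below uses plain pandas to compare prediction accuracy across the gender tag for the five example rows. Phoenix provides this kind of filtering in the app; the correct column and the accuracy_by_gender name are introduced here only for illustration.

```python
import pandas as pd

# The tag and label columns from the example dataframe in this section.
df = pd.DataFrame(
    {
        "gender": ["male", "female", "female", "male", "male"],
        "predicted": ["not_fraud", "not_fraud", "uncertain", "fraud", "uncertain"],
        "target": ["fraud", "not_fraud", "uncertain", "not_fraud", "uncertain"],
    }
)

# Mark a row as correct when the predicted label matches the actual label.
df["correct"] = df["predicted"] == df["target"]

# Group by the tag to get the fraction of correct predictions in each cohort.
accuracy_by_gender = df.groupby("gender")["correct"].mean()
print(accuracy_by_gender)
```

Because gender is a tag rather than a feature, it can be used to slice results like this without being treated as a model input.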

Implicit Features

If your data has a large number of features, it can be inconvenient to list them all. For example, the breast cancer dataset below contains 30 features that can be used to predict whether a breast mass is malignant or benign. Instead of explicitly listing each feature, you can leave the feature_column_names field of your schema set to its default value of None, in which case, any columns of your dataframe that do not appear in your schema are implicitly assumed to be features.

Dataframe

| target | predicted | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | radius error | texture error | perimeter error | area error | smoothness error | compactness error | concavity error | concave points error | symmetry error | fractal dimension error | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| malignant | benign | 15.49 | 19.97 | 102.40 | 744.7 | 0.11600 | 0.15620 | 0.18910 | 0.09113 | 0.1929 | 0.06744 | 0.6470 | 1.3310 | 4.675 | 66.91 | 0.007269 | 0.02928 | 0.04972 | 0.01639 | 0.01852 | 0.004232 | 21.20 | 29.41 | 142.10 | 1359.0 | 0.1681 | 0.3913 | 0.55530 | 0.21210 | 0.3187 | 0.10190 |
| malignant | malignant | 17.01 | 20.26 | 109.70 | 904.3 | 0.08772 | 0.07304 | 0.06950 | 0.05390 | 0.2026 | 0.05223 | 0.5858 | 0.8554 | 4.106 | 68.46 | 0.005038 | 0.01503 | 0.01946 | 0.01123 | 0.02294 | 0.002581 | 19.80 | 25.05 | 130.00 | 1210.0 | 0.1111 | 0.1486 | 0.19320 | 0.10960 | 0.3275 | 0.06469 |
| malignant | malignant | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.71190 | 0.26540 | 0.4601 | 0.11890 |
| benign | benign | 14.53 | 13.98 | 93.86 | 644.2 | 0.10990 | 0.09242 | 0.06895 | 0.06495 | 0.1650 | 0.06121 | 0.3060 | 0.7213 | 2.143 | 25.70 | 0.006133 | 0.01251 | 0.01615 | 0.01136 | 0.02207 | 0.003563 | 15.80 | 16.93 | 103.10 | 749.9 | 0.1347 | 0.1478 | 0.13730 | 0.10690 | 0.2606 | 0.07810 |
| benign | benign | 10.26 | 14.71 | 66.20 | 321.6 | 0.09882 | 0.09159 | 0.03581 | 0.02037 | 0.1633 | 0.07005 | 0.3380 | 2.5090 | 2.394 | 19.33 | 0.017360 | 0.04671 | 0.02611 | 0.01296 | 0.03675 | 0.006758 | 10.88 | 19.48 | 70.89 | 357.1 | 0.1360 | 0.1636 | 0.07162 | 0.04074 | 0.2434 | 0.08488 |

Schema

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
)
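
The inference rule described in this section can be sketched in a few lines of plain Python. The resolve_feature_columns helper below is hypothetical and only mirrors the behavior described here; it is not Phoenix's own implementation, and it uses a shortened column list for readability.

```python
def resolve_feature_columns(dataframe_columns, schema_columns, feature_column_names=None):
    """Return the feature columns implied by the rule described in this section."""
    if feature_column_names is not None:
        # Features were listed explicitly, so nothing is inferred.
        return list(feature_column_names)
    named = set(schema_columns)
    # Keep dataframe order; every column the schema does not name is a feature.
    return [column for column in dataframe_columns if column not in named]


# A shortened version of the breast cancer columns.
columns = ["target", "predicted", "mean radius", "mean texture", "mean perimeter"]
features = resolve_feature_columns(columns, schema_columns=["predicted", "target"])
print(features)  # ['mean radius', 'mean texture', 'mean perimeter']
```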

Excluded Columns

You can tell Phoenix to ignore certain columns of your dataframe when implicitly inferring features by adding those column names to the excluded_column_names field of your schema. The dataframe below contains all the same data as the breast cancer dataset above, in addition to "hospital" and "insurance_provider" fields that are not features of your model. Explicitly exclude these fields; otherwise, Phoenix will assume that they are features.

Dataframe

| target | predicted | hospital | insurance_provider | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | radius error | texture error | perimeter error | area error | smoothness error | compactness error | concavity error | concave points error | symmetry error | fractal dimension error | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| malignant | benign | Pacific Clinics | uninsured | 15.49 | 19.97 | 102.40 | 744.7 | 0.11600 | 0.15620 | 0.18910 | 0.09113 | 0.1929 | 0.06744 | 0.6470 | 1.3310 | 4.675 | 66.91 | 0.007269 | 0.02928 | 0.04972 | 0.01639 | 0.01852 | 0.004232 | 21.20 | 29.41 | 142.10 | 1359.0 | 0.1681 | 0.3913 | 0.55530 | 0.21210 | 0.3187 | 0.10190 |
| malignant | malignant | Queens Hospital | Anthem Blue Cross | 17.01 | 20.26 | 109.70 | 904.3 | 0.08772 | 0.07304 | 0.06950 | 0.05390 | 0.2026 | 0.05223 | 0.5858 | 0.8554 | 4.106 | 68.46 | 0.005038 | 0.01503 | 0.01946 | 0.01123 | 0.02294 | 0.002581 | 19.80 | 25.05 | 130.00 | 1210.0 | 0.1111 | 0.1486 | 0.19320 | 0.10960 | 0.3275 | 0.06469 |
| malignant | malignant | St. Francis Memorial Hospital | Blue Shield of CA | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.71190 | 0.26540 | 0.4601 | 0.11890 |
| benign | benign | Pacific Clinics | Kaiser Permanente | 14.53 | 13.98 | 93.86 | 644.2 | 0.10990 | 0.09242 | 0.06895 | 0.06495 | 0.1650 | 0.06121 | 0.3060 | 0.7213 | 2.143 | 25.70 | 0.006133 | 0.01251 | 0.01615 | 0.01136 | 0.02207 | 0.003563 | 15.80 | 16.93 | 103.10 | 749.9 | 0.1347 | 0.1478 | 0.13730 | 0.10690 | 0.2606 | 0.07810 |
| benign | benign | CityMed | Anthem Blue Cross | 10.26 | 14.71 | 66.20 | 321.6 | 0.09882 | 0.09159 | 0.03581 | 0.02037 | 0.1633 | 0.07005 | 0.3380 | 2.5090 | 2.394 | 19.33 | 0.017360 | 0.04671 | 0.02611 | 0.01296 | 0.03675 | 0.006758 | 10.88 | 19.48 | 70.89 | 357.1 | 0.1360 | 0.1636 | 0.07162 | 0.04074 | 0.2434 | 0.08488 |

Schema

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    excluded_column_names=[
        "hospital",
        "insurance_provider",
    ],
)
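
To see what the exclusion changes, the following plain-Python sketch compares the columns that would be treated as features with and without excluded_column_names. It only illustrates the rule described in this section on a shortened column list; the variable names are chosen for this example and the logic is not taken from Phoenix's source.

```python
# A shortened version of the dataframe columns above.
columns = ["target", "predicted", "hospital", "insurance_provider", "mean radius", "mean texture"]

# Columns that the schema names explicitly as labels.
schema_columns = {"predicted", "target"}
excluded_column_names = {"hospital", "insurance_provider"}

# Without exclusions, every column the schema does not name counts as a feature.
features_without_exclusion = [c for c in columns if c not in schema_columns]

# With exclusions, the excluded columns are dropped from the inferred features.
features_with_exclusion = [
    c for c in columns if c not in schema_columns and c not in excluded_column_names
]

print(features_without_exclusion)
print(features_with_exclusion)
```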

Embedding Features

Embedding features consist of vector data in addition to any unstructured data in the form of text or images that the vectors represent. Unlike normal features, a single embedding feature may span multiple columns of your dataframe. Use px.EmbeddingColumnNames to associate multiple dataframe columns with the same embedding feature.
  • For a conceptual overview of embeddings, see Embeddings.
  • For a comprehensive description of px.EmbeddingColumnNames, see the API reference.
The examples in this section contain low-dimensional embeddings for the sake of easy viewing. In practice, your embeddings will typically have a much higher dimension.

Embedding Vectors

To define an embedding feature, you must at minimum provide Phoenix with the embedding vector data itself. Specify the dataframe column that contains this data in the vector_column_name field on px.EmbeddingColumnNames. For example, the dataframe below contains tabular credit card transaction data in addition to embedding vectors that represent each row. Notice that:
  • Unlike other fields that take strings or lists of strings, the argument to embedding_feature_column_names is a dictionary.
  • The key of this dictionary, "transaction_embeddings", is not a column of your dataframe but a name you choose for your embedding feature that appears in the UI.
  • The values of this dictionary are instances of px.EmbeddingColumnNames.
  • Each entry in the "embedding_vector" column is a list of length 4.

Dataframe

| predicted | target | embedding_vector | fico_score | merchant_id | loan_amount | annual_income | home_ownership | num_credit_lines | inquests_in_last_6_months | months_since_last_delinquency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fraud | not_fraud | [-0.97, 3.98, -0.03, 2.92] | 604 | Leannon Ward | 22000 | 100781 | RENT | 108 | 0 | 0 |
| fraud | not_fraud | [3.20, 3.95, 2.81, -0.09] | 612 | Scammeds | 7500 | 116184 | MORTGAGE | 42 | 2 | 56 |
| not_fraud | not_fraud | [-0.49, -0.62, 0.08, 2.03] | 646 | Leannon Ward | 32000 | 73666 | RENT | 131 | 0 | 0 |
| not_fraud | not_fraud | [1.69, 0.01, -0.76, 3.64] | 560 | Kirlin and Sons | 19000 | 38589 | MORTGAGE | 131 | 0 | 0 |
| uncertain | uncertain | [1.46, 0.69, 3.26, -0.17] | 636 | Champlin and Sons | 10000 | 100251 | MORTGAGE | 10 | 0 | 3 |

Schema

schema = px.Schema(
    prediction_label_column_name="predicted",
    actual_label_column_name="target",
    embedding_feature_column_names={
        "transaction_embeddings": px.EmbeddingColumnNames(
            vector_column_name="embedding_vector"
        ),
    },
)
The features in this example are implicitly inferred to be the columns of the dataframe that do not appear in the schema.
To compare embeddings, Phoenix uses metrics such as Euclidean distance that can only be computed between vectors of the same length. Ensure that all embedding vectors for a particular embedding feature are one-dimensional arrays of the same length; otherwise, Phoenix will throw an error.
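
One way to catch this problem before creating a dataset is to check the vectors yourself. The check_embedding_vectors helper below is a hypothetical sketch, not a Phoenix function; it assumes each vector is a sequence of plain numbers and returns the shared length, raising a ValueError if any vector is nested or if the lengths differ.

```python
from numbers import Real


def check_embedding_vectors(vectors):
    """Return the shared vector length, or raise ValueError if the vectors are inconsistent."""
    lengths = set()
    for index, vector in enumerate(vectors):
        # One-dimensional means every element is a number, not a nested sequence.
        if not all(isinstance(value, Real) for value in vector):
            raise ValueError(f"vector at row {index} is not one-dimensional")
        lengths.add(len(vector))
    if len(lengths) > 1:
        raise ValueError(f"vectors have differing lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


# The first two vectors from the "embedding_vector" column above.
vectors = [[-0.97, 3.98, -0.03, 2.92], [3.20, 3.95, 2.81, -0.09]]
dimension = check_embedding_vectors(vectors)
print(dimension)  # 4
```

With a dataframe, the same check could be run on the vector column, for example check_embedding_vectors(df["embedding_vector"]).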

Embeddings of Images

If your embeddings represent images, you can provide links or local paths to image files you want to display in the app by using the link_to_data_column_name field on px.EmbeddingColumnNames. The following example contains data for an image classification model that detects product defects on an assembly line.

Dataframe

| defective | image | image_vector |
| --- | --- | --- |
| okay | https://www.example.com/image0.jpeg | [1.73, 2.67, 2.91, 1.79, 1.29] |
| defective | https://www.example.com/image1.jpeg | [2.18, -0.21, 0.87, 3.84, -0.97] |
| okay | https://www.example.com/image2.jpeg | [3.36, -0.62, 2.40, -0.94, 3.69] |
| defective | https://www.example.com/image3.jpeg | [2.77, 2.79, 3.36, 0.60, 3.10] |
| okay | https://www.example.com/image4.jpeg | [1.79, 2.06, 0.53, 3.58, 0.24] |

Schema

schema = px.Schema(
    actual_label_column_name="defective",
    embedding_feature_column_names={
        "image_embedding": px.EmbeddingColumnNames(
            vector_column_name="image_vector",
            link_to_data_column_name="image",
        ),
    },
)

Local Images

For local image data, we recommend the following steps to serve your images via a local HTTP server:
  1. In your terminal, navigate to a directory containing your image data and run python -m http.server 8000.
  2. Add URLs of the form "http://localhost:8000/rel/path/to/image.jpeg" to the appropriate column of your dataframe.
For example, suppose your HTTP server is running in a directory with the following contents:
.
└── image-data
    └── example_image.jpeg
Then your image URL would be http://localhost:8000/image-data/example_image.jpeg.
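
If you have many local images, the URLs can be generated rather than typed by hand. The sketch below assumes POSIX-style paths and a server started in a hypothetical /data directory; local_image_url is an illustrative helper, not part of Phoenix.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def local_image_url(server_root, image_path, port=8000):
    """Build the URL for an image served by `python -m http.server` from server_root."""
    # Path of the image relative to the directory the server was started in.
    relative_path = PurePosixPath(image_path).relative_to(server_root)
    # Percent-encode spaces and other unsafe characters, keeping "/" separators.
    return f"http://localhost:{port}/{quote(relative_path.as_posix())}"


url = local_image_url("/data", "/data/image-data/example_image.jpeg")
print(url)  # http://localhost:8000/image-data/example_image.jpeg
```

The resulting strings can then be stored in the dataframe column that link_to_data_column_name points to.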

Embeddings of Text

If your embeddings represent pieces of text, you can display that text in the app by using the raw_data_column_name field on px.EmbeddingColumnNames. The embeddings below were generated by a sentiment classification model trained on product reviews.

Dataframe

| name | text | text_vector | category | sentiment |
| --- | --- | --- | --- | --- |
| Magic Lamp | Makes a great desk lamp! | [2.66, 0.89, 1.17, 2.21] | office | positive |
| Ergo Desk Chair | This chair is pretty comfortable, but I wish it had better back support. | [3.33, 1.14, 2.57, 2.88] | office | neutral |
| Cloud Nine Mattress | I've been sleeping like a baby since I bought this thing. | [2.5, 3.74, 0.04, -0.94] | bedroom | positive |
| Dr. Fresh's Spearmint Toothpaste | Avoid at all costs, it tastes like soap. | [1.78, -0.24, 1.37, 2.6] | personal_hygiene | negative |
| Ultra-Fuzzy Bath Mat | Cheap quality, began fraying at the edges after the first wash. | [2.71, 0.98, -0.22, 2.1] | bath | negative |

Schema

schema = px.Schema(
    actual_label_column_name="sentiment",
    feature_column_names=[
        "category",