Import Your Data
How to create Phoenix datasets and schemas for common data formats
This guide shows you how to define a Phoenix dataset using your own data.
- For a conceptual overview of the Phoenix API, including a high-level introduction to the notion of datasets and schemas, see Phoenix Basics.
Once you have a pandas dataframe
df
containing your data and a schema
object describing the format of your dataframe, you can define your Phoenix dataset either by runningds = px.Dataset(df, schema)
or by optionally providing a name for your dataset that will appear in the UI:
ds = px.Dataset(df, schema, name="training")
As you can see, instantiating your dataset is the easy part. Before you run the code above, you must first wrangle your data into a pandas dataframe and then create a Phoenix schema to describe the format of your dataframe. The rest of this guide shows you how to match your schema to your dataframe with concrete examples.
Let's first see how to define a schema with predictions and actuals (Phoenix's nomenclature for ground truth). The example dataframe below contains inference data from a binary classification model trained to predict whether a user will click on an advertisement. The timestamps are
datetime.datetime
objects that represent the time at which each inference was made in production.timestamp | prediction_score | prediction | target |
---|---|---|---|
2023-03-01 02:02:19 | 0.91 | click | click |
2023-02-17 23:45:48 | 0.37 | no_click | no_click |
2023-01-30 15:30:03 | 0.54 | click | no_click |
2023-02-03 19:56:09 | 0.74 | click | click |
2023-02-24 04:23:43 | 0.37 | no_click | click |
schema = px.Schema(
timestamp_column_name="timestamp",
prediction_score_column_name="prediction_score",
prediction_label_column_name="prediction",
actual_label_column_name="target",
)
This schema defines predicted and actual labels and scores, but you can run Phoenix with any subset of those fields, e.g., with only predicted labels.
Phoenix accepts not only predictions and ground truth but also input features of your model and tags that describe your data. In the example below, features such as FICO score and merchant ID are used to predict whether a credit card transaction is legitimate or fraudulent. In contrast, tags such as age and gender are not model inputs, but are used to filter your data and analyze meaningful cohorts in the app.
fico_score | merchant_id | loan_amount | annual_income | home_ownership | num_credit_lines | inquests_in_last_6_months | months_since_last_delinquency | age | gender | predicted | target |
---|---|---|---|---|---|---|---|---|---|---|---|
578 | Scammeds | 4300 | 62966 | RENT | 110 | 0 | 0 | 25 | male | not_fraud | fraud |
507 | Schiller Ltd | 21000 | 52335 | RENT | 129 | 0 | 23 | 78 | female | not_fraud | not_fraud |
656 | Kirlin and Sons | 18000 | 94995 | MORTGAGE | 31 | 0 | 0 | 54 | female | uncertain | uncertain |
414 | Scammeds | 18000 | 32034 | LEASE | 81 | 2 | 0 | 34 | male | fraud | not_fraud |
512 | Champlin and Sons | 20000 | 46005 | OWN | 148 | 1 | 0 | 49 | male | uncertain | uncertain |
schema = px.Schema(
prediction_label_column_name="predicted",
actual_label_column_name="target",
feature_column_names=[
"fico_score",
"merchant_id",
"loan_amount",
"annual_income",
"home_ownership",
"num_credit_lines",
"inquests_in_last_6_months",
"months_since_last_delinquency",
],
tag_column_names=[
"age",
"gender",
],
)
If your data has a large number of features, it can be inconvenient to list them all. For example, the breast cancer dataset below contains 30 features that can be used to predict whether a breast mass is malignant or benign. Instead of explicitly listing each feature, you can leave the
feature_column_names
field of your schema set to its default value of None
, in which case, any columns of your dataframe that do not appear in your schema are implicitly assumed to be features.target | predicted | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | radius error | texture error | perimeter error | area error | smoothness error | compactness error | concavity error | concave points error | symmetry error | fractal dimension error | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
malignant | benign | 15.49 | 19.97 | 102.40 | 744.7 | 0.11600 | 0.15620 | 0.18910 | 0.09113 | 0.1929 | 0.06744 | 0.6470 | 1.3310 | 4.675 | 66.91 | 0.007269 | 0.02928 | 0.04972 | 0.01639 | 0.01852 | 0.004232 | 21.20 | 29.41 | 142.10 | 1359.0 | 0.1681 | 0.3913 | 0.55530 | 0.21210 | 0.3187 | 0.10190 |
malignant | malignant | 17.01 | 20.26 | 109.70 | 904.3 | 0.08772 | 0.07304 | 0.06950 | 0.05390 | 0.2026 | 0.05223 | 0.5858 | 0.8554 | 4.106 | 68.46 | 0.005038 | 0.01503 | 0.01946 | 0.01123 | 0.02294 | 0.002581 | 19.80 | 25.05 | 130.00 | 1210.0 | 0.1111 | 0.1486 | 0.19320 | 0.10960 | 0.3275 | 0.06469 |
malignant | malignant | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.71190 | 0.26540 | 0.4601 | 0.11890 |
benign | benign | 14.53 | 13.98 | 93.86 | 644.2 | 0.10990 | 0.09242 | 0.06895 | 0.06495 | 0.1650 | 0.06121 | 0.3060 | 0.7213 | 2.143 | 25.70 | 0.006133 | 0.01251 | 0.01615 | 0.01136 | 0.02207 | 0.003563 | 15.80 | 16.93 | 103.10 | 749.9 | 0.1347 | 0.1478 | 0.13730 | 0.10690 | 0.2606 | 0.07810 |
benign | benign | 10.26 | 14.71 | 66.20 | 321.6 | 0.09882 | 0.09159 | 0.03581 | 0.02037 | 0.1633 | 0.07005 | 0.3380 | 2.5090 | 2.394 | 19.33 | 0.017360 | 0.04671 | 0.02611 | 0.01296 | 0.03675 | 0.006758 | 10.88 | 19.48 | 70.89 | 357.1 | 0.1360 | 0.1636 | 0.07162 | 0.04074 | 0.2434 | 0.08488 |
schema = px.Schema(
prediction_label_column_name="predicted",
actual_label_column_name="target",
)
You can tell Phoenix to ignore certain columns of your dataframe when implicitly inferring features by adding those column names to the
excluded_column_names
field of your schema. The dataframe below contains all the same data as the breast cancer dataset above, in addition to "hospital" and "insurance_provider" fields that are not features of your model. Explicitly exclude these fields, otherwise, Phoenix will assume that they are features.target | predicted | hospital | insurance_provider | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | radius error | texture error | perimeter error | area error | smoothness error | compactness error | concavity error | concave points error | symmetry error | fractal dimension error | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
malignant | benign | Pacific Clinics | uninsured | 15.49 | 19.97 | 102.40 | 744.7 | 0.11600 | 0.15620 | 0.18910 | 0.09113 | 0.1929 | 0.06744 | 0.6470 | 1.3310 | 4.675 | 66.91 | 0.007269 | 0.02928 | 0.04972 | 0.01639 | 0.01852 | 0.004232 | 21.20 | 29.41 | 142.10 | 1359.0 | 0.1681 | 0.3913 | 0.55530 | 0.21210 | 0.3187 | 0.10190 |
malignant | malignant | Queens Hospital | Anthem Blue Cross | 17.01 | 20.26 | 109.70 | 904.3 | 0.08772 | 0.07304 | 0.06950 | 0.05390 | 0.2026 | 0.05223 | 0.5858 | 0.8554 | 4.106 | 68.46 | 0.005038 | 0.01503 | 0.01946 | 0.01123 | 0.02294 | 0.002581 | 19.80 | 25.05 | 130.00 | 1210.0 | 0.1111 | 0.1486 | 0.19320 | 0.10960 | 0.3275 | 0.06469 |
malignant | malignant | St. Francis Memorial Hospital | Blue Shield of CA | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.71190 | 0.26540 | 0.4601 | 0.11890 |
benign | benign | Pacific Clinics | Kaiser Permanente | 14.53 | 13.98 | 93.86 | 644.2 | 0.10990 | 0.09242 | 0.06895 | 0.06495 | 0.1650 | 0.06121 | 0.3060 | 0.7213 | 2.143 | 25.70 | 0.006133 | 0.01251 | 0.01615 | 0.01136 | 0.02207 | 0.003563 | 15.80 | 16.93 | 103.10 | 749.9 | 0.1347 | 0.1478 | 0.13730 | 0.10690 | 0.2606 | 0.07810 |
benign | benign | CityMed | Anthem Blue Cross | 10.26 | 14.71 | 66.20 | 321.6 | 0.09882 | 0.09159 | 0.03581 | 0.02037 | 0.1633 | 0.07005 | 0.3380 | 2.5090 | 2.394 | 19.33 | 0.017360 | 0.04671 | 0.02611 | 0.01296 | 0.03675 | 0.006758 | 10.88 | 19.48 | 70.89 | 357.1 | 0.1360 | 0.1636 | 0.07162 | 0.04074 | 0.2434 | 0.08488 |
schema = px.Schema(
prediction_label_column_name="predicted",
actual_label_column_name="target",
excluded_column_names=[
"hospital",
"insurance_provider",
],
)
Embedding features consist of vector data in addition to any unstructured data in the form of text or images that the vectors represent. Unlike normal features, a single embedding feature may span multiple columns of your dataframe. Use
px.EmbeddingColumnNames
to associate multiple dataframe columns with the same embedding feature.The example in this section contain low-dimensional embeddings for the sake of easy viewing. Your embeddings in practice will typically have much higher dimension.
To define an embedding feature, you must at minimum provide Phoenix with the embedding vector data itself. Specify the dataframe column that contains this data in the
vector_column_name
field on px.EmbeddingColumnNames
. For example, the dataframe below contains tabular credit card transaction data in addition to embedding vectors that represent each row. Notice that:- Unlike other fields that take strings or lists of strings, the argument to
embedding_feature_column_names
is a dictionary. - The key of this dictionary, "transaction_embedding," is not a column of your dataframe but is name you choose for your embedding feature that appears in the UI.
- The values of this dictionary are instances of
px.EmbeddingColumnNames
. - Each entry in the "embedding_vector" column is a list of length 4.
predicted | target | embedding_vector | fico_score | merchant_id | loan_amount | annual_income | home_ownership | num_credit_lines | inquests_in_last_6_months | months_since_last_delinquency |
---|---|---|---|---|---|---|---|---|---|---|
fraud | not_fraud | [-0.97, 3.98, -0.03, 2.92] | 604 | Leannon Ward | 22000 | 100781 | RENT | 108 | 0 | 0 |
fraud | not_fraud | [3.20, 3.95, 2.81, -0.09] | 612 | Scammeds | 7500 | 116184 | MORTGAGE | 42 | 2 | 56 |
not_fraud | not_fraud | [-0.49, -0.62, 0.08, 2.03] | 646 | Leannon Ward | 32000 | 73666 | RENT | 131 | 0 | 0 |
not_fraud | not_fraud | [1.69, 0.01, -0.76, 3.64] | 560 | Kirlin and Sons | 19000 | 38589 | MORTGAGE | 131 | 0 | 0 |
uncertain | uncertain | [1.46, 0.69, 3.26, -0.17] | 636 | Champlin and Sons | 10000 | 100251 | MORTGAGE | 10 | 0 | 3 |
schema = px.Schema(
prediction_label_column_name="predicted",
actual_label_column_name="target",
embedding_feature_column_names={
"transaction_embeddings": px.EmbeddingColumnNames(
vector_column_name="embedding_vector"
),
},
)
The features in this example are implicitly inferred to be the columns of the dataframe that do not appear in the schema.
To compare embeddings, Phoenix uses metrics such as Euclidean distance that can only be computed between vectors of the same length. Ensure that all embedding vectors for a particular embedding feature are one-dimensional arrays of the same length, otherwise, Phoenix will throw an error.
If your embeddings represent images, you can provide links or local paths to image files you want to display in the app by using the
link_to_data_column_name
field on px.EmbeddingColumnNames
. The following example contains data for an image classification model that detects product defects on an assembly line.defective | image | image_vector |
---|---|---|
okay | https://www.example.com/image0.jpeg | [1.73, 2.67, 2.91, 1.79, 1.29] |
defective | https://www.example.com/image1.jpeg | [2.18, -0.21, 0.87, 3.84, -0.97] |
okay | https://www.example.com/image2.jpeg | [3.36, -0.62, 2.40, -0.94, 3.69] |
defective | https://www.example.com/image3.jpeg | [2.77, 2.79, 3.36, 0.60, 3.10] |
okay | https://www.example.com/image4.jpeg | [1.79, 2.06, 0.53, 3.58, 0.24] |
schema = px.Schema(
actual_label_column_name="defective",
embedding_feature_column_names={
"image_embedding": px.EmbeddingColumnNames(
vector_column_name="image_vector",
link_to_data_column_name="image",
),
},
)
For local image data, we recommend the following steps to serve your images via a local HTTP server:
- 1.In your terminal, navigate to a directory containing your image data and run
python -m http.server 8000
. - 2.Add URLs of the form "http://localhost:8000/rel/path/to/image.jpeg" to the appropriate column of your dataframe.
For example, suppose your HTTP server is running in a directory with the following contents:
.
└── image-data
└── example_image.jpeg
Then your image URL would be http://localhost:8000/image-data/example_image.jpeg.
If your embeddings represent pieces of text, you can display that text in the app by using the
raw_data_column_name
field on px.EmbeddingColumnNames
. The embeddings below were generated by a sentiment classification model trained on product reviews.name | text | text_vector | category | sentiment |
---|---|---|---|---|
Magic Lamp | Makes a great desk lamp! | [2.66, 0.89, 1.17, 2.21] | office | positive |
Ergo Desk Chair | This chair is pretty comfortable, but I wish it had better back support. | [3.33, 1.14, 2.57, 2.88] | office | neutral |
Cloud Nine Mattress | I've been sleeping like a baby since I bought this thing. | [2.5, 3.74, 0.04, -0.94] | bedroom | positive |
Dr. Fresh's Spearmint Toothpaste | Avoid at all costs, it tastes like soap. | [1.78, -0.24, 1.37, 2.6] | personal_hygiene | negative |
Ultra-Fuzzy Bath Mat | Cheap quality, began fraying at the edges after the first wash. | [2.71, 0.98, -0.22, 2.1] | bath | negative |
schema = px.Schema(
actual_label_column_name="sentiment",
feature_column_names=[
"category",