Schemas and Datasets

Learn the foundational concepts of the Phoenix API and Application

This section introduces datasets and schemas, the starting concepts needed to use Phoenix.

For comprehensive descriptions of phoenix.Dataset and phoenix.Schema, see the API reference.
For tips on creating your own Phoenix datasets and schemas, see the how-to guide.

Datasets

A Phoenix dataset is an instance of phoenix.Dataset that contains three pieces of information:

The data itself (a pandas dataframe)
A schema (a phoenix.Schema instance) that describes the columns of your dataframe
A dataset name that appears in the UI

For example, if you have a dataframe prod_df that is described by a schema prod_schema, you can define a dataset prod_ds with

import phoenix as px
prod_ds = px.Dataset(prod_df, prod_schema, "production")

If you launch Phoenix with this dataset, you will see a dataset named "production" in the UI.

How many datasets do I need?

You can launch Phoenix with zero, one, or two datasets.

With no datasets, Phoenix runs in the background and collects trace data emitted by your instrumented LLM application. With a single dataset, Phoenix provides insights into model performance and data quality. With two datasets, Phoenix compares your datasets and gives insights into drift in addition to model performance and data quality, or helps you debug your retrieval-augmented generation applications.

Which dataset is which?

Your reference dataset provides a baseline against which to compare your primary dataset.

To compare two datasets with Phoenix, you must select one dataset as primary and one to serve as a reference. As the name suggests, your primary dataset contains the data you care about most, perhaps because your model's performance on this data directly affects your customers or users. Your reference dataset, in contrast, is usually of secondary importance and serves as a baseline against which to compare your primary dataset.

Very often, your primary dataset will contain production data and your reference dataset will contain training data. However, that's not always the case; you can imagine a scenario where you want to check your test set for drift relative to your training data, or use your test set as a baseline against which to compare your production data. When choosing primary and reference datasets, it matters less where your data comes from than how important the data is and what role the data serves relative to your other data.
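To make the idea of a baseline concrete, here is a minimal sketch that compares one numeric feature in a primary sample against the same feature in a reference sample. The values and the mean_shift helper are hypothetical, and a difference of means is only an illustration; it is not the drift metric Phoenix computes.

```python
from statistics import mean

# Hypothetical values of one feature in each dataset
reference_values = [1.4, 1.5, 1.3, 1.6]  # baseline, e.g. training data
primary_values = [1.9, 2.1, 2.0, 2.2]    # data you care about, e.g. production


def mean_shift(primary, reference):
    """Return how far the primary mean has moved from the reference mean."""
    return mean(primary) - mean(reference)


shift = mean_shift(primary_values, reference_values)
print(f"mean shift relative to reference: {shift:+.2f}")
```

Swapping the two arguments flips the sign of the result, which is why it matters which dataset you designate as the reference.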

Corpus Dataset (Information Retrieval)

Schemas

A Phoenix schema is an instance of phoenix.Schema that describes the columns of your dataframe. For example, if you have a dataframe containing Fisher's Iris data that looks like this:

sepal_length  sepal_width  petal_length  petal_width  target      prediction
7.7           3.0          6.1           2.3          virginica   versicolor
5.4           3.9          1.7           0.4          setosa      setosa
6.3           3.3          4.7           1.6          versicolor  versicolor
6.2           3.4          5.4           2.3          virginica   setosa
5.8           2.7          5.1           1.9          virginica   virginica

your schema might look like this:

schema = px.Schema(
    feature_column_names=[
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
    ],
    actual_label_column_name="target",
    prediction_label_column_name="prediction",
)
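Because a schema refers to dataframe columns by name, a misspelled column name is an easy mistake to make. The sketch below checks that every column a schema names is present before you build a dataset. It uses a plain dict that mirrors the px.Schema keyword arguments above, and missing_columns is a hypothetical helper, not part of Phoenix.

```python
# Keyword arguments mirroring the px.Schema call above, kept as a plain dict
schema_fields = {
    "feature_column_names": [
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
    ],
    "actual_label_column_name": "target",
    "prediction_label_column_name": "prediction",
}


def missing_columns(schema_fields, dataframe_columns):
    """Return column names the schema refers to that the dataframe lacks."""
    referenced = []
    for value in schema_fields.values():
        if isinstance(value, str):
            referenced.append(value)   # a single column name
        else:
            referenced.extend(value)   # a list of column names
    available = set(dataframe_columns)
    return [name for name in referenced if name not in available]


# With a real dataframe you would pass list(df.columns) here
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target", "prediction"]
print(missing_columns(schema_fields, columns))       # every column is present
print(missing_columns(schema_fields, columns[:-1]))  # the prediction column is absent
```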

How many schemas do I need?

Usually one, sometimes two.

Each dataset needs a schema. If your primary and reference datasets have the same format, then you only need one schema. For example, if you have dataframes train_df and prod_df that share an identical format described by a schema named schema, then you can define datasets train_ds and prod_ds with

train_ds = px.Dataset(train_df, schema, "training")
prod_ds = px.Dataset(prod_df, schema, "production")

Sometimes, you'll encounter scenarios where the formats of your primary and reference datasets differ. For example, you'll need two schemas if:

Your production data has timestamps indicating the time at which an inference was made, but your training data does not.
A new version of your model has a differing set of features from a previous version.

In cases like these, you'll need to define two schemas, one for each dataset. For example, if you have dataframes train_df and prod_df that are described by schemas train_schema and prod_schema, respectively, then you can define datasets train_ds and prod_ds with

train_ds = px.Dataset(train_df, train_schema, "training")
prod_ds = px.Dataset(prod_df, prod_schema, "production")
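When the two formats differ only slightly, one way to keep the schemas in sync is to derive the second from the first. The sketch below assumes production data adds a timestamp column; the column name "prediction_ts" is hypothetical. It builds the keyword arguments as plain dicts so it runs without Phoenix installed, and the comments show how they would be passed to px.Schema.

```python
# Fields shared by the training and production data
train_fields = {
    "feature_column_names": [
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
    ],
    "actual_label_column_name": "target",
    "prediction_label_column_name": "prediction",
}

# Production data also records when each inference was made.
# "prediction_ts" is a hypothetical column name.
prod_fields = {**train_fields, "timestamp_column_name": "prediction_ts"}

# With Phoenix installed, these dicts would become the two schemas:
#   train_schema = px.Schema(**train_fields)
#   prod_schema = px.Schema(**prod_fields)
print(sorted(set(prod_fields) - set(train_fields)))
```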

Schema for Corpus Dataset (Information Retrieval)

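# Describe the corpus dataframe, which holds one row per document
corpus_schema = px.Schema(
    id_column_name="id",
    document_column_names=px.EmbeddingColumnNames(
        vector_column_name="embedding",  # column holding the document embeddings
        raw_data_column_name="text",  # column holding the document text
    ),
)
corpus_ds = px.Dataset(corpus_df, corpus_schema)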

Application

Phoenix runs as an application that can be viewed in a web browser tab or within your notebook as a cell. To launch the app, pass your datasets to the launch_app function:

# two datasets: primary first, then reference
session = px.launch_app(prod_ds, train_ds)
# or just one dataset
session = px.launch_app(prod_ds)
# or with a corpus dataset
session = px.launch_app(prod_ds, corpus=corpus_ds)
