Dataset and Schema

Detailed descriptions of classes and methods related to Phoenix datasets and schemas

Last updated 1 year ago

phoenix.Dataset

class Dataset(
    dataframe: pandas.DataFrame,
    schema: Schema,
    name: Optional[str] = None,
)

A dataset containing a split or cohort of data to be analyzed independently or compared to another cohort. Common examples include training, validation, test, or production datasets.

Parameters

dataframe (pandas.DataFrame): The data to be analyzed or compared.
schema (Schema): A schema that assigns the columns of the dataframe to the appropriate model dimensions (features, predictions, actuals, etc.).
name (Optional[str]): The name used to identify the dataset in the application. If not provided, a random name will be generated.

Attributes

dataframe (pandas.DataFrame): The pandas dataframe of the dataset.
schema (Schema): The schema of the dataset.
name (str): The name of the dataset.

The input dataframe and schema are lightly processed during dataset initialization and are not necessarily identical to the corresponding dataframe and schema attributes.

Usage

Define a dataset ds from a pandas dataframe df and a schema object schema by running:

ds = px.Dataset(df, schema)

Alternatively, provide a name for the dataset that will appear in the application:

ds = px.Dataset(df, schema, name="training")

phoenix.Schema

class Schema(
    prediction_id_column_name: Optional[str] = None,
    timestamp_column_name: Optional[str] = None,
    feature_column_names: Optional[List[str]] = None,
    tag_column_names: Optional[List[str]] = None,
    prediction_label_column_name: Optional[str] = None,
    prediction_score_column_name: Optional[str] = None,
    actual_label_column_name: Optional[str] = None,
    actual_score_column_name: Optional[str] = None,
    prompt_column_names: Optional[EmbeddingColumnNames] = None,
    response_column_names: Optional[EmbeddingColumnNames] = None,
    embedding_feature_column_names: Optional[Dict[str, EmbeddingColumnNames]] = None,
    excluded_column_names: Optional[List[str]] = None,
)

Assigns the columns of a pandas dataframe to the appropriate model dimensions (predictions, actuals, features, etc.). Each column of the dataframe should appear in the corresponding schema at most once.

Parameters

timestamp_column_name (Optional[str]): The name of the dataframe's timestamp column, if one exists. Timestamp columns must be pandas Series with numeric, datetime or object dtypes.
- If the timestamp column has numeric dtype (int or float), the entries of the column are interpreted as Unix timestamps, i.e., the number of seconds since midnight UTC on January 1st, 1970.
- If the column has datetime dtype and contains timezone-naive timestamps, Phoenix assumes those timestamps belong to the local timezone and converts them to UTC.
- If the column has datetime dtype and contains timezone-aware timestamps, those timestamps are converted to UTC.
- If the column has object dtype and contains ISO 8601 formatted timestamp strings, those entries are converted to UTC timestamps with datetime dtype. Timezone-naive strings are assumed to belong to the local timezone.
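The conversion rules above can be mirrored for a single value with Python's standard library. This is only an illustrative sketch: to_utc is a hypothetical helper, not part of Phoenix, which applies these rules to whole pandas columns rather than to individual values.

```python
from datetime import datetime, timezone


def to_utc(value):
    """Normalize one timestamp entry to a timezone-aware UTC datetime.

    Hypothetical helper that mirrors the documented rules; not Phoenix code.
    """
    if isinstance(value, (int, float)):
        # Numeric entries are Unix timestamps (seconds since the epoch).
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if isinstance(value, str):
        # ISO 8601 strings are parsed, then treated like datetime entries.
        value = datetime.fromisoformat(value)
    # astimezone() treats a timezone-naive datetime as local time and
    # converts a timezone-aware one directly.
    return value.astimezone(timezone.utc)


print(to_utc(0))                            # the Unix epoch, in UTC
print(to_utc("2023-01-01T12:00:00+02:00"))  # aware string, shifted to UTC
print(to_utc(datetime(2023, 1, 1, 12, 0)))  # naive, assumed to be local time
```

The result for a timezone-naive input depends on the timezone of the machine running the code, which is why storing timezone-aware timestamps is usually the safer choice.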
feature_column_names (Optional[List[str]]): The names of the dataframe's feature columns, if any exist. If no feature column names are provided, all dataframe column names that are not included elsewhere in the schema and are not explicitly excluded in excluded_column_names are assumed to be features.
tag_column_names (Optional[List[str]]): The names of the dataframe's tag columns, if any exist. Tags, like features, are attributes that can be used for filtering records of the dataset while using the app. Unlike features, tags are not model inputs and are not used for computing metrics.
prediction_label_column_name (Optional[str]): The name of the dataframe's predicted label column, if one exists. Predicted labels are used for classification problems with categorical model output.
prediction_score_column_name (Optional[str]): The name of the dataframe's predicted score column, if one exists. Predicted scores are used for regression problems with continuous numerical model output.
actual_label_column_name (Optional[str]): The name of the dataframe's actual label column, if one exists. Actual (i.e., ground truth) labels are used for classification problems with categorical model output.
actual_score_column_name (Optional[str]): The name of the dataframe's actual score column, if one exists. Actual (i.e., ground truth) scores are used for regression problems with continuous numerical output.
excluded_column_names (Optional[List[str]]): The names of the dataframe columns to be excluded from the implicitly inferred list of feature column names. This field should only be used for implicit feature discovery, i.e., when feature_column_names is unused and the dataframe contains feature columns not explicitly included in the schema.
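The implicit feature discovery described for feature_column_names and excluded_column_names amounts to a set difference over the dataframe's columns. A minimal sketch in plain Python follows; infer_feature_columns and all column names are hypothetical and only illustrate the rule, not Phoenix's internal code.

```python
def infer_feature_columns(dataframe_columns, assigned_columns, excluded_column_names=None):
    """Return the columns that would be treated as features implicitly.

    A column counts as a feature when it is neither assigned elsewhere in
    the schema nor listed in excluded_column_names. Hypothetical helper.
    """
    assigned = set(assigned_columns)
    excluded = set(excluded_column_names or [])
    # Preserve the dataframe's column order in the result.
    return [
        column
        for column in dataframe_columns
        if column not in assigned and column not in excluded
    ]


columns = ["prediction_id", "prediction_ts", "age", "income", "debug_notes", "predicted_label"]
assigned = ["prediction_id", "prediction_ts", "predicted_label"]
print(infer_feature_columns(columns, assigned, excluded_column_names=["debug_notes"]))
# prints ['age', 'income']
```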

Usage
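
As a hedged illustration, a schema for a classification model might be declared as follows. All column names are hypothetical and must match columns in your own dataframe; the keyword arguments are the ones listed in the class signature. Calling build_schema requires the Phoenix package to be installed.

```python
# Hypothetical column names; replace them with the columns of your dataframe.
FEATURE_COLUMNS = ["age", "income", "tenure_months"]
TAG_COLUMNS = ["region"]


def build_schema():
    """Build a schema for a hypothetical classification dataframe."""
    import phoenix as px  # requires the Phoenix package to be installed

    return px.Schema(
        prediction_id_column_name="prediction_id",
        timestamp_column_name="prediction_ts",
        feature_column_names=FEATURE_COLUMNS,
        tag_column_names=TAG_COLUMNS,
        prediction_label_column_name="predicted_label",
        actual_label_column_name="actual_label",
    )
```

Each column name appears only once, in line with the rule that a dataframe column should be assigned to at most one place in the schema. The resulting schema can then be passed to px.Dataset together with the dataframe.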

phoenix.EmbeddingColumnNames

class EmbeddingColumnNames(
    vector_column_name: str,
    raw_data_column_name: Optional[str] = None,
    link_to_data_column_name: Optional[str] = None,
)

Parameters

vector_column_name (str): The name of the dataframe column containing the embedding vector data. Each entry in the column must be a list, one-dimensional NumPy array, or pandas Series containing numeric values (floats or ints) and must have equal length to all the other entries in the column.
raw_data_column_name (Optional[str]): The name of the dataframe column containing the raw text associated with an embedding feature, if such a column exists. This field is used when an embedding feature describes a piece of text, for example, in the context of NLP.
link_to_data_column_name (Optional[str]): The name of the dataframe column containing links to images associated with an embedding feature, if such a column exists. This field is used when an embedding feature describes an image, for example, in the context of computer vision.
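
The requirement on vector_column_name entries (numeric values, equal length across the column) can be checked before a dataset is built. The sketch below is an illustration using plain Python lists; check_embedding_vectors is a hypothetical helper and not a Phoenix function.

```python
from numbers import Real


def check_embedding_vectors(vectors):
    """Return the shared vector length, or raise ValueError on a bad entry.

    An entry is rejected when it contains non-numeric values or when its
    length differs from the first entry. Hypothetical pre-flight check.
    """
    expected_length = None
    for index, vector in enumerate(vectors):
        if not all(isinstance(x, Real) for x in vector):
            raise ValueError(f"entry {index} contains non-numeric values")
        if expected_length is None:
            expected_length = len(vector)
        elif len(vector) != expected_length:
            raise ValueError(
                f"entry {index} has length {len(vector)}, expected {expected_length}"
            )
    return expected_length


print(check_embedding_vectors([[0.1, 0.2, 0.3], [1, 2, 3]]))  # prints 3
```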

Usage
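
A hedged sketch of how embedding columns might be declared, using only the parameters documented for EmbeddingColumnNames and Schema. The embedding feature names and column names are hypothetical; calling build_embedding_schema requires the Phoenix package to be installed.

```python
def build_embedding_schema():
    """Build a schema with one text and one image embedding feature."""
    import phoenix as px  # requires the Phoenix package to be installed

    return px.Schema(
        prediction_label_column_name="predicted_label",
        embedding_feature_column_names={
            # Text embedding: vectors plus the raw text they describe.
            "text_embedding": px.EmbeddingColumnNames(
                vector_column_name="text_vector",
                raw_data_column_name="text",
            ),
            # Image embedding: vectors plus a link to each image.
            "image_embedding": px.EmbeddingColumnNames(
                vector_column_name="image_vector",
                link_to_data_column_name="image_url",
            ),
        },
    )
```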

phoenix.TraceDataset

class TraceDataset(
    dataframe: pandas.DataFrame,
    name: Optional[str] = None,
)

Parameters

name (Optional[str]): The name used to identify the dataset in the application. If not provided, a random name will be generated.

Attributes

name (str): The name used to identify the dataset in the application.

Usage

The code snippet below shows how to read data from a trace.jsonl file into a TraceDataset and then pass the dataset to Phoenix through launch_app. Each line of the trace.jsonl file is a JSON string representing a span.

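import phoenix as px
from phoenix.trace.utils import json_lines_to_df

# Read one JSON-encoded span per line and build the trace dataset.
with open("trace.jsonl", "r") as f:
    trace_ds = px.TraceDataset(json_lines_to_df(f.readlines()))

# Pass the trace dataset to the Phoenix application.
px.launch_app(trace=trace_ds)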
