Dataset and Schema
Detailed descriptions of classes and methods related to Phoenix datasets and schemas
phoenix.Dataset
A dataset containing a split or cohort of data to be analyzed independently or compared to another cohort. Common examples include training, validation, test, or production datasets.
Parameters
dataframe (pandas.DataFrame): The data to be analyzed or compared.
schema (Schema): A schema that assigns the columns of the dataframe to the appropriate model dimensions (features, predictions, actuals, etc.).
name (Optional[str]): The name used to identify the dataset in the application. If not provided, a random name will be generated.
Attributes
dataframe (pandas.DataFrame): The pandas dataframe of the dataset.
schema (Schema): The schema of the dataset.
name (str): The name of the dataset.
The input dataframe and schema are lightly processed during dataset initialization and are not necessarily identical to the corresponding dataframe and schema attributes.
Usage
Define a dataset ds from a pandas dataframe df and a schema object schema by passing both to phoenix.Dataset.
Optionally, provide a name argument to set the dataset name that will appear in the application.
ds is then passed as the primary or reference argument to launch_app.
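The usage steps above can be sketched as follows. This is a minimal sketch, assuming the Phoenix version documented on this page, where Dataset and launch_app are available at the top level of the phoenix package. The helper name launch_with_dataset is hypothetical, and phoenix is imported inside the function so the snippet can be loaded without the package installed.

```python
def launch_with_dataset(df, schema, name=None):
    """Wrap a dataframe and schema in a Phoenix dataset and open it in the app.

    `launch_with_dataset` is a hypothetical helper, not part of Phoenix.
    """
    # Assumption: Dataset and launch_app live at the package top level,
    # as in the Phoenix version this page documents.
    import phoenix as px

    # name is optional; when omitted, Phoenix generates a random name.
    ds = px.Dataset(dataframe=df, schema=schema, name=name)

    # Pass the dataset as the primary argument to launch_app.
    return px.launch_app(primary=ds)
```

A second dataset built the same way could be supplied as the reference argument when two cohorts are to be compared.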
phoenix.Schema
Assigns the columns of a pandas dataframe to the appropriate model dimensions (predictions, actuals, features, etc.). Each column of the dataframe should appear in the corresponding schema at most once.
Parameters
prediction_id_column_name (Optional[str]): The name of the dataframe's prediction ID column, if one exists. Prediction IDs are strings that uniquely identify each record in a Phoenix dataset (equivalently, each row in the dataframe). If no prediction ID column name is provided, Phoenix will automatically generate unique UUIDs for each record of the dataset upon Dataset initialization.
timestamp_column_name (Optional[str]): The name of the dataframe's timestamp column, if one exists. Timestamp columns must be pandas Series with numeric, datetime or object dtypes.
If the timestamp column has numeric dtype (int or float), the entries of the column are interpreted as Unix timestamps, i.e., the number of seconds since midnight UTC on January 1st, 1970.
If the column has datetime dtype and contains timezone-naive timestamps, Phoenix assumes those timestamps belong to the local timezone and converts them to UTC.
If the column has datetime dtype and contains timezone-aware timestamps, those timestamps are converted to UTC.
If the column has object dtype and contains ISO 8601 formatted timestamp strings, those entries are converted to datetime dtype UTC timestamps; timezone-naive strings are assumed to belong to the local timezone.
If no timestamp column is provided, each record in the dataset is assigned the current timestamp upon Dataset initialization.
feature_column_names (Optional[List[str]]): The names of the dataframe's feature columns, if any exist. If no feature column names are provided, all dataframe column names that are not included elsewhere in the schema and are not explicitly excluded in excluded_column_names are assumed to be features.
tag_column_names (Optional[List[str]]): The names of the dataframe's tag columns, if any exist. Tags, like features, are attributes that can be used for filtering records of the dataset while using the app. Unlike features, tags are not model inputs and are not used for computing metrics.
prediction_label_column_name (Optional[str]): The name of the dataframe's predicted label column, if one exists. Predicted labels are used for classification problems with categorical model output.
prediction_score_column_name (Optional[str]): The name of the dataframe's predicted score column, if one exists. Predicted scores are used for regression problems with continuous numerical model output.
actual_label_column_name (Optional[str]): The name of the dataframe's actual label column, if one exists. Actual (i.e., ground truth) labels are used for classification problems with categorical model output.
actual_score_column_name (Optional[str]): The name of the dataframe's actual score column, if one exists. Actual (i.e., ground truth) scores are used for regression problems with continuous numerical output.
prompt_column_names (Optional[EmbeddingColumnNames]): An instance of EmbeddingColumnNames delineating the column names of an LLM model's prompt embedding vector, prompt text, and optionally links to external resources.
response_column_names (Optional[EmbeddingColumnNames]): An instance of EmbeddingColumnNames delineating the column names of an LLM model's response embedding vector, response text, and optionally links to external resources.
embedding_feature_column_names (Optional[Dict[str, EmbeddingColumnNames]]): A dictionary mapping the name of each embedding feature to an instance of EmbeddingColumnNames if any embedding features exist, otherwise, None. Each instance of EmbeddingColumnNames associates one or more dataframe columns containing vector data, image links, or text with the same embedding feature. Note that the keys of the dictionary are user-specified names that appear in the Phoenix UI and do not refer to columns of the dataframe.
excluded_column_names (Optional[List[str]]): The names of the dataframe columns to be excluded from the implicitly inferred list of feature column names. This field should only be used for implicit feature discovery, i.e., when feature_column_names is unused and the dataframe contains feature columns not explicitly included in the schema.
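The implicit feature discovery rule described for feature_column_names and excluded_column_names amounts to a set difference. The sketch below only illustrates that rule in plain Python; it is not Phoenix's implementation, and the function name infer_feature_columns and the sample column names are hypothetical.

```python
def infer_feature_columns(dataframe_columns, assigned_columns, excluded_column_names=()):
    """Return the columns treated as features when feature_column_names is unset.

    Hypothetical helper illustrating the documented rule, not Phoenix code.
    """
    # Columns named elsewhere in the schema or explicitly excluded are not features.
    taken = set(assigned_columns) | set(excluded_column_names)
    # Preserve the dataframe's column order.
    return [col for col in dataframe_columns if col not in taken]


columns = ["prediction_id", "timestamp", "age", "income", "notes", "predicted", "actual"]
assigned = ["prediction_id", "timestamp", "predicted", "actual"]  # named elsewhere in the schema
features = infer_feature_columns(columns, assigned, excluded_column_names=["notes"])
print(features)  # ['age', 'income']
```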
Usage
See the guide on how to create your own dataset for examples.
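The conversion rules listed under timestamp_column_name can be illustrated on single values with the standard library. This is a sketch of the documented behavior under stated assumptions, not Phoenix's own code: the function name to_utc is hypothetical, and Phoenix operates on whole pandas columns rather than individual entries.

```python
from datetime import datetime, timezone


def to_utc(value):
    """Convert one timestamp entry to a timezone-aware UTC datetime."""
    if isinstance(value, (int, float)):
        # Numeric entries are Unix timestamps: seconds since 1970-01-01 00:00 UTC.
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if isinstance(value, str):
        # String entries are parsed as ISO 8601.
        value = datetime.fromisoformat(value)
    # For naive datetimes, astimezone() assumes the system's local timezone;
    # aware datetimes are simply converted.
    return value.astimezone(timezone.utc)


print(to_utc(0))                            # 1970-01-01 00:00:00+00:00
print(to_utc("2023-01-01T12:00:00+02:00"))  # 2023-01-01 10:00:00+00:00
print(to_utc(datetime(2023, 1, 1, 12, 0)))  # result depends on the local timezone
```

Because naive values are resolved against the local timezone, the same input can yield different UTC timestamps on machines in different timezones; supplying timezone-aware timestamps avoids that ambiguity.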
phoenix.EmbeddingColumnNames
A dataclass that associates one or more columns of a dataframe with an embedding feature. Instances of this class are used as values in the dictionary passed to the embedding_feature_column_names field of Schema, and as the values of the prompt_column_names and response_column_names fields.
Parameters
vector_column_name (str): The name of the dataframe column containing the embedding vector data. Each entry in the column must be a list, one-dimensional NumPy array, or pandas Series containing numeric values (floats or ints) and must have the same length as every other entry in the column.
raw_data_column_name (Optional[str]): The name of the dataframe column containing the raw text associated with an embedding feature, if such a column exists. This field is used when an embedding feature describes a piece of text, for example, in the context of NLP.
link_to_data_column_name (Optional[str]): The name of the dataframe column containing links to images associated with an embedding feature, if such a column exists. This field is used when an embedding feature describes an image, for example, in the context of computer vision.
See here for recommendations on handling local image files.
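The constraints on vector_column_name entries (numeric values, equal length across the column) can be checked before building a schema. The following is a minimal sketch using plain Python lists; the function name check_embedding_column is hypothetical and the check is not part of Phoenix.

```python
from numbers import Real


def check_embedding_column(vectors):
    """Return the shared vector length, or raise ValueError if the column is invalid."""
    lengths = set()
    for i, vec in enumerate(vectors):
        # Every element of every vector must be numeric.
        if not all(isinstance(x, Real) for x in vec):
            raise ValueError(f"entry {i} contains a non-numeric value")
        lengths.add(len(vec))
    # All vectors in the column must have the same length.
    if len(lengths) > 1:
        raise ValueError(f"entries have differing lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


print(check_embedding_column([[0.1, 0.2, 0.3], [1, 2, 3]]))  # 3
```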
Usage
See the guide on how to create embedding features for examples.
phoenix.TraceDataset
Wraps a dataframe that is a flattened representation of spans and traces. Note that it does not require a Schema. See LLM Traces for how to monitor your LLM application using traces. Because Phoenix can also receive traces from your LLM application directly in real time, TraceDataset is mostly used for loading trace data that has been previously saved to file.
Parameters
dataframe (pandas.DataFrame): A dataframe in which each row is a flattened representation of a span. See LLM Traces for more on traces and spans.
name (Optional[str]): The name used to identify the dataset in the application. If not provided, a random name will be generated.
Attributes
dataframe (pandas.DataFrame): A dataframe in which each row is a flattened representation of a span. See LLM Traces for more on traces and spans.
name (str): The name used to identify the dataset in the application.
Usage
To load previously saved traces, read the data from a trace.jsonl file into a TraceDataset, and then pass the dataset to Phoenix through launch_app. Each line of the trace.jsonl file is a JSON string representing a span.
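The first step, reading one JSON document per line, can be sketched with the standard library alone. This is an illustrative sketch, not the snippet from the Phoenix docs: the function name read_span_records and the span field names are hypothetical, and the columns that TraceDataset actually expects are not covered here.

```python
import json
import os
import tempfile


def read_span_records(path):
    """Parse a JSON Lines file into a list of dicts, one per span."""
    with open(path, "r", encoding="utf-8") as f:
        # Skip blank lines; every other line is a standalone JSON document.
        return [json.loads(line) for line in f if line.strip()]


# Write a tiny stand-in file; the field names are made up for illustration.
sample_spans = [
    {"name": "llm_call", "span_kind": "LLM"},
    {"name": "retriever", "span_kind": "RETRIEVER"},
]
path = os.path.join(tempfile.mkdtemp(), "trace.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for span in sample_spans:
        f.write(json.dumps(span) + "\n")

records = read_span_records(path)
print(len(records))  # 2

# With pandas and Phoenix installed, the records would next be turned into a
# dataframe, wrapped in a TraceDataset, and passed to launch_app.
```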