Corpus Data

How to create Phoenix datasets and schemas for the corpus data

In Information Retrieval, a document is any piece of information the user may want to retrieve, e.g. a paragraph, an article, or a Web page, and a collection of documents is referred to as the corpus. A corpus can provide the knowledge base (of proprietary data) for supplementing a user query in the prompt context to a Large Language Model (LLM) in the Retrieval-Augmented Generation (RAG) use case. Relevant documents are first retrieved based on the user query and its embedding, then the retrieved documents are combined with the query to construct an augmented prompt for the LLM to provide a more accurate response incorporating information from the knowledge base. A corpus dataset can be imported into Phoenix as shown below.
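The retrieval-then-augmentation flow described above can be sketched in a few lines. The corpus entries, embeddings, and similarity function below are toy placeholders chosen so the example is self-contained; in practice the embeddings come from an embedding model and retrieval is handled by a vector store.

```python
import math

# Toy corpus: each document pairs raw text with a (hand-picked) embedding
# vector. Real embeddings would come from an embedding model.
corpus = [
    {"id": 0, "text": "Phoenix is a city in Arizona.", "embedding": [1.0, 0.0]},
    {"id": 1, "text": "An LLM is a large language model.", "embedding": [0.0, 1.0]},
]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query_embedding, k=1):
    # Rank documents by similarity to the query embedding; keep the top k.
    ranked = sorted(
        corpus,
        key=lambda d: cosine_similarity(query_embedding, d["embedding"]),
        reverse=True,
    )
    return ranked[:k]

def augment_prompt(query, query_embedding):
    # Combine the retrieved documents with the query into an augmented prompt.
    context = "\n".join(d["text"] for d in retrieve(query_embedding))
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = augment_prompt("What is an LLM?", [0.1, 0.9])
```

The LLM then answers from the augmented prompt, grounding its response in the retrieved documents rather than in its training data alone.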

Dataframe

Below is an example dataframe containing Wikipedia articles along with their embedding vectors.
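A dataframe of this shape might be constructed as follows. The article texts and two-dimensional vectors here are illustrative placeholders; real embeddings from a model would typically have hundreds of dimensions.

```python
import pandas as pd

# Hypothetical corpus dataframe: one row per document, with a unique id,
# the raw article text, and its embedding vector.
corpus_dataframe = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "text": [
            "Voyager 2 is a space probe launched by NASA in 1977.",
            "The Hubble Space Telescope was launched in 1990.",
            "The ISS is a modular space station in low Earth orbit.",
        ],
        "embedding": [[0.1, 0.9], [0.8, 0.2], [0.4, 0.5]],
    }
)
```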

Schema

Below is an appropriate schema for the dataframe above. It specifies that the id column identifies each document and that the vectors in the embedding column were computed from the raw text in the text column. Other columns, if they exist, are detected automatically and need not be specified in the schema.

import phoenix as px
from phoenix import EmbeddingColumnNames

corpus_schema = px.Schema(
    id_column_name="id",
    document_column_names=EmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="text",
    ),
)

Dataset

Define the dataset by pairing the dataframe with the schema.

corpus_dataset = px.Dataset(corpus_dataframe, corpus_schema)

Application

The application launcher accepts the corpus dataset through the corpus parameter.

session = px.launch_app(production_dataset, corpus=corpus_dataset)
