Corpus Data

How to create Phoenix datasets and schemas for the corpus data

In Information Retrieval, a document is any piece of information the user may want to retrieve, e.g. a paragraph, an article, or a Web page, and a collection of documents is referred to as the corpus. A corpus can provide the knowledge base (of proprietary data) for supplementing a user query in the prompt context to a Large Language Model (LLM) in the Retrieval-Augmented Generation (RAG) use case. Relevant documents are first retrieved based on the user query and its embedding, then the retrieved documents are combined with the query to construct an augmented prompt for the LLM to provide a more accurate response incorporating information from the knowledge base. A corpus dataset can be imported into Phoenix as shown below.
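The retrieval-then-augmentation flow described above can be sketched in a few lines. The corpus entries, embeddings, and similarity function below are toy placeholders chosen so the example is self-contained; in practice the embeddings come from an embedding model and retrieval is handled by a vector store.

```python
import math

# Toy corpus: each document pairs raw text with a (hand-picked) embedding
# vector. Real embeddings would come from an embedding model.
corpus = [
    {"id": 0, "text": "Phoenix is a city in Arizona.", "embedding": [1.0, 0.0]},
    {"id": 1, "text": "An LLM is a large language model.", "embedding": [0.0, 1.0]},
]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query_embedding, k=1):
    # Rank documents by similarity to the query embedding; keep the top k.
    ranked = sorted(
        corpus,
        key=lambda d: cosine_similarity(query_embedding, d["embedding"]),
        reverse=True,
    )
    return ranked[:k]

def augment_prompt(query, query_embedding):
    # Combine the retrieved documents with the query into an augmented prompt.
    context = "\n".join(d["text"] for d in retrieve(query_embedding))
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = augment_prompt("What is an LLM?", [0.1, 0.9])
```

The LLM then answers from the augmented prompt, grounding its response in the retrieved documents rather than in its training data alone.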

Dataframe

Below is an example dataframe containing Wikipedia articles along with their embedding vectors.
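A dataframe of this shape might be constructed as follows. The article texts and two-dimensional vectors here are illustrative placeholders; real embeddings from a model would typically have hundreds of dimensions.

```python
import pandas as pd

# Hypothetical corpus dataframe: one row per document, with a unique id,
# the raw article text, and its embedding vector.
corpus_dataframe = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "text": [
            "Voyager 2 is a space probe launched by NASA in 1977.",
            "The Hubble Space Telescope was launched in 1990.",
            "The ISS is a modular space station in low Earth orbit.",
        ],
        "embedding": [[0.1, 0.9], [0.8, 0.2], [0.4, 0.5]],
    }
)
```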

Schema

Below is an appropriate schema for the dataframe above. It specifies that the id column identifies each document and that the vectors in the embedding column were computed from the raw text in the text column. Other columns, if they exist, are detected automatically and need not be specified in the schema.

import phoenix as px
from phoenix import EmbeddingColumnNames

corpus_schema = px.Schema(
    id_column_name="id",
    document_column_names=EmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="text",
    ),
)

Dataset

Define the dataset by pairing the dataframe with the schema.

corpus_dataset = px.Dataset(corpus_dataframe, corpus_schema)

Application

The application launcher accepts the corpus dataset through the corpus parameter.

session = px.launch_app(production_dataset, corpus=corpus_dataset)
