Corpus Data

How to create Phoenix inferences and schemas for the corpus data

In information retrieval, a document is any piece of information the user may want to retrieve, e.g. a paragraph, an article, or a Web page, and a collection of documents is referred to as the corpus. In the Retrieval-Augmented Generation (RAG) use case, a corpus can serve as the knowledge base (of proprietary data) that supplements a user query in the prompt context sent to a Large Language Model (LLM). Relevant documents are first retrieved based on the user query and its embedding. The retrieved documents are then combined with the query to construct an augmented prompt, so that the LLM can give a more accurate response that incorporates information from the knowledge base. Corpus inferences can be imported into Phoenix as shown below.
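The retrieve-then-augment flow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how Phoenix or any particular retriever works: the helper names (`cosine_similarity`, `retrieve`, `build_augmented_prompt`), the toy documents, and the 3-dimensional embeddings are all made up for the example, and cosine similarity is just one common choice of relevance score.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, corpus, k=2):
    """Return the k corpus documents most similar to the query embedding."""
    ranked = sorted(
        corpus,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_augmented_prompt(query, documents):
    """Combine the retrieved document texts with the user query."""
    context = "\n".join(f"- {doc['text']}" for doc in documents)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical toy corpus; real embeddings have hundreds of dimensions.
toy_corpus = [
    {"id": 1, "text": "Toy document about a spacecraft.", "embedding": [0.9, 0.1, 0.0]},
    {"id": 2, "text": "Toy document about a nebula.", "embedding": [0.1, 0.9, 0.0]},
    {"id": 3, "text": "Toy document about a dwarf planet.", "embedding": [0.0, 0.2, 0.9]},
]

query = "Which spacecraft explored the outer planets?"
query_embedding = [0.8, 0.2, 0.1]  # made-up embedding of the query

top_docs = retrieve(query_embedding, toy_corpus, k=2)
prompt = build_augmented_prompt(query, top_docs)
```

In a real application the query embedding would come from the same embedding model as the corpus, and the resulting prompt would be sent to the LLM.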

Inferences

Below is an example dataframe containing Wikipedia articles along with their embedding vectors.

id   text                                                 embedding
1    Voyager 2 is a spacecraft used by NASA to expl...    [-0.02785328, -0.04709944, 0.042922903, 0.0559...
2    The Saturn Nebula is a planetary nebula in th...     [0.03544901, 0.039175965, 0.014074919, -0.0307...
3    Eris is a dwarf planet and a trans-Neptunian o...    [0.05506449, 0.0031612846, -0.020452883, -0.02...
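Before building such a dataframe, it can help to sanity-check the rows. The following is a minimal sketch, assuming that every document should have a unique id and that all embeddings come from the same model and therefore share one length. The helper `check_corpus_rows` and the sample rows are hypothetical and not part of Phoenix.

```python
def check_corpus_rows(rows):
    """Return the shared embedding length, or raise ValueError on bad rows."""
    seen_ids = set()
    dimension = None
    for row in rows:
        # Each document id is expected to appear only once.
        if row["id"] in seen_ids:
            raise ValueError(f"duplicate id: {row['id']}")
        seen_ids.add(row["id"])
        # All embeddings are assumed to have the same number of dimensions.
        length = len(row["embedding"])
        if dimension is None:
            dimension = length
        elif length != dimension:
            raise ValueError(
                f"row {row['id']} has embedding length {length}, expected {dimension}"
            )
    return dimension


# Made-up rows with tiny embeddings, for illustration only.
rows = [
    {"id": 1, "text": "first toy document", "embedding": [0.1, 0.2, 0.3]},
    {"id": 2, "text": "second toy document", "embedding": [0.4, 0.5, 0.6]},
]
embedding_dimension = check_corpus_rows(rows)
```

Rows in this shape (a list of dicts with the same keys) can then be passed to `pandas.DataFrame(rows)` to obtain a dataframe with `id`, `text`, and `embedding` columns.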

Schema

Below is an appropriate schema for the dataframe above. It specifies the id column and declares that the embedding column belongs to the text column. Other columns, if any, are detected automatically and need not be specified in the schema.

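import phoenix as px

corpus_schema = px.Schema(
    id_column_name="id",  # column holding each document's unique id
    document_column_names=px.EmbeddingColumnNames(
        vector_column_name="embedding",  # embedding vector of the document
        raw_data_column_name="text",  # raw text the embedding was computed from
    ),
)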

Inferences

Define the inferences by pairing the dataframe with the schema.

corpus_inferences = px.Inferences(corpus_dataframe, corpus_schema)

Application

The application launcher accepts the corpus inferences through the corpus= parameter.

session = px.launch_app(production_dataset, corpus=corpus_inferences)
