Corpus Data
How to create Phoenix datasets and schemas for the corpus data
In information retrieval, a document is any piece of information the user may want to retrieve, e.g. a paragraph, an article, or a web page, and a collection of documents is referred to as the corpus. In the Retrieval-Augmented Generation (RAG) use case, a corpus can serve as the knowledge base (for example, of proprietary data) that supplements a user query in the prompt context given to a Large Language Model (LLM). Relevant documents are first retrieved based on the user query and its embedding; the retrieved documents are then combined with the query to construct an augmented prompt, so that the LLM can give a more accurate response that incorporates information from the knowledge base. The sections below show how a corpus dataset can be imported into Phoenix.
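The retrieve-then-augment flow described above can be sketched in plain Python. This is an illustration only, not how Phoenix or any particular retriever works: the three-dimensional vectors are made-up toy embeddings, and `cosine_similarity`, `retrieve`, and `build_prompt` are hypothetical helper names.

```python
import math

# Toy corpus: each document has an id, text, and a made-up 3-d embedding.
corpus = [
    {"id": 1, "text": "Voyager 2 is a spacecraft used by NASA.", "embedding": [0.9, 0.1, 0.0]},
    {"id": 2, "text": "The Saturn Nebula is a planetary nebula.", "embedding": [0.1, 0.9, 0.1]},
    {"id": 3, "text": "Eris is a dwarf planet.", "embedding": [0.0, 0.2, 0.9]},
]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, documents, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda d: cosine_similarity(query_embedding, d["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, retrieved):
    """Combine the retrieved documents with the query into one augmented prompt."""
    context = "\n".join(f"- {d['text']}" for d in retrieved)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


query = "Which spacecraft does NASA use?"
query_embedding = [0.8, 0.2, 0.1]  # made-up embedding for the query
top_docs = retrieve(query_embedding, corpus, k=2)
prompt = build_prompt(query, top_docs)
print(prompt)
```

In a real system the embeddings would come from an embedding model and the search would typically run against a vector index rather than a Python list.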
Dataframe
Below is an example dataframe containing Wikipedia articles, each with its embedding vector.
id | text | embedding |
---|---|---|
1 | Voyager 2 is a spacecraft used by NASA to expl... | [-0.02785328, -0.04709944, 0.042922903, 0.0559... |
2 | The Saturn Nebula is a planetary nebula in th... | [0.03544901, 0.039175965, 0.014074919, -0.0307... |
3 | Eris is a dwarf planet and a trans-Neptunian o... | [0.05506449, 0.0031612846, -0.020452883, -0.02... |
Schema
An appropriate schema for the dataframe above specifies the `id` column and that `embedding` belongs to `text`. Other columns, if they exist, are detected automatically and need not be specified in the schema.
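A sketch of such a schema follows. It assumes the `arize-phoenix` package is installed and exposes `Schema` and `EmbeddingColumnNames` with the keyword arguments shown; check these names against the version you have installed. The wrapper `make_corpus_schema` is a hypothetical helper, and the import is deferred so that defining it does not itself require Phoenix.

```python
def make_corpus_schema():
    """Build a schema for a corpus dataframe with `id`, `text`, and `embedding` columns."""
    # Assumption: the `phoenix` module provides Schema and EmbeddingColumnNames.
    import phoenix as px

    return px.Schema(
        # Column holding the unique document identifier.
        id_column_name="id",
        # Tie the embedding vectors to the raw text they were computed from.
        document_column_names=px.EmbeddingColumnNames(
            vector_column_name="embedding",
            raw_data_column_name="text",
        ),
    )
```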
Dataset
Define the dataset by pairing the dataframe with the schema.
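The pairing might look like the sketch below. It assumes the installed Phoenix version exposes a `Dataset` class that takes the dataframe and the schema as its first two arguments; some releases may use a different class name for this object, so treat the call as an assumption. `make_corpus_dataset` is a hypothetical helper name.

```python
def make_corpus_dataset(corpus_dataframe, corpus_schema):
    """Pair a corpus dataframe with its schema to form a Phoenix dataset."""
    # Assumption: `phoenix.Dataset(dataframe, schema)` exists in the installed version.
    import phoenix as px

    return px.Dataset(corpus_dataframe, corpus_schema)
```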
Application
The application launcher accepts the corpus dataset through the `corpus=` parameter.
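A minimal sketch of the launch step, assuming the launcher is `phoenix.launch_app` and that it takes a primary dataset as its first argument alongside the `corpus=` keyword. The names `launch_with_corpus` and `primary_dataset` are hypothetical; the primary dataset would typically hold the query records that are analyzed against the corpus.

```python
def launch_with_corpus(primary_dataset, corpus_dataset):
    """Start the Phoenix app with a primary dataset and a corpus dataset."""
    # Assumption: `phoenix.launch_app` accepts the corpus via the `corpus=` keyword.
    import phoenix as px

    return px.launch_app(primary_dataset, corpus=corpus_dataset)
```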