Quickstart: Retrieval

Debug your Search and Retrieval LLM workflows

This quickstart shows how to start logging your retrievals from your vector datastore to Phoenix and run evaluations.

Notebooks

Follow our tutorial in a notebook with our Langchain and LlamaIndex integrations

Framework

Phoenix Inferences

Phoenix Traces & Spans

LangChain

LlamaIndex

Logging Retrievals to Phoenix (as Inferences)

Step 1: Logging Knowledge Base

The first thing we need is to collect some sample from your vector store, to be able to compare against later. This is to able to see if some sections are not being retrieved, or some sections are getting a lot of traffic where you might want to beef up your context or documents in that area.

text

embedding

Voyager 2 is a spacecraft used by NASA to expl...

[-0.02785328, -0.04709944, 0.042922903, 0.0559...

corpus_schema = px.Schema(
    id_column_name="id",
    document_column_names=EmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="text",
    ),
)

Step 2: Logging Retrieval and Response

We also will be logging the prompt/response pairs from the deployed application.

query

embedding

retrieved_document_ids

relevance_scores

response

who was the first person that walked on the moon

[-0.0126, 0.0039, 0.0217, ...

[7395, 567965, 323794, ...

[11.30, 7.67, 5.85, ...

Neil Armstrong

primary_schema = Schema(
    prediction_id_column_name="id",
    prompt_column_names=RetrievalEmbeddingColumnNames(
        vector_column_name="embedding",
        raw_data_column_name="query",
        context_retrieval_ids_column_name="retrieved_document_ids",
        context_retrieval_scores_column_name="relevance_scores",
    )
    response_column_names="response",
)

Running Evaluations on your Retrievals

In order to run retrieval Evals the following code can be used for quick analysis of common frameworks of LangChain and LlamaIndex.

# Get traces from Phoenix into dataframe 

spans_df = px.active_session().get_spans_dataframe()
spans_df[["name", "span_kind", "attributes.input.value", "attributes.retrieval.documents"]].head()

from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents

retrieved_documents_df = get_retrieved_documents(px.active_session())
queries_df = get_qa_with_reference(px.active_session())

Once the data is in a dataframe, evaluations can be run on the data. Evaluations can be run on on different spans of data. In the below example we run on the top level spans that represent a single trace.

Q&A and Hallucination Evals

from phoenix.trace import SpanEvaluations, DocumentEvaluations
from phoenix.evals import (
  HALLUCINATION_PROMPT_RAILS_MAP,
  HALLUCINATION_PROMPT_TEMPLATE,
  QA_PROMPT_RAILS_MAP,
  QA_PROMPT_TEMPLATE,
  OpenAIModel,
  llm_classify,
)

# Creating Hallucination Eval which checks if the application hallucinated
hallucination_eval = llm_classify(
  dataframe=queries_df,
  model=OpenAIModel("gpt-4-turbo-preview", temperature=0.0),
  template=HALLUCINATION_PROMPT_TEMPLATE,
  rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
  provide_explanation=True,  # Makes the LLM explain its reasoning
  concurrency=4,
)
hallucination_eval["score"] = (
  hallucination_eval.label[~hallucination_eval.label.isna()] == "factual"
).astype(int)

# Creating Q&A Eval which checks if the application answered the question correctly
qa_correctness_eval = llm_classify(
  dataframe=queries_df,
  model=OpenAIModel("gpt-4-turbo-preview", temperature=0.0),
  template=QA_PROMPT_TEMPLATE,
  rails=list(QA_PROMPT_RAILS_MAP.values()),
  provide_explanation=True,  # Makes the LLM explain its reasoning
  concurrency=4,
)

qa_correctness_eval["score"] = (
  hallucination_eval.label[~qa_correctness_eval.label.isna()] == "correct"
).astype(int)

# Logs the Evaluations back to the Phoenix User Interface (Optional)
px.Client().log_evaluations(
  SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval),
  SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval),
)

The Evals are available in dataframe locally and can be materilazed back to the Phoenix UI, the Evals are attached to the referenced SpanIDs.

The snipit of code above links the Evals back to the spans they were generated against.

Retrieval Chunk Evals


from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

retrieved_documents_eval = llm_classify(
    dataframe=retrieved_documents_df,
    model=OpenAIModel("gpt-4-turbo-preview", temperature=0.0),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

retrieved_documents_eval["score"] = (
    retrieved_documents_eval.label[~retrieved_documents_eval.label.isna()] == "relevant"
).astype(int)

px.Client().log_evaluations(DocumentEvaluations(eval_name="Relevance", dataframe=retrieved_documents_eval))

The calculation is done using the LLM Eval on all chunks returned for the span and the log_evaluations connects the Evals back to the original spans.

PreviousOverview: Retrieval NextConcepts: Retrieval

Last updated 29 days ago

Was this helpful?