Reference (citation) Link

Reference Links in Retrieval Q&A

In chatbots and Q&A systems, many times reference links are provided in the response, along with an answer, to help point users to documentation or pages that contain more information or the source for the answer.

EXAMPLE: Q&A from Arize-Phoenix Documentation

QUESTION: What other models does Arize Phoenix support beyond OpenAI for running Evals?

ANSWER: Phoenix does support a large set of LLM models through the model object. Phoenix supports OpenAI (GPT-4, GPT-4-32k, GPT-3.5 Turbo, GPT-3.5 Instruct, etc...), Azure OpenAI, Google Palm2 Text Bison, and All AWS Bedrock models (Claude, Mistral, etc...).

REFERENCE LINK:

This Eval checks the reference link returned answers the question asked in a conversation

We are continually iterating our templates, view the most up-to-date template .

print(REF_LINK_EVAL_PROMPT_TEMPLATE_STR)

You are given a conversation that contains questions by a CUSTOMER and you are trying
to determine if the documentation page shared by the ASSISTANT correctly answers
the CUSTOMERS questions. We will give you the conversation between the customer
and the ASSISTANT and the text of the documentation returned:
    [CONVERSATION AND QUESTION]:
    {conversation}
    ************
    [DOCUMENTATION URL TEXT]:
    {document_text}
    [DOCUMENTATION URL TEXT]:
You should respond "correct" if the documentation text answers the question the
CUSTOMER had in the conversation. If the documentation roughly answers the question
even in a general way the please answer "correct". If there are multiple questions and a single
question is answered, please still answer "correct". If the text does not answer the
question in the conversation, or doesn't contain information that would allow you
to answer the specific question please answer "incorrect".

How to run the Citation Eval

from phoenix.evals import (
    REF_LINK_EVAL_PROMPT_RAILS_MAP,
    REF_LINK_EVAL_PROMPT_TEMPLATE_STR,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

#The rails is used to hold the output to specific values based on the template
#It will remove text such as ",,," or "..."
#Will ensure the binary value expected from the template is returned
rails = list(REF_LINK_EVAL_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df,
    template=REF_LINK_EVAL_PROMPT_TEMPLATE_STR,
    model=model,
    rails=rails,
    provide_explanation=True, #optional to generate explanations for the value produced by the eval LLM
)

Benchmark Results

Each example in the dataset was evaluating using the REF_LINK_EVAL_PROMPT_TEMPLATE_STR above, then the resulting labels were compared against the ground truth label in the benchmark dataset to generate the confusion matrices below.

GPT-4 Results

Reference Link Evals

GPT-4o

Precision

0.96

Recall

0.79

0.87

PreviousAI vs Human (Groundtruth)NextUser Frustration

Last updated 1 month ago

Was this helpful?