Reference (citation) Link

In chatbots and Q&A systems, responses often include a reference link alongside the answer, pointing users to documentation or pages that contain more information or the source of the answer.

EXAMPLE: Q&A from Arize-Phoenix Documentation

QUESTION: What other models does Arize Phoenix support beyond OpenAI for running Evals?

ANSWER: Phoenix does support a large set of LLM models through the model object. Phoenix supports OpenAI (GPT-4, GPT-4-32k, GPT-3.5 Turbo, GPT-3.5 Instruct, etc...), Azure OpenAI, Google Palm2 Text Bison, and All AWS Bedrock models (Claude, Mistral, etc...).

REFERENCE LINK: https://docs.arize.com/phoenix/api/evaluation-models

This Eval checks whether the reference link returned actually answers the question asked in the conversation.
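To run this eval you need a dataframe with a conversation column and a document_text column (the text of the page behind the reference link), matching the placeholders in the template below. The snippet that follows is a minimal sketch of how such a dataframe could be assembled for the example above; fetching the page with requests and stripping it to text with BeautifulSoup is an assumption for illustration, not part of Phoenix.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# The conversation that produced the answer, and the reference link it cited
conversation = (
    "CUSTOMER: What other models does Arize Phoenix support beyond OpenAI for running Evals?\n"
    "ASSISTANT: Phoenix supports OpenAI, Azure OpenAI, Google Palm2 Text Bison, "
    "and all AWS Bedrock models (Claude, Mistral, etc.)."
)
reference_link = "https://docs.arize.com/phoenix/api/evaluation-models"

# Fetch the linked documentation page and reduce it to plain text
html = requests.get(reference_link, timeout=10).text
document_text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

# One row per (conversation, reference link) pair to be judged by the eval
df = pd.DataFrame([{"conversation": conversation, "document_text": document_text}])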

We are continually iterating on our templates; view the most up-to-date template on GitHub.

print(REF_LINK_EVAL_PROMPT_TEMPLATE_STR)

You are given a conversation that contains questions by a CUSTOMER and you are trying
to determine if the documentation page shared by the ASSISTANT correctly answers
the CUSTOMERS questions. We will give you the conversation between the customer
and the ASSISTANT and the text of the documentation returned:
    [CONVERSATION AND QUESTION]:
    {conversation}
    ************
    [DOCUMENTATION URL TEXT]:
    {document_text}
    ************
You should respond "correct" if the documentation text answers the question the
CUSTOMER had in the conversation. If the documentation roughly answers the question
even in a general way then please answer "correct". If there are multiple questions and a single
question is answered, please still answer "correct". If the text does not answer the
question in the conversation, or doesn't contain information that would allow you
to answer the specific question, please answer "incorrect".
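
The template is a plain format string with {conversation} and {document_text} placeholders, so you can render it for a single row to inspect exactly what the eval LLM will see. A minimal sketch (the example values here are made up):

from phoenix.evals import REF_LINK_EVAL_PROMPT_TEMPLATE_STR

# Preview the fully rendered prompt for one conversation / document pair
preview = REF_LINK_EVAL_PROMPT_TEMPLATE_STR.format(
    conversation="CUSTOMER: What models does Phoenix support for running Evals?",
    document_text="Phoenix supports OpenAI, Azure OpenAI, Google Palm2, and AWS Bedrock models.",
)
print(preview)
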
from phoenix.evals import (
    REF_LINK_EVAL_PROMPT_RAILS_MAP,
    REF_LINK_EVAL_PROMPT_TEMPLATE_STR,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# The rails constrain the output to the specific values defined in the template.
# They strip extraneous text such as ",,," or "..."
# and ensure the binary label expected by the template is returned.
rails = list(REF_LINK_EVAL_PROMPT_RAILS_MAP.values())
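# rails are the two labels the template asks the model to return: "correct" / "incorrect"
# df must contain "conversation" and "document_text" columns; llm_classify fills the
# template placeholders from the dataframe columns with the same names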
relevance_classifications = llm_classify(
    dataframe=df,
    template=REF_LINK_EVAL_PROMPT_TEMPLATE_STR,
    model=model,
    rails=rails,
    provide_explanation=True,  # optional: generate an explanation for the label produced by the eval LLM
)
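
To benchmark the eval the way the table below does, the predicted labels can be compared against golden annotations. A minimal sketch, assuming a hypothetical true_label column of "correct"/"incorrect" ground truth on the same dataframe (not part of the Phoenix API):

from sklearn.metrics import classification_report

# Hypothetical golden labels aligned row-for-row with the eval input dataframe
y_true = df["true_label"]
# llm_classify returns a dataframe with a "label" column (and "explanation" when requested)
y_pred = relevance_classifications["label"]

# Per-label precision / recall / F1, comparable to the benchmark table below
print(classification_report(y_true, y_pred, labels=rails))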

Benchmark Results

(Per-model benchmark result plots: GPT-4, GPT-3.5, GPT-4 Turbo)

Reference Link Evals | GPT-4o | GPT-4 | GPT-4 Turbo | Gemini Pro | GPT-3.5 | Claude V2 | Palm 2
Precision            | 0.96   | 0.97  | 0.94        | 0.77       | 0.89    | 0.74      | 0.68
Recall               | 0.79   | 0.83  | 0.69        | 0.97       | 0.43    | 0.48      | 0.98
F1                   | 0.87   | 0.89  | 0.79        | 0.86       | 0.58    | 0.58      | 0.80
