RAG

This guide shows you how to create and evaluate a retrieval-augmented generation (RAG) application with Arize to improve performance.

In this example, we build a Q&A chatbot over the source material of a Paul Graham essay using LlamaIndex.

We'll cover:

  • Trace your RAG app
  • Evaluate your RAG app
  • Next steps

Trace your RAG app

To trace a RAG application, you can use arize-otel, our convenience package for setting up OTEL tracing, along with openinference auto-instrumentation, which maps LLM metadata to a standardized set of trace and span attributes. We have auto-instrumentation for many forms of RAG, including LlamaIndex and LangChain, as well as manual instrumentation.

Here's some sample code for logging all LlamaIndex calls to Arize. You'll need to sign up for Arize and get your API key (see guide).

# Import open-telemetry dependencies
from arize.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from getpass import getpass

# Setup OTEL
tracer_provider = register(
    space_id=getpass("Enter your space ID"),
    api_key=getpass("Enter your API key"),
    project_name="rag-cookbook",
)

# Turn on LlamaIndex Instrumentation
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
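
One detail the snippets below assume: LlamaIndex's OpenAI integration and the phoenix.evals OpenAIModel used later both read your key from the OPENAI_API_KEY environment variable, so set it before running the next cells. A minimal sketch:

import os
from getpass import getpass

# Set the OpenAI key if it isn't already in the environment; the LLM,
# embedding, and evaluation calls below all rely on it.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key")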

Let's create our LlamaIndex query engine (see the LlamaIndex guide). We'll download the Paul Graham essay and build a simple RAG application. There are a lot of abstractions at work here, but they let us set up a simple use case: querying a document for information.

!mkdir data
!wget "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt" -O data/paul_graham_essay.txt
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

# load Paul Graham Essay
documents = SimpleDirectoryReader("data").load_data()

# Create Vector Store Index
index = VectorStoreIndex.from_documents(documents)

# Turn it into a query engine using gpt-4o-mini
query_engine = index.as_query_engine(llm=OpenAI(model="gpt-4o-mini"))

# Create a query
response = query_engine.query("What did Paul Graham work on?")
print(response)

This will create a trace in Arize, where you can visualize the inputs, outputs, retrieved chunks, embeddings, and LLM prompt parameters.

Evaluate your RAG app

Once we have generated a set of test cases, we can create evaluators to measure performance. This way, we don't have to manually inspect every single trace to see if the LLM is doing the right thing.

In these two templates, we determine whether the retrieved reference text is relevant to the question, and whether the answer is correct given the reference text. Read more about Retrieval Evaluation.

RELEVANCE_EVAL_TEMPLATE = """You are comparing a reference text to a question and trying to determine if the reference text
contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference text]: {reference}
    [END DATA]

Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be a single word, either "relevant" or "unrelated",
and should not contain any text or characters aside from that word.
"unrelated" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question.
"""

CORRECTNESS_EVAL_TEMPLATE = """You are given a question, an answer and reference text. You must determine whether the
given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference]: {reference}
    ************
    [Answer]: {output}
    [END DATA]
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the answer.
"""

We can evaluate our outputs using Phoenix's llm_classify function. Below, response_df is a placeholder dataframe containing the input, reference, and output columns captured from your RAG app's execution.

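If you're running this guide end to end, here's one way you could assemble response_df from the query engine above; the questions list is illustrative, and in practice you would export these columns from your traces in Arize:

import pandas as pd

# Hypothetical helper: run each question through the query engine and
# collect the input, the retrieved reference text, and the final output.
questions = ["What did Paul Graham work on?"]
rows = []
for question in questions:
    response = query_engine.query(question)
    rows.append({
        "input": question,
        "reference": "\n".join(n.node.get_content() for n in response.source_nodes),
        "output": str(response),
    })
response_df = pd.DataFrame(rows)

With response_df in hand, run the evaluators:
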
from phoenix.evals import OpenAIModel, llm_classify

RELEVANCE_RAILS = ["relevant", "unrelated"]
CORRECTNESS_RAILS = ["incorrect", "correct"]

relevance_eval_df = llm_classify(
    dataframe=response_df,
    template=RELEVANCE_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=RELEVANCE_RAILS,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4,
)

correctness_eval_df = llm_classify(
    dataframe=response_df,
    template=CORRECTNESS_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=CORRECTNESS_RAILS,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4,
)

If you inspect each dataframe generated by llm_classify, you'll get a table of evaluation results. Below is a formatted example row from correctness_eval_df.

Input: What were the two main activities Paul Graham worked on before college?

Reference: Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines -- CPU, disk drives, printer, card reader -- sitting up on a raised floor under bright fluorescent lights.

Output: Before college, Paul Graham primarily worked on writing and programming. He focused on creating short stories and experimented with programming on the IBM 1401. His writing often lacked plot but featured characters with strong feelings.

Label: correct

Explanation: The answer correctly identifies the two main activities Paul Graham worked on before college, which were writing and programming. It also provides additional details about his writing and programming experiences, which are consistent with the reference text.

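As a quick sanity check before reading individual rows, you can look at the label distribution of each dataframe; llm_classify returns its verdicts in the label column (alongside explanation when provide_explanation=True):

# Count how many rows were judged relevant/unrelated and correct/incorrect.
print(relevance_eval_df["label"].value_counts())
print(correctness_eval_df["label"].value_counts())
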
Next steps

We covered simple examples of tracing and evaluating a basic RAG app that answers questions over your own data. As you build out more RAG capabilities, you'll need more advanced tooling to measure and improve performance.

You can do all of the following in Arize:

  • Manually create retriever spans, which log your retriever inputs and outputs.
  • Evaluate your RAG app across more dimensions, such as hallucinations, citations, and user frustration.
  • Create experiments to track changes across models, prompts, and parameters (see more info on experiments).

You can also follow our end-to-end Colab example, which covers all of the above with code, including changing your chunk size, chunk overlap, the number of items retrieved (k), and more.
