Trace and Evaluate RAG
This guide shows you how to create and evaluate a retrieval-augmented generation (RAG) application with Arize to improve performance.
In this example, we build a Q&A chatbot over the source material of a Paul Graham essay using LlamaIndex.
We offer auto-instrumentation for many RAG frameworks, including LlamaIndex and LangChain, as well as support for manual instrumentation.
To trace a RAG application, you can use arize-otel, our convenience package for setting up OpenTelemetry tracing, together with OpenInference auto-instrumentation, which maps LLM metadata to a standardized set of trace and span attributes.
Here's some sample code for logging all LlamaIndex calls to Arize. You'll need to sign up for Arize and get your API key (see guide).
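A minimal sketch of that setup is below. It assumes the arize-otel register helper and the OpenInference LlamaIndex instrumentor are installed; the Space ID, API key, and project name are placeholders you'd replace with your own values.

```python
# pip install arize-otel openinference-instrumentation-llama-index

from arize.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

# Register an OpenTelemetry tracer provider that exports spans to Arize.
# SPACE_ID and API_KEY come from your Arize account settings.
tracer_provider = register(
    space_id="YOUR_SPACE_ID",
    api_key="YOUR_API_KEY",
    project_name="paul-graham-rag",  # placeholder project name
)

# Auto-instrument all LlamaIndex calls so every query produces a trace.
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
```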
Let's create our LlamaIndex query engine (LlamaIndex guide here). We'll download the Paul Graham essay and create a simple RAG application. There are a lot of abstractions at work here, but they let us set up the simple use case of querying a document for information.
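Here's a sketch of that setup. It assumes the essay is fetched from the LlamaIndex examples repository (the URL is illustrative) and that an OpenAI API key is available for LlamaIndex's default embedding model and LLM.

```python
import os
import urllib.request

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Download the Paul Graham essay into a local data directory.
# The URL below is illustrative; point it at wherever you keep the essay.
os.makedirs("data", exist_ok=True)
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt",
    "data/paul_graham_essay.txt",
)

# Load the document, build a vector index, and expose it as a query engine.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query(
    "What were the two main activities Paul Graham worked on before college?"
)
print(response)
```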
This will create a trace in Arize, which visualizes inputs, outputs, retrieved chunks, embeddings, and LLM prompt parameters.
Once we have generated a set of test cases, we can create evaluators to measure performance. This way, we don't have to manually inspect every single trace to see if the LLM is doing the right thing.
In these two templates, we are determining whether the reference text retrieved is relevant to the question, and whether the answer is correct given the reference text.
Read more about Retrieval Evaluation.
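Phoenix ships prompt templates for both checks. As a sketch (assuming the phoenix.evals export names below), the relevance template classifies each retrieved chunk against the question, and the QA template classifies each answer against the reference text; the "rails" are the allowed output labels for each classifier.

```python
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
)

# Inspect the templates and their allowed output labels ("rails").
print(RAG_RELEVANCY_PROMPT_TEMPLATE)                   # fills in the question and reference text
print(list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()))   # e.g. ["relevant", "unrelated"]

print(QA_PROMPT_TEMPLATE)                              # fills in the question, reference, and answer
print(list(QA_PROMPT_RAILS_MAP.values()))              # e.g. ["correct", "incorrect"]
```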
We can evaluate our outputs using Phoenix's llm_classify function. We use a placeholder dataframe, response_df, which has the input, reference, and output columns tied to your RAG app's execution.
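Here's a sketch of both evaluation calls, assuming response_df has those three columns and an OpenAI model is used as the judge (the model name is a placeholder):

```python
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    llm_classify,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
)

# Placeholder: replace with the questions, retrieved chunks, and outputs of your RAG app.
response_df = pd.DataFrame(
    {
        "input": ["What were the two main activities Paul Graham worked on before college?"],
        "reference": ["Before college the two main things I worked on, outside of school, were writing and programming."],
        "output": ["Before college, Paul Graham primarily worked on writing and programming."],
    }
)

eval_model = OpenAIModel(model="gpt-4o")  # placeholder judge model

# Is each retrieved chunk relevant to the question?
relevance_eval_df = llm_classify(
    dataframe=response_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=eval_model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

# Is the answer correct given the reference text?
correctness_eval_df = llm_classify(
    dataframe=response_df,
    template=QA_PROMPT_TEMPLATE,
    model=eval_model,
    rails=list(QA_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
```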
If you inspect each dataframe generated by llm_classify, you'll get a table of evaluation results. Below is a formatted example for correctness_eval_df.
Input: What were the two main activities Paul Graham worked on before college?
Reference: Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
Output: Before college, Paul Graham primarily worked on writing and programming. He focused on creating short stories and experimented with programming on the IBM 1401. His writing often lacked plot but featured characters with strong feelings.
Label: Correct
Explanation: The answer correctly identifies the two main activities Paul Graham worked on before college, which were writing and programming. It also provides additional details about his writing and programming experiences, which are consistent with the reference text.
We covered simple examples of tracing and evaluating a RAG app that answers questions over your own data. As you build out more RAG capabilities, you'll need more advanced tooling to measure and improve performance.
You can do all of the following in Arize:
Manually create retriever spans which log your retriever inputs and outputs (see the sketch after this list).
Evaluate your RAG app across more dimensions, such as hallucinations, citation, and user frustration.
Create experiments to track changes across models, prompts, and parameters (more info on experiments).
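Here's a minimal sketch of a manually created retriever span. It assumes the OpenInference semantic conventions package and a tracer provider already registered with arize-otel; the retrieve_documents function and the flattened document attribute keys are illustrative and worth checking against your installed version of the spec.

```python
from opentelemetry import trace
from openinference.semconv.trace import (
    OpenInferenceSpanKindValues,
    SpanAttributes,
)

tracer = trace.get_tracer(__name__)

def retrieve_documents(query: str) -> list[str]:
    with tracer.start_as_current_span("retriever") as span:
        # Mark the span as a retriever span and record its input.
        span.set_attribute(
            SpanAttributes.OPENINFERENCE_SPAN_KIND,
            OpenInferenceSpanKindValues.RETRIEVER.value,
        )
        span.set_attribute(SpanAttributes.INPUT_VALUE, query)

        # Placeholder retrieval step: replace with your own vector store lookup.
        documents = [
            "Before college the two main things I worked on were writing and programming."
        ]

        # Record each retrieved chunk as a flattened document attribute,
        # e.g. retrieval.documents.0.document.content.
        for i, doc in enumerate(documents):
            span.set_attribute(f"retrieval.documents.{i}.document.content", doc)
    return documents
```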
You can also follow our end-to-end Colab example, which covers all of the above with code, including changing your chunk size, chunk overlap, number of items retrieved (k), and more.