Retrieval Evaluations
Troubleshoot Retrieval with Vector Stores
Vector stores let teams connect their own data to LLMs. A common application is a chatbot that searches a company's knowledge base for context to answer specific questions.
How Search and Retrieval Works
Here's an example of what retrieval looks like for a chatbot application: a user asks a specific question, an embedding is generated from the query, and the relevant context in the knowledge base is retrieved and added to the prompt sent to the LLM.
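The flow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular vector store works: the toy 3-dimensional vectors stand in for real embeddings, and the function names, the top-k cutoff, and the prompt wording are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vector, corpus, k=2):
    """Return the k corpus entries whose vectors are closest to the query."""
    scored = [(cosine_similarity(query_vector, doc["text_vector"]), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def build_prompt(question, documents):
    """Add the retrieved context to the prompt (hypothetical wording)."""
    context = "\n".join(f"- {doc['text']}" for doc in documents)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy knowledge base; a real system would store model-generated embeddings.
corpus = [
    {"document_id": "doc-1", "text": "How to reset your password", "text_vector": [0.9, 0.1, 0.0]},
    {"document_id": "doc-2", "text": "Billing and invoices", "text_vector": [0.0, 0.9, 0.1]},
    {"document_id": "doc-3", "text": "Changing account email", "text_vector": [0.7, 0.2, 0.1]},
]

query_vector = [1.0, 0.0, 0.0]  # stand-in for the embedded user question
top_docs = retrieve(query_vector, corpus, k=2)
prompt = build_prompt("How do I reset my password?", top_docs)
```

Both failure modes discussed in this guide show up in this sketch: `retrieve` always returns k documents even when none is close to the query, and closeness in embedding space does not guarantee that a document answers the question.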
Common Problems with Search and Retrieval Systems
When an application that uses retrieval-augmented generation (RAG) gives a poor response, there are several possible causes. The most common issues we see are:
There weren't enough documents to answer the question
The document retrieved wasn't good enough to answer the question
Arize helps evaluate how good retrieval is and identify where it went wrong.
Logging data to Arize for Search and Retrieval Tracing
Arize logs both a sample of the knowledge base and the production prompt/response pairs of the deployed application. Here's a high-level view of what is logged:
Step 1: Logging a Corpus Dataset (Knowledge Base)
First, collect the documents from your vector store so that you can compare them against production queries later. This comparison shows whether some sections are never retrieved, or whether some sections receive so much traffic that you may want to expand the context or documents in that area.
Example of Corpus Dataframe (Knowledge Base)
| document_id | text | text_vector |
|---|---|---|
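As a rough sketch of what such a corpus could look like before logging, the rows below use the three column names from the table. The document text and the short vectors are made-up placeholders, and `validate_corpus` is a hypothetical helper, not part of any Arize SDK. A list of records like this can be passed to `pandas.DataFrame` to obtain a dataframe.

```python
CORPUS_COLUMNS = ("document_id", "text", "text_vector")

def validate_corpus(rows):
    """Check columns, unique ids, and a single vector dimension; return the row count."""
    seen_ids = set()
    dimensions = set()
    for row in rows:
        missing = [col for col in CORPUS_COLUMNS if col not in row]
        if missing:
            raise ValueError(f"row is missing columns: {missing}")
        if row["document_id"] in seen_ids:
            raise ValueError(f"duplicate document_id: {row['document_id']}")
        seen_ids.add(row["document_id"])
        dimensions.add(len(row["text_vector"]))
    if len(dimensions) > 1:
        raise ValueError(f"inconsistent vector dimensions: {sorted(dimensions)}")
    return len(rows)

# Placeholder rows; real vectors come from the embedding model behind your vector store.
corpus_rows = [
    {"document_id": "doc-1", "text": "How to reset your password", "text_vector": [0.9, 0.1, 0.0]},
    {"document_id": "doc-2", "text": "Billing and invoices", "text_vector": [0.0, 0.9, 0.1]},
    {"document_id": "doc-3", "text": "Changing account email", "text_vector": [0.7, 0.2, 0.1]},
]

row_count = validate_corpus(corpus_rows)
```

Checking for duplicate ids and mismatched vector dimensions before logging helps avoid confusing comparisons later, since every downstream view assumes one embedding space.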
Step 2: Logging Production Prompt/Responses to Arize
Next, log the prompt/response pairs from the deployed application.
Example Dataframe: prompts-response.df
| prediction-ID | user-query | query-vector | document | document-vector | response | response vector | user feedback |
|---|---|---|---|---|---|---|---|
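Once both datasets exist, the comparison motivated in Step 1 (which documents are never retrieved, and which receive heavy traffic) can be approximated with a simple count. The sketch below assumes, purely for illustration, that the `document` column holds the id of the retrieved document; the rows are made-up and `retrieval_traffic` is a hypothetical helper.

```python
from collections import Counter

def retrieval_traffic(corpus_ids, production_rows):
    """Count retrievals per document and list corpus documents never retrieved."""
    counts = Counter(row["document"] for row in production_rows)
    never_retrieved = [doc_id for doc_id in corpus_ids if counts[doc_id] == 0]
    return counts, never_retrieved

# Ids from the corpus dataset (placeholder values).
corpus_ids = ["doc-1", "doc-2", "doc-3"]

# Placeholder production rows; only the columns needed for the count are shown.
production_rows = [
    {"prediction-ID": "p-1", "user-query": "How do I reset my password?", "document": "doc-1"},
    {"prediction-ID": "p-2", "user-query": "I forgot my password", "document": "doc-1"},
    {"prediction-ID": "p-3", "user-query": "How do I change my email?", "document": "doc-3"},
]

counts, never_retrieved = retrieval_traffic(corpus_ids, production_rows)
busiest = counts.most_common(1)  # the document with the most traffic
```

A document that is never retrieved may be redundant or poorly embedded, while a document that dominates traffic marks an area where more detailed context could pay off.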
Tracing Search and Retrieval Systems with Arize
Issue #1: Bad Response
The first issue we see, and often the easiest to uncover, is bad responses. Navigate to the Embeddings projector tab to debug your search and retrieval.
If you have logged performance metrics (such as user feedback or eval scores on the LLM response), we will automatically surface any clusters that received poor feedback.
Note: You may need to create a custom metric based on an eval tag to surface these clusters.
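To make the idea of surfacing poorly rated clusters concrete, here is a small sketch that ranks clusters by their share of positive feedback. It is not Arize's implementation: the `cluster_id` field, the "positive"/"negative" feedback values, and the `min_size` cutoff are assumptions chosen for the example.

```python
from collections import defaultdict

def worst_clusters(rows, min_size=2):
    """Return (cluster_id, positive_rate) pairs, lowest positive rate first."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster_id"]].append(1.0 if row["user feedback"] == "positive" else 0.0)
    # Skip clusters too small for the rate to be meaningful.
    rates = {
        cluster: sum(votes) / len(votes)
        for cluster, votes in by_cluster.items()
        if len(votes) >= min_size
    }
    return sorted(rates.items(), key=lambda item: item[1])

# Placeholder rows; cluster ids would come from clustering the query embeddings.
feedback_rows = [
    {"cluster_id": 0, "user feedback": "positive"},
    {"cluster_id": 0, "user feedback": "positive"},
    {"cluster_id": 1, "user feedback": "negative"},
    {"cluster_id": 1, "user feedback": "negative"},
    {"cluster_id": 1, "user feedback": "positive"},
    {"cluster_id": 2, "user feedback": "negative"},
]

ranked = worst_clusters(feedback_rows)
```

The cluster at the top of `ranked` is the first candidate to inspect in the projector.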
Bad responses are usually a symptom of a problem upstream; the LLM rarely gives a poor response for no reason. The next sections show how to trace a bad response back to its root cause.
Issue #2: Don't Have Any Documents Close Enough
The retriever may not have found any documents close enough to the query embedding. This means that users are asking questions about topics that are missing from your corpus.
Arize can help you identify if there is context that is missing from your corpus.
By visualizing query density, you can see which topics need additional documentation to improve your chatbot's responses.
By setting the "production" dataset to the user queries and the "baseline" dataset to the context in your vector store, you can see whether there are clusters of user query embeddings with no nearby context embeddings, as seen in the example above.
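The same check can be approximated numerically outside the visualization. The sketch below flags queries whose best match in the corpus falls below a similarity threshold. It is a simplification under stated assumptions, not the method Arize uses: the toy vectors, the 0.5 threshold, and the helper names are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def queries_without_context(query_rows, corpus_rows, min_similarity=0.5):
    """Return (query, best_similarity) for queries with no sufficiently close document."""
    flagged = []
    for query in query_rows:
        best = max(
            cosine_similarity(query["query-vector"], doc["text_vector"])
            for doc in corpus_rows
        )
        if best < min_similarity:
            flagged.append((query["user-query"], round(best, 3)))
    return flagged

# "Baseline": context in the vector store (placeholder vectors).
baseline = [
    {"document_id": "doc-1", "text_vector": [1.0, 0.0, 0.0]},
    {"document_id": "doc-2", "text_vector": [0.0, 1.0, 0.0]},
]

# "Production": user queries (placeholder vectors).
production = [
    {"user-query": "How do I reset my password?", "query-vector": [0.9, 0.1, 0.0]},
    {"user-query": "Do you offer student discounts?", "query-vector": [0.0, 0.0, 1.0]},
]

missing_context = queries_without_context(production, baseline)
```

Queries that end up in `missing_context` point to topics where new documentation is likely needed. The right threshold depends on the embedding model, so treat 0.5 as a starting point to tune rather than a recommendation.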
Issue #3: Most Similar != Most Relevant Document
It is also possible that the retrieved document was the most similar, with the closest embedding to the query, but was not actually the most relevant document for answering the user’s question.
Arize can help uncover when irrelevant context is retrieved with LLM-assisted ranking metrics.
By ranking the relevance of the context retrieved, we can help you identify areas to dig into to improve the retrieval.
To catch cases where the most relevant context is not the most “similar,” Arize sends the user query and the retrieved context to GPT-4, or another LLM, and asks it to rank or score the relevance of the retrieved context.
In the example above, both pieces of retrieved context received an "irrelevant" score, that is, a precision@k of 0. This coincides with negative user feedback on the response.
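For readers who want to see how a precision@k value follows from per-document relevance labels, here is a minimal sketch. The judging prompt is hypothetical wording, not the prompt Arize sends, and no LLM is called: the `judge_outputs` list stands in for the replies a judge model might return.

```python
def build_relevance_prompt(query, document_text):
    """Build a judging prompt for one query/document pair (hypothetical wording)."""
    return (
        "You are judging search results.\n"
        f"Question: {query}\n"
        f"Document: {document_text}\n"
        'Answer with exactly one word: "relevant" or "irrelevant".'
    )

def parse_label(judge_output):
    """Map the judge's reply to a label; anything unexpected counts as irrelevant."""
    text = judge_output.strip().lower()
    return "relevant" if text.startswith("relevant") else "irrelevant"

def precision_at_k(labels, k):
    """Fraction of the top-k retrieved documents labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for label in labels[:k] if label == "relevant") / k

# Stand-ins for judge replies on two retrieved documents, in retrieval order.
judge_outputs = ["irrelevant", "Irrelevant."]
labels = [parse_label(output) for output in judge_outputs]
score = precision_at_k(labels, k=2)
```

With both documents judged irrelevant, `score` is 0.0, matching the example described above. Treating unparseable replies as irrelevant is a conservative choice; logging them separately would be a reasonable alternative.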
Troubleshooting Tip:
Found a problematic cluster you want to dig into, but don't want to manually sift through all of the prompts and responses? Use our OpenAI Cluster Summarization tool to get a summary of the selected cluster for quick analysis. Learn more about Cluster Summarization here.