Troubleshoot Retrieval with Vector Stores
Vector Stores enable teams to connect their own data to LLMs. A common application is chatbots looking across a company's knowledge base/context to answer specific questions.
Here's an example of what retrieval looks like for a Chatbot Application. A user asked a specific question, an embedding was generated from the query, and all relevant context in the knowledge base was pulled and added into the prompt to the LLM.
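The retrieval flow described above can be sketched in a few lines. This is a minimal illustration, not how Arize or any particular vector store implements it: the toy 3-dimensional vectors, the `retrieve` and `build_prompt` helpers, and the prompt wording are all assumptions made for the example, and it returns the top `k` documents rather than "all relevant" ones.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vector, knowledge_base, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    ranked = sorted(
        knowledge_base,
        key=lambda doc: cosine_similarity(query_vector, doc["vector"]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, documents):
    """Place the retrieved text ahead of the user's question."""
    context = "\n".join(f"- {doc['text']}" for doc in documents)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy embeddings; real ones come from an embedding model.
knowledge_base = [
    {"id": "doc-1", "text": "Refunds are issued within 5 days.", "vector": [0.9, 0.1, 0.0]},
    {"id": "doc-2", "text": "Passwords can be reset in Settings.", "vector": [0.0, 0.9, 0.1]},
    {"id": "doc-3", "text": "Refund requests need an order number.", "vector": [0.8, 0.2, 0.1]},
]
query_vector = [1.0, 0.0, 0.0]  # stand-in for the embedding of the question below
top_docs = retrieve(query_vector, knowledge_base, k=2)
prompt = build_prompt("How do I get a refund?", top_docs)
```

Both failure modes discussed below show up in this sketch: `retrieve` always returns `k` documents even when none are close to the query, and the closest document is not guaranteed to answer the question.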
When a RAG application gives a poor response, there can be several causes. The most common issues we see are:
- There weren't enough documents to answer the question
- The retrieved document wasn't relevant enough to answer the question
Arize helps evaluate how good retrieval is and identify where it went wrong.
Arize logs both a sample of the knowledge base and the production prompt/response pairs of the deployed application. Here's a high-level view of what is logged:
First, collect a sample of documents from your vector store so that there is something to compare against later. This comparison shows whether some sections are never retrieved, or whether some sections get so much traffic that you may want to expand the context or documents in that area.
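As an illustration, a corpus sample might be held in a pandas dataframe like the one below. The column names (`document_id`, `text`, `text_vector`), the variable name `corpus_knowledge_base_df`, and the toy vectors are hypothetical; use whatever names your own schema maps to.

```python
import pandas as pd

# Hypothetical corpus sample: one row per document chunk in the vector store.
corpus_knowledge_base_df = pd.DataFrame(
    {
        "document_id": ["doc-1", "doc-2"],
        "text": [
            "Refunds are issued within 5 days.",
            "Passwords can be reset in Settings.",
        ],
        # Toy 3-dimensional embeddings; real ones come from your embedding model.
        "text_vector": [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1]],
    }
)

# Basic sanity checks before logging: unique IDs and a consistent vector length.
ids_are_unique = corpus_knowledge_base_df["document_id"].is_unique
vector_lengths = corpus_knowledge_base_df["text_vector"].map(len)
vectors_are_consistent = vector_lengths.nunique() == 1
```

Checking ID uniqueness and vector length up front is a cheap way to catch extraction problems before the data is uploaded.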
# Logging the Sample of the Corpus
from arize.pandas.logger import Client, Schema
from arize.utils.types import EmbeddingColumnNames
API_KEY = 'ARIZE_API_KEY'
SPACE_KEY = 'YOUR SPACE KEY'
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)
response = arize_client.log(
    dataframe=corpus_knowledge_base_df,  # Refers to the dataframe with the example row above
    # ... remaining arguments (such as the schema) are not shown in this excerpt
)
We will also log the prompt/response pairs from the deployed application.
Example Dataframe: prompts-response.df
# Logging the production prompt and response pairs
# Declare embedding feature columns
# Define the Schema, including embedding information
schema = Schema(
template_version_column_name = "prompt_template_name",
# Log the dataframe with the schema mapping
response = arize_client.log(
    dataframe=prompts_response_df,  # Refers to the dataframe with the example row above
    # ... remaining arguments (such as the schema) are not shown in this excerpt
)
The first issue we see, and often the easiest to uncover, is bad responses. Navigate to the Embeddings projector tab to debug your search and retrieval.
If you have logged performance metrics (like user feedback or eval scores on the LLM response), we will automatically surface any clusters that received poor feedback.
Note: you may need to create a custom metric based on an eval tag to surface these clusters.
Create a custom metric in Arize to capture user feedback
Arize will automatically surface clusters with the worst user feedback
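To make the idea concrete, the sketch below averages a feedback score per cluster and sorts the worst clusters first. It is not Arize's implementation; the `cluster_id` and `user_feedback` field names and the 0/1 feedback encoding (0 = thumbs down, 1 = thumbs up) are assumptions for the example.

```python
from collections import defaultdict

def rank_clusters_by_feedback(records):
    """Return (cluster_id, mean_feedback, count) tuples, worst mean first."""
    scores = defaultdict(list)
    for record in records:
        scores[record["cluster_id"]].append(record["user_feedback"])
    summary = [
        (cluster_id, sum(values) / len(values), len(values))
        for cluster_id, values in scores.items()
    ]
    return sorted(summary, key=lambda row: row[1])

# Hypothetical records: one per prompt/response pair, already assigned to a cluster.
records = [
    {"cluster_id": "A", "user_feedback": 1},
    {"cluster_id": "A", "user_feedback": 1},
    {"cluster_id": "B", "user_feedback": 0},
    {"cluster_id": "B", "user_feedback": 1},
    {"cluster_id": "C", "user_feedback": 0},
    {"cluster_id": "C", "user_feedback": 0},
]
ranking = rank_clusters_by_feedback(records)
```

Keeping the count next to the mean helps avoid over-reading a cluster that contains only a handful of responses.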
Bad responses are often the result of something else going on. Your LLM is likely not giving a poor response for no reason. Next, we will show you how to trace it back to the root of the problem.
One possibility is that the retriever wasn’t able to find any documents close enough to the query embedding. This means that users are asking questions about context that is missing from your corpus.
Arize can help you identify if there is context that is missing from your corpus.
By visualizing query density, you can understand what topics you need to add additional documentation for in order to improve your chatbot responses.
Visualize Query Density (Euclidean or Cosine Distance)
By setting the "production" dataset to the user queries and the "baseline" dataset to the context in your vector store, you can see whether there are clusters of user query embeddings that have no nearby context embeddings, as seen in the example above.
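The same check can be approximated offline by measuring, for each query embedding, the distance to its nearest context embedding. The sketch below uses cosine distance; the `queries_missing_context` helper, the toy 2-dimensional vectors, and the 0.3 threshold are assumptions for illustration and would need tuning on real embeddings.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def queries_missing_context(query_vectors, context_vectors, threshold=0.3):
    """Return indices of queries whose nearest context embedding is farther than threshold."""
    missing = []
    for index, query in enumerate(query_vectors):
        nearest = min(cosine_distance(query, context) for context in context_vectors)
        if nearest > threshold:
            missing.append(index)
    return missing

# "Baseline": embeddings of the context in the vector store (toy values).
context_vectors = [[1.0, 0.0], [0.9, 0.1]]
# "Production": embeddings of user queries (toy values).
query_vectors = [[0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]

missing = queries_missing_context(query_vectors, context_vectors)
```

Here the second and third queries point in a direction that no context embedding covers, so their indices are returned as candidates for new documentation.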
It is also possible that the retrieved document had the embedding closest to the query, and so was considered the most similar, but wasn’t actually the most relevant document for answering the user’s question.
Arize can help uncover when irrelevant context is retrieved with LLM-assisted ranking metrics.
By ranking the relevance of the context retrieved, we can help you identify areas to dig into to improve the retrieval.
To catch these instances where the most relevant context might not be the most “similar”, Arize sends the user query and the retrieved context to GPT-4, or another LLM, and asks it to rank or score the relevance of the retrieved context.
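A rough sketch of this kind of LLM-assisted relevance check is shown below. The prompt wording, the `score_relevance` helper, and the `fake_llm` stand-in are all hypothetical and are not the prompt or code Arize uses; in practice `ask_llm` would wrap a call to GPT-4 or another model.

```python
RELEVANCE_TEMPLATE = (
    "You are judging search results.\n"
    "Question: {query}\n"
    "Retrieved text: {context}\n"
    "Reply with exactly one word: relevant or irrelevant."
)

def score_relevance(query, contexts, ask_llm):
    """Return one 0/1 relevance score per retrieved context.

    ask_llm is any callable that takes a prompt string and returns the model's reply.
    """
    scores = []
    for context in contexts:
        reply = ask_llm(RELEVANCE_TEMPLATE.format(query=query, context=context))
        scores.append(1 if reply.strip().lower() == "relevant" else 0)
    return scores

def fake_llm(prompt):
    """Offline stand-in for a real model call, so the sketch runs without an API key."""
    retrieved = prompt.split("Retrieved text: ")[1]
    return "relevant" if "refund" in retrieved.lower() else "irrelevant"

scores = score_relevance(
    "How do I get a refund?",
    ["Refunds are issued within 5 days.", "Passwords can be reset in Settings."],
    fake_llm,
)
```

Passing the model call in as a parameter keeps the scoring logic testable and makes it easy to swap one LLM for another.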
In the example above, both of the pieces of context retrieved got an "irrelevant" score or precision@k ranking of 0. We can also see this coincides with receiving negative user feedback on the response.
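For reference, precision@k over binary relevance labels can be computed as below. This is the standard definition (relevant documents among the top k, divided by k); the `precision_at_k` function name and the 0/1 label encoding are choices made for this example.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved documents judged relevant.

    relevance is a list of 0/1 labels in retrieval order (1 = relevant).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k

# Both retrieved documents judged irrelevant, as in the example above.
both_irrelevant = precision_at_k([0, 0], k=2)
# First document relevant, second not.
one_of_two = precision_at_k([1, 0, 1], k=2)
```

A score of 0 means that none of the top k documents was judged relevant, which is the case that coincides with negative user feedback in the example.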
Found a problematic cluster you want to dig into, but don't want to manually sift through all of the prompts and responses? Use our OpenAI Cluster Summarization tool to get a quick summary of the selected cluster. Learn more about Cluster Summarization here.