What is LLM Observability?
Introducing Arize support for LLM use cases
As teams are starting to deploy LLM applications into production, it's important to keep the pillars of LLM Observability in mind:
Evaluation: Using an LLM Eval to understand and validate the performance of your LLMs and to catch problems such as hallucinations or incorrect Q&A responses (a minimal eval sketch follows this list).
LLM Traces and Spans: Unique to LLM apps, capturing spans and traces from common LLM app frameworks like LangChain and LlamaIndex to understand where the application broke.
Prompt Analysis and Troubleshooting: Track performance to identify where prompts need to be improved, compare prompt performance across LLMs, and iterate using live production data.
Experiments and Datasets: Save test datasets for CI/CD testing of prompt/model changes and run experiments against those datasets when you make prompt changes.
Search and Retrieval: Adding your proprietary data to an LLM application is typically accomplished with retrieval-augmented generation (RAG). Troubleshooting and evaluating the retrieval process is paramount to LLM performance.
Fine-Tuning: Collect data examples or clusters of problems as datapoints for export into fine-tuning workflows.
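To make the Evaluation pillar concrete, here is a minimal sketch of an LLM-as-a-judge eval. It assumes an OpenAI-style chat completions client and uses a hypothetical EVAL_TEMPLATE; it is not Arize's own eval library, just the underlying pattern of asking a second LLM to label a response.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical eval template: ask a judge model to label a response.
EVAL_TEMPLATE = """You are evaluating a question-answering system.
Question: {question}
Reference context: {context}
Answer: {answer}
Respond with exactly one word: "correct" or "hallucinated"."""


def evaluate_answer(question: str, context: str, answer: str) -> str:
    """Label a single response using an LLM judge."""
    prompt = EVAL_TEMPLATE.format(question=question, context=context, answer=answer)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()


label = evaluate_answer(
    question="What year was Arize founded?",
    context="Arize AI was founded in 2020.",
    answer="Arize was founded in 1995.",
)
print(label)  # expected: "hallucinated"
```

In practice these labels are aggregated across production traffic so teams can find clusters of bad responses rather than judging one example at a time.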
As teams deploy LLMs to production, the same challenges around performance and task measurement still exist. As with ML observability, LLM observability doesn't stop at surfacing poor product experiences; it enables practitioners to root-cause issues and improve them.
Across this flow, there are multiple areas where teams often run into issues or challenges with LLMs.
Collecting evaluation data is hard: Collecting and aggregating user feedback to find which responses are bad is important. If you don't get user feedback back, there are ways of generating LLM-assisted evaluations instead.
LLM Hallucinations/Bad Responses: A big concern for many teams is the risk of models hallucinating and providing incorrect responses that are shared as factual information.
Bad Retrieval: If using RAG, it's hard to determine whether there was a bad retrieval, and if so, where in the process the issue occurred.
Bad Prompts and Prompt Templates: It's hard to know which prompt templates perform well and which perform poorly.
Tracing where the chain failed: If incorporating agents, it's difficult to get visibility into how the agent is performing and where in the chain a failure occurred (see the tracing sketch after this list).
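One way to get that visibility is to record a span around each step of the chain so a bad output can be traced back to the step that produced it. The sketch below uses the OpenTelemetry Python API; retrieve_documents and call_llm are hypothetical stand-ins for your own retrieval and LLM calls, and Arize's own instrumentation is not shown here.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Minimal tracer setup that prints spans to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-app")


def retrieve_documents(question: str) -> list[str]:
    """Hypothetical stand-in for a vector-store lookup."""
    return ["Arize AI was founded in 2020."]


def call_llm(question: str, documents: list[str]) -> str:
    """Hypothetical stand-in for the actual LLM call."""
    return "Arize was founded in 2020."


def answer_question(question: str) -> str:
    with tracer.start_as_current_span("rag.pipeline") as pipeline_span:
        pipeline_span.set_attribute("input.question", question)

        # Retrieval step: record what was pulled so bad retrievals are visible later.
        with tracer.start_as_current_span("rag.retrieval") as retrieval_span:
            documents = retrieve_documents(question)
            retrieval_span.set_attribute("retrieval.document_count", len(documents))

        # Generation step: record the response for later prompt and eval analysis.
        with tracer.start_as_current_span("llm.generation") as llm_span:
            answer = call_llm(question, documents)
            llm_span.set_attribute("llm.response", answer)

        return answer


print(answer_question("What year was Arize founded?"))
```

With each step recorded as its own span, "where did the chain fail" becomes a question of inspecting the attributes on the retrieval span versus the generation span, rather than re-reading application logs.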
Arize helps teams work backwards from the output to pinpoint where exactly the issue is stemming from across their LLM stack.
Arize Collects and/or Generates Evaluations: Arize helps collect all the responses and user feedback, or helps generate evaluations, so teams can look at clusters of bad responses for further troubleshooting and analysis.
Troubleshoot Bad Retrieval: Teams can troubleshoot search and retrieval with vector stores: Did the retrieval actually pull all the relevant context? Is there even enough relevant context to pull?
Catch LLM Hallucinations/Bad Responses: Understand why hallucinations are occurring by easily finding bad responses, uncovering trends, and understanding where improvements need to be made.
Uncover Bad Prompts and Prompt Templates: With analysis of prompt templates, teams can understand which templates perform better than others (see the sketch after this list).
Trace where the chain failed: With agent tracing and spans, teams can understand which calls failed and where in the span issues occurred.
Fine-tune LLMs: For teams that are fine-tuning, Arize has workflows to understand, for example, what types of phrases or inputs the model typically responds to poorly. Teams can then use those examples for fine-tuning. Coming soon.
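To illustrate the kind of analysis behind the prompt-template and bad-response workflows above, here is a minimal sketch in pandas. The column names and feedback values are assumptions for illustration; in practice the rows would come from your traced production responses rather than a hand-built dataframe.

```python
import pandas as pd

# Hypothetical traced production data: one row per LLM response,
# with the prompt template version used, user feedback, and an LLM eval label.
responses = pd.DataFrame(
    {
        "prompt_template": ["v1", "v1", "v2", "v2", "v2", "v1"],
        "user_feedback": [1, 0, 1, 1, 1, 0],  # 1 = thumbs up, 0 = thumbs down
        "eval_label": ["correct", "hallucinated", "correct", "correct", "correct", "hallucinated"],
    }
)

# Which prompt template performs worse? Compare average feedback per template.
template_performance = responses.groupby("prompt_template")["user_feedback"].mean()
print(template_performance)
# prompt_template
# v1    0.333333
# v2    1.000000

# Cluster of bad responses: rows the LLM judge flagged, for further troubleshooting
# or export into fine-tuning workflows.
bad_responses = responses[responses["eval_label"] == "hallucinated"]
print(bad_responses)
```

The same grouping works on eval labels, retrieval scores, or embedding clusters once those columns are captured alongside each response.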