As teams are starting to deploy LLM applications into production, it's important to keep the pillars of LLM Observability in mind:
LLM evaluation: Using an LLM Eval to understand and validate the performance of your LLMs and catch problems such as hallucinations or poor Q&A performance.
LLM traces and spans: Unique to LLM apps, capturing spans & traces from common LLM app frameworks like LangChain and LlamaIndex to understand where the application broke (see the sketch after this list).
Prompt engineering: Track performance to identify where prompts need to be improved, compare prompt performance across LLMs, and iterate using live production data.
Datasets and experiments: Save test datasets for CI/CD testing of prompt/model changes and run experiments against those datasets when you make prompt changes.
Search and retrieval (RAG): Adding your proprietary data into an LLM is accomplished with RAG. Troubleshooting and evaluating the retrieval process is paramount to LLM performance.
Fine-tuning: Collect data examples or clusters of problems as datapoints for export into fine-tuning workflows.
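To make the traces-and-spans pillar concrete, below is a minimal sketch of wrapping the steps of an LLM app in spans using the open-source OpenTelemetry SDK. The span names, attribute keys, and placeholder retrieval/LLM calls are assumptions for illustration; Arize's framework integrations for LangChain and LlamaIndex capture equivalent spans automatically.

```python
# Minimal tracing sketch with the OpenTelemetry SDK (illustrative, not Arize's API).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Print spans to the console for demonstration; a real setup would export to a backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-app")

def answer_question(question: str) -> str:
    # Parent span for the whole request; child spans mark the retrieval and LLM steps,
    # so a failure can be traced back to the step that caused it.
    with tracer.start_as_current_span("query") as span:
        span.set_attribute("input.value", question)
        with tracer.start_as_current_span("retrieval"):
            context = "...retrieved documents..."  # placeholder retrieval step
        with tracer.start_as_current_span("llm"):
            response = f"Answer based on: {context}"  # placeholder LLM call
        span.set_attribute("output.value", response)
        return response

print(answer_question("What is LLM observability?"))
```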
As teams deploy LLMs to production, the same challenges around performance and task measurement still exist. Similar to ML Observability, LLM Observability doesn't stop at surfacing poor product experiences; it enables practitioners to root-cause issues and improve them.
Across this flow, there are multiple areas where teams often run into issues or challenges with LLMs.
Arize helps teams work backwards from the output to pinpoint exactly where in their LLM stack the issue stems from.
Fine-tune LLMs: For teams that are fine-tuning, Arize has workflows to understand, for example, which types of phrases or inputs the model typically responds to poorly. Teams can then use those examples to drive fine-tuning. Coming soon.
User feedback: Collecting and aggregating user feedback to find which responses are bad is important. If you don't get user feedback back, there are ways of generating LLM-assisted evaluations (see the sketch after this list).
Hallucinations: A big concern for many teams is the risk of models hallucinating and presenting incorrect responses as factual information.
Retrieval (RAG): If using RAG, it's hard to determine whether there was a bad retrieval and, if so, where in the process the issue occurred.
Prompt templates: Knowing which prompt templates perform well and which perform poorly.
Agents: If incorporating agents, it's difficult to get visibility into how the agent is performing and where the chain failed.
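As a concrete illustration of the LLM-assisted evaluations mentioned above, the sketch below asks a judge LLM to label a response as factual or hallucinated. It assumes the OpenAI Python client (v1+) and an API key in the environment; the prompt wording, model name, and function names are illustrative assumptions, not Arize's built-in eval templates.

```python
# Minimal "LLM as judge" hallucination eval sketch (illustrative assumptions throughout).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EVAL_TEMPLATE = """You are checking whether an answer is supported by the reference text.
Reference: {reference}
Question: {question}
Answer: {answer}
Respond with exactly one word: "factual" or "hallucinated"."""

def evaluate_hallucination(question: str, reference: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable LLM works
        temperature=0,
        messages=[{
            "role": "user",
            "content": EVAL_TEMPLATE.format(
                reference=reference, question=question, answer=answer
            ),
        }],
    )
    # The returned label can be logged alongside the trace for aggregation.
    return response.choices[0].message.content.strip().lower()

label = evaluate_hallucination(
    question="When was the product launched?",
    reference="The product launched in March 2021.",
    answer="It launched in 2019.",
)
print(label)  # expected: "hallucinated"
```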
Responses and feedback: Arize helps collect all the responses and user feedback, or helps generate them, so teams can look at clusters of bad responses for further troubleshooting and analysis.
Search and retrieval: Teams can troubleshoot search and retrieval with vector stores, asking questions such as: did it actually pull all the relevant context? Is there even enough relevant context? (A minimal relevance check is sketched after this list.)
Understand why hallucinations are occurring by easily finding bad responses, uncovering trends, and understanding where improvements need to be made.
With analysis of the prompt template, teams can understand which templates are performing better than others.
With agent tracing and spans, teams can understand which calls failed or where in a span issues occurred.
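As a simple illustration of a retrieval-relevance check, the sketch below flags retrieved chunks whose embedding similarity to the query falls below a threshold. The threshold value and the assumption that query and chunk embeddings are already available as plain vectors are illustrative; they are not Arize's retrieval troubleshooting workflow.

```python
# Minimal retrieval-relevance check via cosine similarity (illustrative assumptions).
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def flag_irrelevant_chunks(query_embedding, chunk_embeddings, threshold=0.7):
    # Return indices of retrieved chunks whose similarity to the query falls below
    # the threshold, i.e. likely irrelevant context passed to the LLM.
    return [
        i for i, emb in enumerate(chunk_embeddings)
        if cosine_similarity(query_embedding, emb) < threshold
    ]

# Toy usage with 3-dimensional vectors; real embeddings have hundreds of dimensions.
query = [0.1, 0.9, 0.2]
chunks = [[0.1, 0.8, 0.3], [0.9, 0.0, 0.1]]
print(flag_irrelevant_chunks(query, chunks))  # the second chunk is flagged: [1]
```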
Introducing Arize support for LLM use cases