What is LLM Observability?
Introducing Arize support for LLM use cases
The use of GPT-4 as a replacement for various model tasks is growing daily. What many teams consider a model today may be just a prompt and response pair in the future.
As teams deploy LLMs to production, the same challenges around performance and task measurement still exist. Like ML Observability, LLM Observability does not stop at surfacing poor product experiences; it also enables practitioners to find the root cause and fix it.
Across this flow, there are multiple areas where teams often run into issues or challenges with LLMs.
- Collecting/Generating Evaluations is Hard: Collecting and aggregating user feedback is important for finding which responses are bad. If user feedback is not available, LLM-assisted evaluations can be generated instead.
- LLM Hallucinations/Bad Responses: A big concern for many teams is the risk of models hallucinating and presenting incorrect responses as factual information.
- Bad Retrieval: If using RAG, it’s hard to tell whether there was a bad retrieval and, if so, where in the process the issue occurred.
- Bad Prompts and Prompt Templates: It is hard to know which prompt templates perform well and which perform poorly.
- Tracing where the chain failed: If incorporating agents, it's difficult to get visibility into how the agent is performing and where the chain failed.
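To make the evaluation and prompt-template items above concrete, here is a minimal sketch of aggregating user feedback per prompt template. It is not Arize's API: the `Record` type, its field names, and the `thumbs_up` signal are hypothetical and only illustrate the idea of grouping feedback to see which templates produce bad responses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    """One prompt/response pair with feedback (all field names are hypothetical)."""
    template_id: str  # which prompt template produced the prompt
    prompt: str
    response: str
    thumbs_up: bool   # assumed binary user-feedback signal


def bad_response_rate_by_template(records):
    """Return {template_id: fraction of responses with negative feedback}."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for r in records:
        totals[r.template_id] += 1
        if not r.thumbs_up:
            bad[r.template_id] += 1
    return {t: bad[t] / totals[t] for t in totals}


# Illustrative data only, not real results.
records = [
    Record("summarize_v1", "Summarize: ...", "A summary.", True),
    Record("summarize_v1", "Summarize: ...", "Unrelated text.", False),
    Record("summarize_v2", "Summarize briefly: ...", "A short summary.", True),
]
print(bad_response_rate_by_template(records))
```

When no user feedback exists, the `thumbs_up` field could instead be filled by an LLM-assisted evaluation, under the same grouping logic.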
Arize helps teams work backwards from the output to pinpoint where exactly the issue is stemming from across their LLM stack.
- Arize Collects &/or Generates Evaluations: Arize helps collect all the responses and user feedback, or helps generate evaluations, so teams can look at clusters of bad responses for further troubleshooting and analysis.
- Trace where the chain failed: With agent tracing and spans, teams can understand what calls failed, or where in the span issues occurred.
- Fine tune LLMs: For teams that are fine tuning, Arize has workflows to understand, for example, which types of phrases or inputs the model typically gives a bad response on. Teams can then use those findings to guide the fine tuning. Coming soon.
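The idea of tracing where a chain failed can be sketched with a small tree of spans. This is an illustration under stated assumptions, not Arize's tracing implementation: the `Span` class, its `status` values, and the span names below are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One step in an agent or chain run (hypothetical structure)."""
    name: str
    status: str = "ok"  # assumed values: "ok" or "error"
    children: List["Span"] = field(default_factory=list)


def failure_path(span: Span) -> Optional[List[str]]:
    """Return span names from this span down to the deepest errored span.

    Children are searched first, in order, so the path ends at the most
    specific failing step. Returns None if nothing in the subtree errored.
    """
    for child in span.children:
        sub = failure_path(child)
        if sub is not None:
            return [span.name] + sub
    if span.status == "error":
        return [span.name]
    return None


# Illustrative trace: the agent failed because a tool call inside the LLM step failed.
trace = Span("agent", "error", [
    Span("retriever", "ok"),
    Span("llm_call", "error", [Span("tool:search", "error")]),
])
print(failure_path(trace))  # ['agent', 'llm_call', 'tool:search']
```

Working backwards from the failed output to the deepest errored span mirrors the troubleshooting direction described above: start at the response and narrow down to the call that caused it.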