What is LLM Observability?
LLM observability is complete visibility into every layer of an LLM-based software system: the application, the prompt, and the response.
LLM observability rests on five pillars:

Evaluation - This helps you measure how well the response answers the prompt, for example by using a separate evaluation LLM.
LLM Traces & Spans - This gives you visibility into where more complex or agentic workflows break.
Prompt Engineering - Iterating on a prompt template can help improve LLM results.
Search and Retrieval - Improving the context that goes into the prompt can lead to better LLM responses.
Fine-tuning - Fine-tuning generates a new model that is more aligned with your exact usage conditions for improved performance.
Evaluation is a measure of how well the response answers the prompt.
There are two common ways to evaluate LLM responses:
You can collect feedback directly from your users. This is the simplest approach, but users are often unwilling to provide feedback or simply forget to do so, and gathering it at scale presents its own challenges.
The other approach is to use a separate LLM to evaluate the quality of the response for a particular prompt. This is far more scalable and very useful, but it comes with the typical drawbacks of LLMs: added cost and latency, and the possibility that the evaluator itself is wrong.
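Below is a minimal sketch of this LLM-as-judge pattern, using the OpenAI Python client directly rather than the Phoenix Evals library; the judge model, prompt wording, and label set are illustrative assumptions.

```python
# Minimal LLM-as-judge sketch. The judge model, prompt wording, and labels
# are illustrative assumptions, not the Phoenix Evals API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EVAL_TEMPLATE = """You are grading an answer to a question.
Question: {question}
Answer: {answer}
Respond with exactly one word: correct or incorrect."""


def evaluate_response(question: str, answer: str) -> str:
    """Ask a separate evaluation LLM whether the answer addresses the question."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of judge model
        temperature=0,
        messages=[
            {"role": "user", "content": EVAL_TEMPLATE.format(question=question, answer=answer)}
        ],
    )
    return result.choices[0].message.content.strip().lower()


label = evaluate_response(
    "What is LLM observability?",
    "Complete visibility into the application, the prompt, and the response.",
)
print(label)  # expected: "correct" or "incorrect"
```

Run over a sample of production prompt/response pairs, this turns spot checks into a repeatable evaluation.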
Learn more about the Phoenix LLM Evals library.
For more complex or agentic workflows, it may not be obvious which call in a span, or which span in your trace (a run through your entire use case), is causing the problem. You may need to repeat the evaluation process on several spans before you narrow it down.
This pillar is largely about diving deep into the system to isolate the issue you are investigating.
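The sketch below shows how a hypothetical retrieval-then-generation workflow can be broken into nested OpenTelemetry spans so that each step is visible on its own; the span names, attributes, and placeholder helpers are assumptions, not a prescribed schema.

```python
# Sketch: wrap each step of a hypothetical retrieval-then-generation pipeline
# in its own span. Span names, attributes, and the placeholder helpers are
# illustrative assumptions.
from opentelemetry import trace

# Assumes a TracerProvider that exports to your tracing backend is configured elsewhere.
tracer = trace.get_tracer("my-llm-app")


def retrieve(question: str) -> list[str]:
    # Placeholder for a real vector-store lookup.
    return ["Example context chunk about LLM observability."]


def generate(question: str, documents: list[str]) -> str:
    # Placeholder for a real LLM call that uses the retrieved documents.
    return "Example response grounded in the retrieved context."


def answer_question(question: str) -> str:
    # One nested span per step, so a failing step appears as its own node in the trace.
    with tracer.start_as_current_span("answer_question") as root:
        root.set_attribute("input.value", question)

        with tracer.start_as_current_span("retrieve_documents") as retrieval:
            documents = retrieve(question)
            retrieval.set_attribute("document.count", len(documents))

        with tracer.start_as_current_span("generate_response") as generation:
            response = generate(question, documents)
            generation.set_attribute("output.value", response)

        return response
```

Each nested span then shows up as its own node in an OpenTelemetry-compatible backend such as Phoenix, so you can attach evaluations to the exact step that failed.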
Learn more about Phoenix Traces and Spans support.
Prompt engineering is the cheapest, fastest, and often the highest-leverage way to improve the performance of your application. Often, LLM performance can be improved simply by comparing different prompt templates, or iterating on the one you have. Prompt analysis is an important component in troubleshooting your LLM's performance.
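As a sketch, the snippet below runs the same questions through two prompt templates so their responses can be compared side by side; the templates and the call_llm parameter are hypothetical stand-ins for your own.

```python
# Sketch: run the same questions through two prompt templates so the responses
# can be compared side by side. The templates and the call_llm parameter are
# hypothetical stand-ins for your own.
TEMPLATE_V1 = "Answer the question: {question}"
TEMPLATE_V2 = (
    "You are a concise technical assistant. Use only the context below.\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer in at most two sentences."
)


def compare_templates(questions, context, call_llm):
    """Return paired responses so each template variant can be evaluated."""
    results = []
    for question in questions:
        results.append(
            {
                "question": question,
                "v1": call_llm(TEMPLATE_V1.format(question=question)),
                "v2": call_llm(TEMPLATE_V2.format(context=context, question=question)),
            }
        )
    return results
```

Pairing this with the evaluation approach above gives you a quantitative way to pick the better template.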
Learn about prompt engineering in Arize.
A common way to improve performance is to feed more relevant information into the prompt.
If you can retrieve more relevant information, your prompt improves automatically. Troubleshooting retrieval systems, however, is more complex. Are there queries that don’t have sufficient context? Should you add more context for these queries to get better answers? Or should you change your embeddings or chunking strategy?
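One way to start answering those questions, sketched below, is to flag queries whose best retrieved match scores poorly; the similarity threshold and the search_index callable are illustrative assumptions.

```python
# Sketch: flag queries whose best retrieved chunk is a weak match, which often
# signals missing context or a chunking problem. The similarity threshold and
# the search_index callable are assumptions.
def find_low_context_queries(queries, search_index, min_score=0.75):
    """search_index(query, top_k) is assumed to return [(chunk, score), ...]."""
    flagged = []
    for query in queries:
        hits = search_index(query, top_k=3)
        best_score = max(score for _, score in hits) if hits else 0.0
        if best_score < min_score:
            flagged.append({"query": query, "best_score": best_score})
    return flagged
```

Queries flagged this way are candidates for adding documents, changing the chunking strategy, or revisiting the embedding model.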
Learn more about troubleshooting search and retrieval with Phoenix.
Fine-tuning generates a new model that is more aligned with your exact usage conditions. It is expensive and difficult, and it may need to be repeated as the underlying LLM or other conditions of your system change. It is a very powerful technique, but it requires much more effort and complexity than the other pillars.
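If you do go down this path, the usual first step is assembling curated prompt/response pairs into a training file; the chat-style JSONL format below is a sketch, and the exact schema depends on your fine-tuning provider.

```python
# Sketch: export curated prompt/response pairs as chat-style JSONL training data.
# The record format is an assumption; check the exact schema your fine-tuning
# provider expects.
import json


def export_finetuning_data(examples, path="finetune_train.jsonl"):
    """Write (prompt, ideal_response) pairs, one JSON record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, ideal_response in examples:
            record = {
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": ideal_response},
                ]
            }
            f.write(json.dumps(record) + "\n")
```

Even then, weigh fine-tuning against prompt engineering and retrieval improvements, which are usually cheaper first steps.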