Concepts: LLM Evaluation
Grading the performance of your LLM application
Copyright © 2023 Arize AI, Inc
It’s extremely easy to start building an AI application using LLMs because developers no longer have to collect labeled data or train a model. They only need to write a prompt that asks the model for what they want. However, this comes with tradeoffs. LLMs are general-purpose models that aren’t fine-tuned for a specific task. With a standard prompt, these applications demo well, but in production environments they often fail in more complex scenarios.
To ensure that your application performs well, you need a way to judge the quality of your LLM outputs. You can do this in a variety of ways, for example by evaluating relevance, hallucination rate, and latency.
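As a rough illustration of how per-response grades roll up into the kinds of metrics mentioned above, here is a minimal Python sketch. The `EvalRecord` fields and the `summarize` function are hypothetical names chosen for this example, not part of any library, and the sketch assumes each response has already been graded (by a human or by an LLM judge).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalRecord:
    """One graded LLM response. All field names are hypothetical."""
    relevant: bool       # grader judged the response relevant to the query
    hallucinated: bool   # grader judged the response unsupported by its context
    latency_s: float     # end-to-end response time in seconds


def summarize(records: list[EvalRecord]) -> dict[str, float]:
    """Aggregate per-response grades into application-level metrics."""
    if not records:
        raise ValueError("need at least one graded record")
    n = len(records)
    return {
        # Fraction of responses graded as relevant (0.0 to 1.0).
        "relevance_rate": sum(r.relevant for r in records) / n,
        # Share of responses graded as hallucinated, as a percentage.
        "hallucination_pct": 100.0 * sum(r.hallucinated for r in records) / n,
        # Average response time across all records.
        "mean_latency_s": mean(r.latency_s for r in records),
    }


if __name__ == "__main__":
    graded = [
        EvalRecord(relevant=True, hallucinated=False, latency_s=1.0),
        EvalRecord(relevant=True, hallucinated=True, latency_s=2.0),
        EvalRecord(relevant=False, hallucinated=False, latency_s=3.0),
        EvalRecord(relevant=True, hallucinated=False, latency_s=2.0),
    ]
    print(summarize(graded))
```

Keeping the grades as simple per-record booleans makes it straightforward to add further criteria later without changing how the summary is computed.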
Evaluation tells you whether your application has improved, and by how much, when you adjust your prompts or retrieval strategy. When running evaluations, it's important to select a comprehensive dataset, because the dataset determines how trustworthy your evaluation metrics are and how well they generalize to production use. An application can score highly on a limited dataset yet perform poorly in real-world scenarios.
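The two ideas in this paragraph can be sketched in a few lines of Python: measuring the change in a metric between two versions of an application, and seeing why a score from a small dataset deserves less trust. The function names are hypothetical, the grades are assumed to be pass/fail results on the same examples for both versions, and the standard error uses the simple binomial approximation, which is only a rough guide.

```python
import math


def pass_rate(grades: list[bool]) -> float:
    """Fraction of examples graded as passing."""
    if not grades:
        raise ValueError("need at least one grade")
    return sum(grades) / len(grades)


def improvement(baseline: list[bool], candidate: list[bool]) -> float:
    """Change in pass rate when both versions are graded on the same dataset."""
    if len(baseline) != len(candidate):
        raise ValueError("both versions must be graded on the same examples")
    return pass_rate(candidate) - pass_rate(baseline)


def standard_error(rate: float, n: int) -> float:
    """Binomial standard error of a pass rate measured on n examples."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(rate * (1.0 - rate) / n)


if __name__ == "__main__":
    # Hypothetical grades for an old and a new prompt on the same four examples.
    old_prompt = [True, False, True, False]
    new_prompt = [True, True, True, False]
    print("improvement:", improvement(old_prompt, new_prompt))

    # The same 90% score is far less certain on 10 examples than on 1000.
    print("standard error, n=10:  ", standard_error(0.9, 10))
    print("standard error, n=1000:", standard_error(0.9, 1000))
```

A larger dataset narrows the uncertainty but does not by itself make the dataset representative; covering the range of inputs seen in production still has to be checked separately.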