Overview: LLM Evaluation
Grading the performance of your LLM application
Last updated
Was this helpful?
Grading the performance of your LLM application
Last updated
Was this helpful?
In order to ensure that your application performs well, you need a way to judge the quality of your LLM outputs. You can evaluate many different factors, such as relevance, hallucination %, and latency.
Without evals, AI engineers donβt know if prompt changes will actually improve performance. They don't know whether changing a prompt, LLM parameter or retrieval step, will it break a use case or actually improve performance?
With evals, when you adjust your prompts, agents, and retrieval, you will know whether your application has improved.
Once you have evaluation metrics, you need a good evaluation process. You need a set of examples that you've curated to test against a variety of use cases, and run those tests on every build in your CI/CD pipeline. Now you can constantly test your application for reliability on every change.
It's important to select a comprehensive dataset which will determine how trustworthy and generalizable your evaluation metrics are to production use. A limited dataset could showcase high scores on evaluation metrics, but perform poorly in real-world scenarios.
To get started with code, check out the Quickstart guide for LLM evaluation.
Learn more about evaluation concepts by reading Evaluation Basics and our definitive guide on LLM app evaluation.
Trying to setup online evaluations that run continuously on your data? See our online evaluations guide.