To ensure that your application performs well, you need a way to judge the quality of your LLM outputs. You can evaluate many different factors, such as relevance, hallucination rate, and latency.
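One common way to score a factor like relevance is with an LLM-as-judge. The sketch below assumes the OpenAI Python client and uses an illustrative judge prompt and label set; the model name and prompt are placeholders, not a prescribed template.

```python
# Minimal LLM-as-judge sketch: label whether a response is relevant to the question.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment;
# the judge prompt and label set are illustrative, not a required format.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Is the answer relevant to the question? Reply with exactly one word: "relevant" or "irrelevant"."""

def judge_relevance(question: str, answer: str) -> str:
    """Ask a judge model to label the answer as relevant or irrelevant."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(judge_relevance("What is our refund window?", "Refunds are accepted within 30 days."))
```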
Without evals, AI engineers don't know whether their changes will actually improve performance. Will changing a prompt, an LLM parameter, or a retrieval step break a use case, or will it make the application better?
With evals, when you adjust your prompts, agents, and retrieval, you will know whether your application has improved.
It's important to select a comprehensive dataset, because it determines how trustworthy and generalizable your evaluation metrics are to production use. An application can score highly on a limited dataset yet still perform poorly in real-world scenarios.
Once you have evaluation metrics, you need a good evaluation process. Curate a set of examples that covers a variety of use cases, and run those tests on every build in your CI/CD pipeline. Now you can constantly test your application for reliability on every change.
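As a sketch of what that process can look like, the example below runs a small hand-curated test set through a hypothetical `run_app` entry point and checks each answer with pytest, so the same checks can run on every CI build. The cases and the substring-match scoring are illustrative stand-ins for your own dataset and metrics.

```python
# Sketch of a build-time eval: a hand-curated test set run on every CI build with pytest.
# `run_app` stands in for your application's entry point (prompt + retrieval + LLM call),
# and the substring check stands in for whatever scoring you use -- both are assumptions here.
import pytest

CURATED_CASES = [
    {"question": "What is our refund window?", "must_mention": "30 days"},
    {"question": "Do you ship internationally?", "must_mention": "international"},
    # ...add cases that cover each use case you care about
]

def run_app(question: str) -> str:
    """Placeholder for your real application."""
    raise NotImplementedError

@pytest.mark.parametrize("case", CURATED_CASES)
def test_answer_covers_expected_fact(case):
    answer = run_app(case["question"])
    assert case["must_mention"].lower() in answer.lower()
```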
To get started with code, check out the guide for LLM evaluation.
Learn more by reading about evaluation concepts.
Trying to set up online evaluations that run continuously on your data? See our guide.
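As a rough sketch of the idea, an online evaluation can be a scheduled job that pulls recent production records and scores them on a rolling basis. Here, `fetch_recent_records` and `judge_relevance` are hypothetical placeholders for your own data access and scoring, and the 15-minute window is arbitrary.

```python
# Sketch of an online evaluation loop: periodically score recent production traffic.
# `fetch_recent_records` and `judge_relevance` are placeholders for your own
# data source and scoring function; the schedule and metric are illustrative.
import time

def fetch_recent_records(minutes: int = 15) -> list[dict]:
    """Placeholder: return recent {question, answer} pairs from your logs or tracing store."""
    raise NotImplementedError

def judge_relevance(question: str, answer: str) -> bool:
    """Placeholder: return True if the answer is judged relevant (e.g. by an LLM judge)."""
    raise NotImplementedError

while True:
    records = fetch_recent_records(minutes=15)
    if records:
        relevant = sum(judge_relevance(r["question"], r["answer"]) for r in records)
        print(f"relevance rate over last 15 min: {relevant / len(records):.0%}")
    time.sleep(15 * 60)  # re-run every 15 minutes
```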