LLM as a Judge
The standard for evaluating text is human labeling, but human evaluation cannot scale as high-quality LLM outputs become cheaper and faster to produce. In this context, evaluating the performance of LLM applications is best tackled by using an LLM as a judge. The Phoenix LLM Evals library is designed for simple, fast, and accurate LLM-based evaluations.
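As a rough sketch of what an LLM-as-a-judge eval can look like in code, the example below grades retrieved documents for relevance to a query. It assumes the phoenix.evals helpers (llm_classify, OpenAIModel) and the built-in RAG relevance template; the column names and judge model shown are illustrative, so adapt them to your data and installed version.

```python
# A minimal sketch of an LLM-as-a-judge relevance eval, assuming the
# phoenix.evals helpers and the built-in RAG relevance template.
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# Example data: each row pairs a user query with a retrieved document.
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "reference": [
            "Phoenix is an open-source library for LLM observability and evals."
        ],
    }
)

# The judge model labels each row using the template's rails
# (e.g. relevant vs. unrelated) and can also return its reasoning.
eval_df = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),  # illustrative judge model
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(eval_df[["label", "explanation"]])
```

The output is a dataframe with one label (and explanation) per input row, which you can log back to Phoenix or aggregate into evaluation metrics.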
Test LLM evaluation in playground
Run evaluations in the UI
Run evaluations with code
Learn about LLM evaluation concepts
To get started, check out the Quickstart guide for evaluation. You can use Arize evaluation templates or build your own. Read our best practices to learn how to run robust, task-based evaluations on your LLM applications, and see our comprehensive guide to building and benchmarking LLM evals.