The gold standard for evaluating text is human labeling. As LLMs become smarter, faster, and cheaper to use, LLM-based evaluation has become a popular way to measure application performance, especially for qualitative criteria that cannot be captured with code-based evaluations. The library is designed for simple, fast, and accurate LLM-based evaluations.
LLM evaluation requires you to define:
The input dataset: depending on what you are trying to measure or critique, the input data to your evaluation can consist of your application's inputs, outputs, and prompt variables.
The eval prompt template: this is where you specify your criteria, input data, and output labels to judge the quality of the LLM output.
The output: the LLM evaluator generates eval labels and explanations that show why it assigned a particular label or score, as illustrated in the sketch after this list.
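To make these three pieces concrete, here is a minimal sketch of a relevance eval. The template text, the "relevant"/"irrelevant" label set, and the judge_relevance helper are illustrative assumptions rather than this library's API; the only real dependency is an OpenAI-compatible chat client.

```python
# A minimal sketch of an LLM-based relevance eval.
# The template, labels, and judge_relevance() are illustrative, not this library's API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Eval prompt template: criteria, input data slots, and allowed output labels.
EVAL_TEMPLATE = """You are evaluating whether a response answers the user's question.

[Question]: {app_input}
[Response]: {app_output}

Does the response directly and correctly answer the question?
Answer with exactly one label, "relevant" or "irrelevant",
followed by a one-sentence explanation on the next line.
"""

def judge_relevance(app_input: str, app_output: str, model: str = "gpt-4o-mini") -> tuple[str, str]:
    """Ask the judge model for a label and an explanation."""
    prompt = EVAL_TEMPLATE.format(app_input=app_input, app_output=app_output)
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judgments are easier to compare across runs
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content.strip()
    label, _, explanation = text.partition("\n")
    return label.strip().lower(), explanation.strip()

label, explanation = judge_relevance(
    app_input="What is the capital of France?",
    app_output="The capital of France is Paris.",
)
print(label, "-", explanation)
```

The judge is asked for both a label and an explanation so that failing examples can be inspected and the eval prompt itself can be debugged.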
LLM evaluation is extremely flexible because you can specify the rules and criteria mostly in plain language, much as you would instruct human evaluators to grade your responses. You can run thousands of evaluations across curated data without the need for human annotation, which speeds up prompt iteration and lets you deploy your application to production with more confidence.
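As a rough illustration of scaling this up, the snippet below applies the judge_relevance function from the sketch above to a small curated dataset; in practice the same loop runs over thousands of rows. The rows and column names are made up for illustration.

```python
# A hypothetical batch run over a curated dataset, reusing judge_relevance()
# from the sketch above; the example rows and column names are assumptions.
import pandas as pd

dataset = pd.DataFrame(
    [
        {"app_input": "What is the capital of France?",
         "app_output": "The capital of France is Paris."},
        {"app_input": "Summarize the refund policy.",
         "app_output": "Our store opens at 9 AM on weekdays."},
    ]
)

# Label every row without human annotation; the results can be aggregated
# to track quality across prompt iterations.
dataset[["label", "explanation"]] = dataset.apply(
    lambda row: pd.Series(judge_relevance(row["app_input"], row["app_output"])),
    axis=1,
)
print(dataset[["app_input", "label"]])
```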