arize-phoenix-evals
Tooling to evaluate LLM applications, including RAG relevance, answer relevance, and more.
Phoenix's approach to LLM evals is notable for the following reasons:
- Includes pre-tested templates and convenience functions for a set of common Eval "tasks"
- Data science rigor applied to the testing of model and template combinations
- Designed to run as fast as possible on batches of data
- Includes benchmark datasets and tests for each eval function
Install the arize-phoenix-evals sub-package via pip:
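```sh
pip install arize-phoenix-evals
```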
Note that you will also need to install the SDK of the LLM vendor you would like to use with LLM Evals. For example, to use OpenAI's GPT-4, install the OpenAI Python SDK:
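```sh
pip install openai
```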
Here is an example of running the RAG relevance eval on a dataset of Wikipedia questions and answers:
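A minimal sketch of what this looks like, using the `phoenix.evals` convenience API (`OpenAIModel`, `download_benchmark_dataset`, `llm_classify`) and the built-in RAG relevance template; it assumes an `OPENAI_API_KEY` is set in the environment, and the benchmark task/dataset names and column renames shown here are illustrative and may differ between releases:

```python
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

# Download a benchmark dataset of Wikipedia questions and retrieved documents
# (task and dataset names are illustrative and may vary by release).
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-train"
)

# Rename columns to match the variables expected by the relevance template.
df = df.rename(columns={"query_text": "input", "document_text": "reference"})

# The LLM used as the judge; requires OPENAI_API_KEY in the environment.
# (The model parameter name may vary across versions.)
model = OpenAIModel(model="gpt-4", temperature=0.0)

# Rails constrain the judge's output to the template's allowed labels.
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

# Run the eval over the whole dataframe in batches.
eval_df = llm_classify(
    dataframe=df,
    model=model,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=rails,
)
df["eval_relevance"] = eval_df["label"]
```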
To learn more about LLM Evals, see the Phoenix LLM Evals documentation.