Use Phoenix Evaluators
The following are simple functions built on top of the LLM Evals building blocks that are pre-tested with benchmark data. Every eval template is tested against golden data that ships with the LLM Evals library's benchmark datasets, targeting precision of 70-90% and F1 of 70-85%.
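For example, a pre-tested eval can be run by pairing `llm_classify` with a benchmarked template and its rails. The sketch below assumes the arize-phoenix evals package and an OpenAI API key are available; import paths and the template's expected column names can differ slightly between releases (older releases expose the same names under `phoenix.experimental.evals`).

```python
# A minimal sketch, assuming the arize-phoenix evals package and OPENAI_API_KEY are set.
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# Toy query/document pairs; column names must match the template's variables
# ("input" and "reference" for the relevance template; check your version).
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?", "What is Phoenix?"],
        "reference": [
            "Phoenix is an open-source library for LLM observability and evals.",
            "Bananas are a good source of potassium.",
        ],
    }
)

# Rails constrain the LLM output to the labels the template was benchmarked with.
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

relevance_classifications = llm_classify(
    dataframe=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4"),
    rails=rails,
)
print(relevance_classifications["label"])
```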
Supported Models
The models are instantiated and passed into the LLM Eval functions, and they are also directly callable with a string prompt (see the sketch after the table below). We currently support a growing set of models for LLM Evals; please check out the API section for usage.
| Model | Support |
| --- | --- |
| GPT-4 | ✔ |
| GPT-3.5 Turbo | ✔ |
| GPT-3.5 Instruct | ✔ |
| Azure Hosted OpenAI | ✔ |
| Palm 2 Vertex | ✔ |
| AWS Bedrock | ✔ |
| LiteLLM | ✔ |
| Huggingface Llama7B | (use LiteLLM) |
| Anthropic | ✔ |
| Cohere | (use LiteLLM) |
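As a rough sketch of the two usage modes, the snippet below instantiates a few of the model wrappers and calls one directly with a string. The class names follow the Phoenix evals API, but the specific model ids and the LiteLLM routing string are placeholder examples; supply whatever credentials, regions, or deployment names your provider requires.

```python
# A minimal sketch of instantiating eval models and calling one directly with a string.
# Model ids and LiteLLM routing strings below are placeholder examples.
from phoenix.evals import AnthropicModel, LiteLLMModel, OpenAIModel

gpt4 = OpenAIModel(model="gpt-4", temperature=0.0)
claude = AnthropicModel(model="claude-2")  # example Anthropic model id
llama_7b = LiteLLMModel(model="huggingface/meta-llama/Llama-2-7b-chat-hf")  # HF via LiteLLM

# 1) Models are directly callable with a string prompt:
print(gpt4("In one word, what color is the sky?"))

# 2) The same objects are what you pass to the eval functions:
# llm_classify(dataframe=df, template=..., model=gpt4, rails=[...])
```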
How we benchmark pre-tested evals
The diagram above shows examples of the different environments the Eval harness is designed to run in. The benchmarking environment enables testing of Eval model and Eval template performance against a curated set of golden data. This approach allows us to compare models easily in an understandable format (a sketch of the scoring step follows the table below):
| Eval metric | Model A | Model B |
| --- | --- | --- |
| Precision | 0.94 | 0.94 |
| Recall | 0.75 | 0.71 |
| F1 | 0.83 | 0.81 |
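Conceptually, benchmarking comes down to running the eval over a golden dataset and scoring its labels against the human annotations; precision, recall, and F1 numbers like the ones above come from exactly this kind of comparison. Below is a rough sketch of that scoring step using scikit-learn on an illustrative hand-labeled frame, not a specific Phoenix API.

```python
# A minimal sketch of the scoring step used in benchmarking: run an eval over a
# golden dataset, then compare its labels to the human labels. The tiny hand-made
# frame below stands in for a real benchmark dataset.
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

golden = pd.DataFrame(
    {
        "true_label": ["relevant", "unrelated", "relevant", "unrelated"],
        "eval_label": ["relevant", "unrelated", "unrelated", "unrelated"],
    }
)

y_true = golden["true_label"] == "relevant"
y_pred = golden["eval_label"] == "relevant"

print("precision:", precision_score(y_true, y_pred))  # 1.00 on this toy frame
print("recall:   ", recall_score(y_true, y_pred))     # 0.50
print("f1:       ", f1_score(y_true, y_pred))         # 0.67 (rounded)
```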