Use Phoenix Evaluators
The following are simple functions built on top of the LLM Evals building blocks and pre-tested against benchmark data.
All eval templates are tested against golden datasets that ship as part of the LLM Evals library's benchmark data, targeting precision of 70-90% and F1 of 70-85%.
Models are instantiated once and passed into the LLM Eval functions; they can also be called directly with a string. A short usage sketch follows the table below.
We currently support a growing set of models for LLM Evals; please check the API section for usage details.
| Model | Support |
| --- | --- |
| GPT-4 | ✔ |
| GPT-3.5 Turbo | ✔ |
| GPT-3.5 Instruct | ✔ |
| Azure Hosted Open AI | ✔ |
| Palm 2 Vertex | ✔ |
| AWS Bedrock | ✔ |
| Litellm | ✔ |
| Huggingface Llama7B | (use litellm) |
| Anthropic | ✔ |
| Cohere | (use litellm) |
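As a minimal sketch of how a model is instantiated and then reused, both inside an eval function and as a direct string call: this assumes the `OpenAIModel` wrapper, the `llm_classify` helper, and the hallucination template exports from `phoenix.evals`; exact module paths, parameter names, and expected column names may differ between Phoenix versions.

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Instantiate the eval model once; the same object is reused across eval calls.
model = OpenAIModel(model="gpt-4", temperature=0.0)

# The model is also directly callable with a string.
print(model("Is the sky blue on a clear day? Answer yes or no."))

# Example rows; the hallucination template reads input/reference/output columns.
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "reference": ["Phoenix is an open-source LLM observability library."],
        "output": ["Phoenix is a closed-source billing system."],
    }
)

# Pass the instantiated model to the LLM Eval function.
results = llm_classify(
    dataframe=df,
    model=model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
)
print(results["label"])
```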
The diagram above shows examples of the different environments the Eval harness is designed to run in. The benchmarking environment enables testing of Eval model and Eval template performance against a curated golden dataset.
The above approach allows us to compare models easily in an understandable format:
| Metric | Eval Model A | Eval Model B |
| --- | --- | --- |
| Precision | 0.94 | 0.94 |
| Recall | 0.75 | 0.71 |
| F1 | 0.83 | 0.81 |
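To reproduce this kind of comparison, the eval's predicted labels on a golden dataset can be scored against the human annotations with standard metrics. The sketch below uses scikit-learn; the `golden_df` columns (`true_label`, `label`) and the `"hallucinated"` positive label are illustrative assumptions, not fixed Phoenix column names.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def benchmark_eval(golden_df, positive_label="hallucinated"):
    # golden_df: a benchmark dataset with a human-annotated "true_label" column
    # and a "label" column produced by the eval (e.g. the llm_classify output).
    y_true = golden_df["true_label"]
    y_pred = golden_df["label"]
    return {
        "precision": precision_score(y_true, y_pred, pos_label=positive_label),
        "recall": recall_score(y_true, y_pred, pos_label=positive_label),
        "f1": f1_score(y_true, y_pred, pos_label=positive_label),
    }
```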
The pre-tested evals and the benchmark datasets they were validated on:

| Eval | Tested on |
| --- | --- |
| Retrieval Eval | MS Marco, WikiQA |
| Hallucination Eval | Hallucination QA Dataset, Hallucination RAG Dataset |
| Toxicity Eval | WikiToxic |
| Q&A Eval | WikiQA |
| Summarization Eval | GigaWorld, CNNDM, Xsum |
| Code Generation Eval | WikiSQL, HumanEval, CodeXGlu |
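Each pre-tested eval is exposed as a prompt template plus output rails that plug into the same flow shown earlier. For example, a toxicity check might look like the sketch below, which assumes `TOXICITY_PROMPT_TEMPLATE` and `TOXICITY_PROMPT_RAILS_MAP` exports and an `input` column; exact names can vary between Phoenix releases.

```python
import pandas as pd
from phoenix.evals import (
    TOXICITY_PROMPT_RAILS_MAP,
    TOXICITY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

model = OpenAIModel(model="gpt-4", temperature=0.0)

# The toxicity template reads the text under test from an "input" column.
df = pd.DataFrame({"input": ["You are a wonderful collaborator.", "I hate you."]})

toxicity_results = llm_classify(
    dataframe=df,
    model=model,
    template=TOXICITY_PROMPT_TEMPLATE,
    rails=list(TOXICITY_PROMPT_RAILS_MAP.values()),
)
print(toxicity_results["label"])  # one rail label per row
```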