Running Pre-Tested Evals
The following are simple functions built on top of the LLM Evals building blocks that are pre-tested with benchmark datasets.
All eval templates are tested against golden datasets that are available as part of the LLM eval library's benchmarked datasets, and they target precision of 70-90% and F1 of 70-85%.
The models are instantiated and usable in the LLM Eval functions, and they are also directly callable with strings.
We currently support a growing set of models for LLM Evals; please check out the API section for usage.
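As a quick illustration of both usage patterns, the sketch below instantiates an eval model, passes it to a pre-tested eval function, and then calls the model directly with a string. The import path and names used here (`phoenix.experimental.evals`, `OpenAIModel`, `llm_classify`, `HALLUCINATION_PROMPT_TEMPLATE`, `HALLUCINATION_PROMPT_RAILS_MAP`) are assumptions and may differ in your installed version of the library.

```python
# Minimal sketch -- the import path, class names, and template names below
# are assumptions; check the API section for the exact identifiers.
import pandas as pd
from phoenix.experimental.evals import (
    OpenAIModel,
    llm_classify,
    HALLUCINATION_PROMPT_TEMPLATE,
    HALLUCINATION_PROMPT_RAILS_MAP,
)

# Instantiate an eval model once and reuse it across eval functions.
model = OpenAIModel(model_name="gpt-4", temperature=0.0)

# 1) Use the model inside a pre-tested eval function.
df = pd.DataFrame(
    {
        "input": ["Where is the Eiffel Tower?"],
        "reference": ["The Eiffel Tower is in Paris, France."],
        "output": ["The Eiffel Tower is located in Paris."],
    }
)
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())
results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
)

# 2) Call the model directly with a string.
print(model("Summarize what an LLM eval template is in one sentence."))
```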
| Model | Support |
| --- | --- |
| GPT-4 | ✔ |
| GPT-3.5 Turbo | ✔ |
| GPT-3.5 Instruct | ✔ |
| Azure Hosted OpenAI | ✔ |
| PaLM 2 (Vertex) | ✔ |
| AWS Bedrock | ✔ |
| LiteLLM | (coming soon) |
| Hugging Face Llama 7B | (coming soon) |
| Anthropic | (coming soon) |
| Cohere | (coming soon) |
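As a rough sketch of how the supported backends might be selected, the wrapper class names and keyword arguments below (`VertexAIModel`, `BedrockModel`, and the constructor parameters) are assumptions; consult the API section for the exact constructors.

```python
# Sketch only -- wrapper class names and keyword arguments are assumptions;
# see the API section for the exact constructors and parameters.
from phoenix.experimental.evals import OpenAIModel, VertexAIModel, BedrockModel

gpt4 = OpenAIModel(model_name="gpt-4", temperature=0.0)  # OpenAI (Azure-hosted OpenAI takes extra Azure parameters)
palm2 = VertexAIModel(model_name="text-bison")           # PaLM 2 on Vertex
claude = BedrockModel(model_id="anthropic.claude-v2")    # AWS Bedrock

# Every wrapper exposes the same interface, so any of them can be passed
# to the eval functions or called directly with a string.
print(gpt4("Reply with the single word OK."))
```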
The above diagram shows examples of the different environments the Eval harness is designed to run in. The benchmarking environment is designed to enable testing of the Eval model and Eval template performance against a designated set of datasets.
The above approach allows us to compare models easily in an understandable format:
| Metric | Eval Model 1 | Eval Model 2 |
| --- | --- | --- |
| Precision | 0.94 | 0.94 |
| Recall | 0.75 | 0.71 |
| F1 | 0.83 | 0.81 |
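The scores above are produced by comparing each eval model's predicted labels against the golden dataset's ground-truth labels. A minimal sketch of that comparison, using scikit-learn and hypothetical column names:

```python
# Sketch: benchmark an eval's predicted labels against a golden dataset.
# The column names ("golden_label", "eval_label") are hypothetical.
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

benchmark_df = pd.DataFrame(
    {
        "golden_label": ["hallucinated", "factual", "hallucinated", "factual"],
        "eval_label":   ["hallucinated", "factual", "factual", "factual"],
    }
)

# Treat "hallucinated" as the positive class.
y_true = benchmark_df["golden_label"] == "hallucinated"
y_pred = benchmark_df["eval_label"] == "hallucinated"

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```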
| Eval | Tested on |
| --- | --- |
| Retrieval Eval | MS Marco, WikiQA |
| Hallucination Eval | Hallucination QA Dataset, Hallucination RAG Dataset |
| Toxicity Eval | WikiToxic |
| Q&A Eval | WikiQA |
| Summarization Eval | GigaWorld, CNNDM, Xsum |
| Code Generation Eval | WikiSQL, HumanEval, CodeXGlu |
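As a rough sketch of how one of these benchmarks can be reproduced, the snippet below downloads a golden dataset and runs the Hallucination Eval over it; the predicted labels can then be compared against the golden labels exactly as in the metric sketch above. The `download_benchmark_dataset` helper, its task and dataset identifiers, and the column names are assumptions and may differ in your version of the library.

```python
# Sketch only -- the helper name, task/dataset identifiers, and column names
# are assumptions; see the benchmarking notebooks for the exact usage.
from phoenix.experimental.evals import (
    OpenAIModel,
    llm_classify,
    HALLUCINATION_PROMPT_TEMPLATE,
    HALLUCINATION_PROMPT_RAILS_MAP,
    download_benchmark_dataset,
)

# Pull one of the golden datasets used to benchmark the Hallucination Eval.
df = download_benchmark_dataset(
    task="binary-hallucination-classification",
    dataset_name="halueval_qa_data",
)

# The dataframe columns must match the template variables (e.g. {input},
# {reference}, {output}); rename as needed -- the names here are assumptions.
df = df.rename(columns={"query": "input", "response": "output"})

model = OpenAIModel(model_name="gpt-4", temperature=0.0)
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

# Classify each row with the pre-tested template, then score the results
# against the golden labels to reproduce precision/recall/F1.
results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
)
```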