Running Pre-Tested Evals
The following are simple functions built on top of the LLM Evals building blocks that have been pre-tested against benchmark datasets.
Every eval template is tested against golden datasets, which ship as part of the LLM eval library's benchmarked datasets, and each template targets a precision of 70-90% and an F1 of 70-85%.
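For example, a pre-tested eval such as the hallucination eval can typically be run in a handful of lines. The sketch below is illustrative only: it assumes the library exposes `OpenAIModel`, `llm_classify`, and the hallucination prompt template and rails under `phoenix.experimental.evals`; the exact names and import path may differ in your version, so check the API section.

```python
import pandas as pd

# Import path and names are assumptions based on the library's Python API;
# see the API section for the exact module and function names.
from phoenix.experimental.evals import (
    OpenAIModel,
    llm_classify,
    HALLUCINATION_PROMPT_TEMPLATE,
    HALLUCINATION_PROMPT_RAILS_MAP,
)

# A small dataframe of rows to evaluate; the hallucination template expects
# "input", "reference", and "output" columns.
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital of France."],
        "output": ["The capital of France is Lyon."],
    }
)

# Any supported model from the table below can be used as the eval model.
model = OpenAIModel(model_name="gpt-4", temperature=0.0)

# Run the pre-tested hallucination eval; `rails` restricts the eval model's
# answer to the template's expected labels (e.g. "factual" / "hallucinated").
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())
results = llm_classify(
    dataframe=df,
    model=model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=rails,
)
print(results["label"])
```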
Supported Models
The models can be instantiated and passed to the LLM Eval functions; they can also be called directly with a string prompt (see the sketch after the table below).
We currently support a growing set of models for LLM Evals; please check the API section for usage details.
Model | Support |
---|---|
GPT-4 | ✔ |
GPT-3.5 Turbo | ✔ |
GPT-3.5 Instruct | ✔ |
Azure Hosted OpenAI | ✔ |
Palm 2 Vertex | ✔ |
AWS Bedrock | ✔ |
Litellm | (coming soon) |
Huggingface Llama7B | (coming soon) |
Anthropic | (coming soon) |
Cohere | (coming soon) |
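As a minimal sketch of the two usage patterns described above, assuming the `OpenAIModel` wrapper from the library's Python API (the import path and constructor arguments may differ in your version):

```python
from phoenix.experimental.evals import OpenAIModel  # assumed import path; see the API section

# Instantiate one of the supported eval models.
model = OpenAIModel(model_name="gpt-4", temperature=0.0)

# Models are directly callable with a plain string prompt, which is handy
# for a quick sanity check of credentials and connectivity.
print(model("Respond with the single word: ready"))

# The same instance can then be passed as the `model` argument to the
# pre-tested eval functions such as `llm_classify` shown earlier.
```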
How we benchmark pre-tested evals
The diagram above shows examples of the different environments the Eval harness is designed to run in. The benchmarking environment lets us test the performance of each eval model and eval template combination against a curated set of golden datasets (a sketch of this process follows the table below).
This approach makes it easy to compare models side by side in a consistent format:
Hallucination Eval | GPT-4 | GPT-3.5 |
---|---|---|
Precision | 0.94 | 0.94 |
Recall | 0.75 | 0.71 |
F1 | 0.83 | 0.81 |
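The following is a rough sketch of that benchmarking loop, under the same assumptions as the earlier snippets: the eval template and model under test are run over a golden dataset with human-verified labels, and standard precision, recall, and F1 scores are computed with scikit-learn. The inline golden dataframe is illustrative only; in practice one of the library's benchmarked datasets would be used.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

# Assumed import path and names; see the API section for your version.
from phoenix.experimental.evals import (
    OpenAIModel,
    llm_classify,
    HALLUCINATION_PROMPT_TEMPLATE,
    HALLUCINATION_PROMPT_RAILS_MAP,
)

# A tiny stand-in golden dataset: eval inputs plus a human-verified label.
golden_df = pd.DataFrame(
    {
        "input": ["What is the capital of France?", "Who wrote Hamlet?"],
        "reference": ["Paris is the capital of France.", "Hamlet was written by Shakespeare."],
        "output": ["The capital of France is Lyon.", "Hamlet was written by Shakespeare."],
        "true_label": ["hallucinated", "factual"],
    }
)

# Run the eval template under test with the eval model under test.
model = OpenAIModel(model_name="gpt-4", temperature=0.0)
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())
predicted = llm_classify(
    dataframe=golden_df,
    model=model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=rails,
)["label"]

# Score the eval's predictions against the golden labels.
y_true = golden_df["true_label"]
print("precision:", precision_score(y_true, predicted, pos_label="hallucinated"))
print("recall:   ", recall_score(y_true, predicted, pos_label="hallucinated"))
print("f1:       ", f1_score(y_true, predicted, pos_label="hallucinated"))
```

Repeating this over the full golden datasets for each model is what produces comparison tables like the one above.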