Running Pre-Tested Evals

The following are simple functions built on top of the LLM Evals building blocks that are pre-tested against benchmark datasets.
All eval templates are tested against golden datasets that ship as part of the LLM Evals library's benchmarked datasets, and they target precision of 70-90% and F1 of 70-85%.
Retrieval Eval
Tested on: MS Marco, WikiQA
Hallucination Eval
Tested on: Hallucination QA Dataset, Hallucination RAG Dataset
Summarization Eval
Tested on: GigaWord, CNNDM, Xsum
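
As an illustration of how one of these pre-tested evals is invoked, the sketch below runs the hallucination template over a small DataFrame. It is a minimal example rather than the full benchmark harness: the import path (phoenix.evals), the HALLUCINATION_PROMPT_TEMPLATE and HALLUCINATION_PROMPT_RAILS_MAP constants, and the input/reference/output column names are assumptions based on the Phoenix evals API and may differ between versions.

import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# One row per LLM response to check: the question, the retrieved
# reference text, and the answer the application produced.
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital and largest city of France."],
        "output": ["The capital of France is Paris."],
    }
)

eval_model = OpenAIModel(model_name="gpt-4", temperature=0.0)

# Rails constrain the eval LLM's answer to the template's allowed labels.
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

results = llm_classify(
    dataframe=df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=eval_model,
    rails=rails,
)
print(results["label"])  # e.g. "factual" or "hallucinated" per row

The same pattern applies to the retrieval and summarization evals by swapping in their templates and rails.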

Supported Models

The models below can be instantiated and used in the LLM Eval functions. They are also directly callable with strings.
from phoenix.evals import OpenAIModel  # adjust the import path to your Phoenix version if needed

model = OpenAIModel(model_name="gpt-4", temperature=0.6)
model("What is the largest coastal city in France?")
We currently support a growing set of models for LLM Evals; please check out the API section for usage.
Model                  Support
GPT-4                  ✔
GPT-3.5 Turbo          ✔
GPT-3.5 Instruct       ✔
Azure Hosted OpenAI    ✔
Palm 2 Vertex          ✔
AWS Bedrock            ✔
LiteLLM                ✔
Huggingface Llama7B    coming soon
Anthropic              coming soon
Cohere                 coming soon
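
Any supported model can be swapped into the example above. The sketch below shows the same pattern with the LiteLLM wrapper; the LiteLLMModel class name comes from the Phoenix evals API, but the exact constructor parameters (for example, model vs. model_name) vary between versions, so treat them as assumptions and check the API reference.

from phoenix.evals import LiteLLMModel

# Route the eval through LiteLLM; the parameter may be named model_name
# in some Phoenix versions.
model = LiteLLMModel(model="gpt-3.5-turbo")
model("What is the largest coastal city in France?")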

How we benchmark pre-tested evals

The diagram above shows examples of the different environments the Eval harness is designed to run in. The benchmarking environment is designed to test Eval model and Eval template performance against a chosen set of golden datasets.
This approach lets us compare models in a simple, understandable format:
Hallucination Eval    GPT-4    GPT-3.5
Precision             0.94     0.94
Recall                0.75     0.71
F1                    0.83     0.81
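
Conceptually, each benchmark run reduces to executing an eval template over a golden dataset and scoring the predicted labels against the golden labels. The sketch below illustrates that scoring step; it is not part of Phoenix and simply uses scikit-learn's metrics with hypothetical labels.

from sklearn.metrics import f1_score, precision_score, recall_score

# Golden labels from the benchmark dataset and the labels the eval produced.
golden_labels = ["hallucinated", "factual", "factual", "hallucinated"]
eval_labels = ["hallucinated", "factual", "hallucinated", "hallucinated"]

precision = precision_score(golden_labels, eval_labels, pos_label="hallucinated")
recall = recall_score(golden_labels, eval_labels, pos_label="hallucinated")
f1 = f1_score(golden_labels, eval_labels, pos_label="hallucinated")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")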