Use Phoenix Evaluators
The following are simple functions built on top of the LLM Evals building blocks that are pre-tested with benchmark data. Every eval template is tested against golden data that ships with the LLM Evals library's benchmark datasets, targeting precision of 70-90% and F1 of 70-85%.
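For example, a pre-tested eval can be run by pairing `llm_classify` with a benchmarked template and its rails. The sketch below assumes the arize-phoenix evals package and an OpenAI API key are available; import paths and the template's expected column names can differ slightly between releases (older releases expose the same names under `phoenix.experimental.evals`).

```python
# A minimal sketch, assuming the arize-phoenix evals package and OPENAI_API_KEY are set.
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# Toy query/document pairs; column names must match the template's variables
# ("input" and "reference" for the relevance template; check your version).
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?", "What is Phoenix?"],
        "reference": [
            "Phoenix is an open-source library for LLM observability and evals.",
            "Bananas are a good source of potassium.",
        ],
    }
)

# Rails constrain the LLM output to the labels the template was benchmarked with.
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

relevance_classifications = llm_classify(
    dataframe=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4"),
    rails=rails,
)
print(relevance_classifications["label"])
```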
Supported Models
The models are instantiated and passed into the LLM Eval functions, and they are also directly callable with a string prompt (see the sketch after the table below). We currently support a growing set of models for LLM Evals; please check out the API section for usage.
| Model | Support |
| --- | --- |
| GPT-4 | ✔ |
| GPT-3.5 Turbo | ✔ |
| GPT-3.5 Instruct | ✔ |
| Azure Hosted OpenAI | ✔ |
| Palm 2 Vertex | ✔ |
| AWS Bedrock | ✔ |
| LiteLLM | ✔ |
| Huggingface Llama7B | (use LiteLLM) |
| Anthropic | ✔ |
| Cohere | (use LiteLLM) |
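As a rough sketch of the two usage modes, the snippet below instantiates a few of the model wrappers and calls one directly with a string. The class names follow the Phoenix evals API, but the specific model ids and the LiteLLM routing string are placeholder examples; supply whatever credentials, regions, or deployment names your provider requires.

```python
# A minimal sketch of instantiating eval models and calling one directly with a string.
# Model ids and LiteLLM routing strings below are placeholder examples.
from phoenix.evals import AnthropicModel, LiteLLMModel, OpenAIModel

gpt4 = OpenAIModel(model="gpt-4", temperature=0.0)
claude = AnthropicModel(model="claude-2")  # example Anthropic model id
llama_7b = LiteLLMModel(model="huggingface/meta-llama/Llama-2-7b-chat-hf")  # HF via LiteLLM

# 1) Models are directly callable with a string prompt:
print(gpt4("In one word, what color is the sky?"))

# 2) The same objects are what you pass to the eval functions:
# llm_classify(dataframe=df, template=..., model=gpt4, rails=[...])
```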
How we benchmark pre-tested evals
The diagram above shows examples of the different environments the Eval harness is designed to run in. The benchmarking environment enables testing of Eval model and Eval template performance against a curated set of golden data. This approach allows us to compare models easily in an understandable format (a sketch of the scoring step follows the table below):
| Eval metric | Model A | Model B |
| --- | --- | --- |
| Precision | 0.94 | 0.94 |
| Recall | 0.75 | 0.71 |
| F1 | 0.83 | 0.81 |
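Conceptually, benchmarking comes down to running the eval over a golden dataset and scoring its labels against the human annotations; precision, recall, and F1 numbers like the ones above come from exactly this kind of comparison. Below is a rough sketch of that scoring step using scikit-learn on an illustrative hand-labeled frame, not a specific Phoenix API.

```python
# A minimal sketch of the scoring step used in benchmarking: run an eval over a
# golden dataset, then compare its labels to the human labels. The tiny hand-made
# frame below stands in for a real benchmark dataset.
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

golden = pd.DataFrame(
    {
        "true_label": ["relevant", "unrelated", "relevant", "unrelated"],
        "eval_label": ["relevant", "unrelated", "unrelated", "unrelated"],
    }
)

y_true = golden["true_label"] == "relevant"
y_pred = golden["eval_label"] == "relevant"

print("precision:", precision_score(y_true, y_pred))  # 1.00 on this toy frame
print("recall:   ", recall_score(y_true, y_pred))     # 0.50
print("f1:       ", f1_score(y_true, y_pred))         # 0.67 (rounded)
```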