Running Pre-Tested Evals
The following are simple functions built on top of the LLM Evals building blocks and pre-tested against benchmark datasets.
The models can be instantiated and passed to any LLM Eval function. They can also be called directly with a string prompt.
from phoenix.evals import OpenAIModel  # import path assumed; adjust for your Phoenix version
model = OpenAIModel(model_name="gpt-4", temperature=0.6)
model("What is the largest coastal city in France?")
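The same model object can then be handed to one of the pre-tested evals. The sketch below assumes the Phoenix evals API (llm_classify together with the built-in RAG relevance template and rails map); the column names, import paths, and label values are assumptions and may differ in your version.

import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    llm_classify,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
)

# Toy dataframe with the columns the relevance template expects (assumed names).
df = pd.DataFrame(
    {
        "input": ["What is the largest coastal city in France?"],
        "reference": ["Marseille is the largest city on the French Mediterranean coast."],
    }
)

# Run the pre-tested relevance eval with the instantiated model.
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
relevance_df = llm_classify(
    dataframe=df,
    model=OpenAIModel(model_name="gpt-4", temperature=0.0),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=rails,
)
print(relevance_df["label"])  # e.g. "relevant" / "unrelated" (assumed labels)

Each pre-tested eval ships with a template and a fixed set of rails, so the model's output is snapped to a small set of labels that can be scored against the benchmark datasets.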
The above diagram shows examples of the different environments the Eval harness is designed to run in. The benchmarking environment enables testing of the Eval model and Eval template performance against a defined set of benchmark datasets.
This approach makes it easy to compare models in an understandable format: