Online Evaluations
Learn how to set up tasks to run ongoing evaluations
As your application scales and the number of production logs increases, it can be cumbersome to manage your data manually. Tasks let you create and run automated actions on your LLM spans.
An example task is an LLM evaluation that runs on a specific set of your traces. You can select which traces to evaluate based on filters and a sampling rate. Without tasks, you would need to run these evaluators in code yourself and re-upload the results into Arize.
For development, you can automatically run an evaluation on every trace that doesn't have an evaluation yet. In production, you can sample a set of your traffic to run evaluations for monitoring. Our evaluation scripts run every 5 minutes.
Online evaluation setup
To get started:
1. Select your traces
2. Select your LLM evaluator
3. Create your evaluation template
Select your traces
Select which model you are trying to evaluate, and how much traffic you want to sample. With low traffic or during development, you can sample every trace. In a production application, you likely don’t want to run an evaluation on every user message.
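To make the idea of a sampling rate concrete, here is a minimal sketch of how a fraction of traces could be selected for evaluation. It is an illustration only, not how Arize implements sampling: the function name `should_evaluate` and the hash-based bucketing are assumptions chosen so that the same trace ID always gets the same decision.

```python
import hashlib


def should_evaluate(trace_id: str, sampling_rate: float) -> bool:
    """Hypothetical helper: decide whether a trace falls inside the sample.

    The trace ID is hashed into a number in [0, 1). The trace is selected
    when that number is below the sampling rate, so a rate of 1.0 selects
    every trace and a rate of 0.0 selects none.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sampling_rate


# Development: evaluate every trace.
print(should_evaluate("trace-123", 1.0))

# Production: evaluate roughly 10% of traces (trace IDs here are made up).
trace_ids = [f"trace-{i}" for i in range(1000)]
sampled = [t for t in trace_ids if should_evaluate(t, 0.10)]
print(len(sampled))
```

Hashing the ID rather than drawing a random number keeps the decision stable if the same trace is considered more than once.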
Select your LLM evaluator
We currently support evaluations with the following LLMs: OpenAI, AzureOpenAI, Bedrock, and VertexAI. The ability to specify your own custom LLM is coming soon.
We recommend using the latest GPT model at temperature 0, which gives the most accurate results in our evaluation tests.
Create your evaluation template
Need help creating an eval? ✨ Use Copilot to write your evaluation template for you.
You can write your own custom evaluation prompt template, or use one of our pre-tested evaluation templates. Within the template, you’ll need to pipe in the exact prompt template variables so the LLM knows how to evaluate the span using your own data.
The example here for User Frustration uses {attributes.llm.input_messages.0.message.content}.
This is from our semantic conventions in OpenInference, which groups all LLM calls together into a standardized format. Here are some examples of ways to pipe in the right spans from your trace into the evaluation template:
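As a rough illustration of how a dotted template variable could map onto span data, the sketch below fills a template from a span represented as a nested Python dictionary. This is an assumption made for explanation, not Arize's implementation: the helpers `resolve_path` and `fill_template`, the nested span layout, and the sample message are all hypothetical. Only the variable name comes from the example above.

```python
import re
from typing import Any


def resolve_path(span: dict, path: str) -> Any:
    """Hypothetical helper: walk a dotted path through nested dicts and lists.

    Numeric segments such as "0" are treated as list indexes when the
    current value is a list.
    """
    current: Any = span
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]
        else:
            current = current[part]
    return current


def fill_template(template: str, span: dict) -> str:
    """Replace each {dotted.path} placeholder with the value found in the span."""
    return re.sub(
        r"\{([^{}]+)\}",
        lambda match: str(resolve_path(span, match.group(1))),
        template,
    )


# A made-up span, shaped so the dotted path from the example resolves.
span = {
    "attributes": {
        "llm": {
            "input_messages": [
                {"message": {"content": "Why is this still broken?"}},
            ]
        }
    }
}

template = "User message: {attributes.llm.input_messages.0.message.content}"
print(fill_template(template, span))
```

In this sketch the `0` in the path selects the first input message, so changing it to another index would select a different message from the same span.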