Tasks (Ongoing Evaluation)
As your application scales and the number of production logs increases, it can be cumbersome to manage your data manually. Tasks let you create and run automated actions on your LLM spans.
An example task is an LLM evaluation that runs on a specific set of your traces. You can select which traces to evaluate using filters and a sampling rate. Normally, you would need to run these evaluators in code yourself and re-upload the results into Arize.
For development, you can automatically run an evaluation on every trace that doesn't have an evaluation yet. In production, you can sample a set of your traffic to run evaluations for monitoring. Our evaluation scripts run every 5 minutes.
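To make that concrete, here is a minimal sketch of the manual workflow a task automates, assuming your spans have already been exported to a pandas DataFrame. The model name, file name, and column names are illustrative, not part of the Arize API:

```python
# Hypothetical sketch of the manual loop that a Task runs for you automatically.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

EVAL_TEMPLATE = (
    "You are assessing whether the user sounds frustrated.\n"
    "User message: {user_message}\n"
    "Answer with a single word: frustrated or not_frustrated."
)

def evaluate_span(user_message: str) -> str:
    # Latest GPT model at temperature 0, as recommended below for evaluations.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": EVAL_TEMPLATE.format(user_message=user_message)}],
    )
    return response.choices[0].message.content.strip()

# Illustrative export of span data; in practice this comes from your traces.
spans_df = pd.read_parquet("exported_spans.parquet")
spans_df["user_frustration_label"] = spans_df["user_message"].apply(evaluate_span)
# ...the labels would then be uploaded back into Arize as evaluations.
# A Task runs this kind of loop for you on a 5-minute schedule.
```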
To get started:
Select your traces
Select your LLM evaluator
Create your evaluation template
Select which model you are trying to evaluate and how much traffic you want to sample. In low-traffic applications or during development, you can sample every trace, but in a production application you likely don’t want to run an evaluation on every user message.
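As a rough illustration of what the sampling rate amounts to (not Arize's internal logic), sampling simply means evaluating a fraction of your spans. This sketch reuses the hypothetical spans_df and evaluate_span from the example above:

```python
import os

# Illustrative only: sample every trace in development, 10% of traffic in production.
SAMPLING_RATE = 1.0 if os.getenv("ENV") == "development" else 0.10
sampled_df = spans_df.sample(frac=SAMPLING_RATE, random_state=42)
sampled_df["user_frustration_label"] = sampled_df["user_message"].apply(evaluate_span)
```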
Currently we support OpenAI, and we are working on adding support for additional LLM providers, including AzureOpenAI, Bedrock, and more.
We recommend using the latest GPT model at temperature 0, which gives the most accurate results in our evaluation tests.
You can write your own custom evaluation prompt template, or use one of our pre-tested evaluation templates. Within the template, you’ll need to pipe in the exact prompt template variables so the LLM knows how to evaluate the span using your own data.
The example here for User Frustration uses {attributes.llm.input_messages.0.message.content}.
This comes from our semantic conventions in openinference, which group all LLM calls into a standardized format. Here are some examples of ways to pipe the right spans from your trace into the evaluation template:
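For instance, here is a minimal, hypothetical sketch of how a variable path such as {attributes.llm.input_messages.0.message.content} can be resolved against a span's flattened attributes. The attribute values are made up; only the paths follow the openinference-style flattened naming:

```python
# Illustrative sketch only (not the exact Arize implementation): resolving
# template variable paths against a span's flattened attributes.
span_attributes = {
    # First message sent to the LLM on this span
    "attributes.llm.input_messages.0.message.content": "I've asked three times and still have no refund!",
    # First message returned by the LLM on this span
    "attributes.llm.output_messages.0.message.content": "I'm sorry for the delay. Let me check on that for you.",
    # Raw span input/output, when present
    "attributes.input.value": "...",
    "attributes.output.value": "...",
}

template = (
    "You are assessing whether the user sounds frustrated.\n"
    "User message: {attributes.llm.input_messages.0.message.content}\n"
    "Answer with a single word: frustrated or not_frustrated."
)

# str.format() treats dots as attribute access, so substitute each
# variable path with a plain string replacement instead.
prompt = template
for path, value in span_attributes.items():
    prompt = prompt.replace("{" + path + "}", str(value))

print(prompt)
```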