Online Evaluations

Learn how to set up tasks to run ongoing evaluations

As your application scales and the number of production logs grows, it can be cumbersome to manage your data manually. Tasks let you create and run automated actions on your LLM spans.

An example task is an LLM evaluation that runs on a specific set of your traces. You select which traces to evaluate using filters and a sampling rate. Normally, you would need to run these evaluators in code yourself and re-upload the results to Arize.

For development, you can automatically run an evaluation on every trace that doesn't have an evaluation yet. In production, you can sample a portion of your traffic to run evaluations for monitoring. Our evaluation scripts run every 5 minutes.
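
For comparison, here is a rough sketch of the code-based workflow that a task automates. It assumes the arize-phoenix-evals package and a dataframe of exported spans; the column name, template, and example data are illustrative, and the final step of uploading the results back to Arize is omitted.

import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Illustrative dataframe of exported LLM spans; "input" is an assumed column name.
spans_df = pd.DataFrame({"input": ["Why is my order late again?!"]})

# Illustrative classification template; {input} is filled in from the dataframe column.
FRUSTRATION_TEMPLATE = """
Decide whether the user sounds frustrated in the message below.
Respond with a single word: "frustrated" or "ok".

[BEGIN MESSAGE]
{input}
[END MESSAGE]
"""

# Run the LLM evaluator over every row at temperature 0.
evals_df = llm_classify(
    dataframe=spans_df,
    model=OpenAIModel(temperature=0.0),
    template=FRUSTRATION_TEMPLATE,
    rails=["frustrated", "ok"],
)

# evals_df now holds one label per span; you would then re-upload these
# labels to Arize as evaluations attached to the original spans.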

Setting up an ongoing LLM evaluation

Online evaluation setup

To get started:

  1. Select your traces

  2. Select your LLM evaluator

  3. Create your evaluation template

Select your traces

Select which model you want to evaluate and how much of its traffic you want to sample. For low-traffic applications or during development, you can sample every trace, but in a production application you likely don’t want to run an evaluation on every user message.

Select your LLM evaluator

We currently support evaluations with the following LLM providers: OpenAI, AzureOpenAI, Bedrock, and VertexAI. The ability to specify your own custom LLM model is coming soon.

We recommend using the latest GPT model at temperature 0, which gives the most accurate results in our evaluation tests.
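
For reference, the equivalent evaluator configuration in offline code might look like the sketch below, assuming the arize-phoenix-evals OpenAIModel wrapper, with "gpt-4o" standing in for the latest GPT model (the exact model and parameter names may differ by version).

from phoenix.evals import OpenAIModel

# "gpt-4o" is a placeholder for the latest GPT model; temperature 0 keeps
# the evaluator's labels as deterministic as possible.
eval_model = OpenAIModel(model="gpt-4o", temperature=0.0)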

Create your evaluation template

Need help creating an eval? Use Copilot to write your evaluation template for you.

You can write your own custom evaluation prompt template, or use one of our pre-tested evaluation templates. Within the template, you’ll need to pipe in the exact span attributes as prompt variables so the LLM can evaluate the span using your own data.

The example here for User Frustration uses {attributes.llm.input_messages.0.message.content}.

This comes from our OpenInference semantic conventions, which group all LLM calls into a standardized format. Here are some examples of how to pipe the right span attributes from your trace into the evaluation template:

# Get the content of the first input message
{attributes.llm.input_messages.0.message.content}

# Get the content of the system message and first user message
{attributes.llm.input_messages.0.message.content}
{attributes.llm.input_messages.1.message.content}

# Get the content of the output message
{attributes.llm.output_messages.0.message.content}

# Get the name and arguments of the first function call in the output message
{attributes.llm.output_messages.0.message.tool_calls.0.tool_call.function.name}
{attributes.llm.output_messages.0.message.tool_calls.0.tool_call.function.arguments}

# Get the prompt variables from the LLM span
{attributes.llm.prompt_template.variables.<variable_name>}

# Get manually set metadata variables
{attributes.metadata.<variable_name>}
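
Putting it together, an illustrative custom template for something like User Frustration (not one of the pre-tested templates) could pipe in the input and output messages like this:

You are evaluating whether the user sounds frustrated in the interaction below.

[BEGIN DATA]
User input: {attributes.llm.input_messages.0.message.content}
Assistant response: {attributes.llm.output_messages.0.message.content}
[END DATA]

Respond with a single word: "frustrated" or "ok".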
