Online Evaluations (Tasks)
As your application grows and the volume of production logs increases, manually managing the data can become challenging. Online Evaluations (Tasks) allow you to automatically tag new spans with evaluation labels as soon as the data arrives in the platform, streamlining the process.
Task Name: Choose a unique name for the task, which will be displayed on the Task Listing Page.
Project: Select the project whose traces this task will label with evaluations.
Sampling Rate: Define the sampling rate to determine what percentage of the incoming data will be used for the task. For instance, setting it to 50% means the task will run on half of the incoming traces.
Filters: Apply specific filters to ensure you're tagging the right data. For example, you can filter for LLM spans or use criteria like token count or input value.
Schedule Run: Choose how the task should operate, either running continuously on new data or as a one-time process on historical data. The options include:
Run once on historical data: Use this to backfill evaluation labels on past traces or to test the task. You can set a date range and specify the maximum number of spans.
Run continuously on new incoming data: This option sets the task to automatically label new data within 5 minutes of its arrival in the platform. The task will additionally label the past 24 hours of data.
Provider: Choose a provider for your LLM model, such as OpenAI or Vertex AI. Only models that have been configured on the integrations page will be available.
Model: From the drop-down list, select the specific model associated with your chosen provider.
Temperature: Set the temperature for the model, which controls the randomness of its outputs. Higher temperatures produce more varied and creative responses, while lower temperatures result in more deterministic outputs. For guidance on setting this value, refer to the model provider's API documentation.
Select an Eval: From the drop-down menu, you can choose a pre-built evaluator. When selected, the Eval Column Name, Eval Template, and Output Rails fields will be automatically filled. These off-the-shelf evaluators have been thoroughly tested to ensure high performance at scale. If you need a more customized evaluation, you can modify the pre-populated template or create an entirely new one tailored to your application.
Eval Column Name: Provide a unique column name for the evaluator in plain text. Ensure that this name is distinct from those of other evaluators across all tasks.
Eval Template: Write an LLM template that will guide the LLM in evaluating the data. Use bracket notation to reference specific dataframe fields, such as {attributes.llm.input_messages.0.message.content}, which follow the OpenInference semantic conventions for organizing LLM calls. The task will use this template to run the llm_classify function from Phoenix under the hood.
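For illustration, a hallucination-style template might look like the sketch below. The attribute paths in curly braces are assumptions based on the OpenInference conventions; substitute whichever span fields your instrumentation actually records.

```python
# Sketch of an eval template, not a tested off-the-shelf evaluator.
# The {attributes...} placeholders are illustrative OpenInference span fields.
HALLUCINATION_TEMPLATE = """
You are judging whether an LLM response is supported by the user's input.

[Input]: {attributes.llm.input_messages.0.message.content}
[Response]: {attributes.llm.output_messages.0.message.content}

Respond with a single word: "factual" if the response is supported by the
input, or "hallucinated" if it is not.
"""
```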
Output Rails: Define an array of strings representing the possible output classes the model can predict. If there are two options, the first value will be mapped to 1 and the second to 0, which allows for aggregate scoring. If more than two output rails are provided, aggregate scores will not be available.
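Conceptually, the two-rail case reduces each label to a 0/1 score so the labels can be averaged; a rough sketch of that mapping (not the platform's actual implementation) is shown below.

```python
# Illustrative only: two output rails reduced to binary scores for aggregation.
rails = ["factual", "hallucinated"]  # first rail -> 1, second rail -> 0

labels = ["factual", "hallucinated", "factual", "factual"]  # example judge outputs
scores = [1 if label == rails[0] else 0 for label in labels]

aggregate = sum(scores) / len(scores)  # 0.75 for this toy set of labels
print(scores, aggregate)
```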
Generate Explanations: Set this to True if you want the LLM Judge to provide explanations for the labels it assigns. For example, it can explain why it labeled an LLM output as a "hallucination" or "factual."
Function Calling: Set this to True to enable function calling, if the feature is available for the selected LLM. With function calling, the LLM is instructed to return its output as a structured JSON object, making it easier to parse and handle. This setting corresponds to the use_function_calling_if_available parameter in the llm_classify API.
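To see how these fields map onto the underlying Phoenix call, here is a minimal sketch of llm_classify. The import path, OpenAIModel arguments, and dataframe contents are assumptions (they vary by Phoenix version), while rails, provide_explanation, and use_function_calling_if_available mirror the settings described above.

```python
import pandas as pd

# Import path assumes a recent Phoenix release; older versions used phoenix.experimental.evals.
from phoenix.evals import OpenAIModel, llm_classify

# A compact template reusing the curly-brace field notation described above.
TEMPLATE = (
    "[Input]: {attributes.llm.input_messages.0.message.content}\n"
    "[Response]: {attributes.llm.output_messages.0.message.content}\n"
    'Respond with a single word: "factual" or "hallucinated".'
)

# Toy dataframe standing in for the sampled spans; column names must match
# the placeholders referenced in the template.
spans_df = pd.DataFrame(
    {
        "attributes.llm.input_messages.0.message.content": ["What is the capital of France?"],
        "attributes.llm.output_messages.0.message.content": ["The capital of France is Paris."],
    }
)

results = llm_classify(
    dataframe=spans_df,
    model=OpenAIModel(model="gpt-4o", temperature=0.0),  # provider, model, and temperature
    template=TEMPLATE,                                   # eval template
    rails=["factual", "hallucinated"],                   # output rails
    provide_explanation=True,                            # "Generate Explanations"
    use_function_calling_if_available=True,              # "Function Calling"
)
print(results[["label", "explanation"]])
```

Running a sketch like this requires credentials for the chosen provider (for example, an OpenAI API key); in the platform, the task instead uses the model you configured on the integrations page.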
Once your task is successfully created, a green pop-up notification will confirm the task creation and provide a link to the LLM tracing page. Navigate to the tracing page to view the evaluation labels generated by your newly created task. The name of your new task will also appear on the Task Listing Page, where you can refer to the task's details and logs.
From the Tasks page, click on a specific task to access its details and logs. The logs provide valuable information that can assist with debugging if any errors occur, allowing you to identify and resolve issues efficiently.
Need help creating an eval? Use ✨Copilot to easily generate your evaluation template. Simply click the Copilot button within the task modal, and describe the evaluation you want to create or its goals. Copilot will handle the rest—naming the evaluation, generating a prompt with the necessary variables, and including the appropriate guardrails.