Run evaluations in the UI
This guide shows you how to set up an online evaluation in the Arize UI, which runs automatically on your data as your LLM application is used. You can set up LLM-as-a-judge evaluators or code evaluators, which run in a Python container against your data.
Follow this guide to set up code evaluations; the setup is very similar to LLM evaluations.
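To illustrate the difference, a code evaluator is plain Python that scores a span deterministically, with no LLM call. The function below is a hypothetical sketch only; the exact interface Arize expects is defined in the code evaluations guide, and the function name, arguments, and return shape here are illustrative assumptions.

```python
# Hypothetical sketch of the kind of logic a code evaluator runs in its Python
# container: a deterministic check on a span's output, no LLM involved.
import json

def evaluate_output_is_valid_json(output_text: str) -> dict:
    """Label a span's output as valid or invalid JSON and return a score."""
    try:
        json.loads(output_text)
        return {"label": "valid_json", "score": 1}
    except (TypeError, json.JSONDecodeError):
        return {"label": "invalid_json", "score": 0}

print(evaluate_output_is_valid_json('{"answer": "Paris"}'))  # {'label': 'valid_json', 'score': 1}
```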
Navigate to the tasks page and click "new task".
Choose which traces you want to evaluate and how often the task should run. You can run it once to backfill historical data, or run it continuously against new incoming data.
Task Name: Choose a unique name for the task, which will be displayed on the Task Listing Page.
Project: Select the project whose traces this task will label with evaluations.
Sampling Rate: Define the sampling rate to determine what percentage of the incoming data will be used for the task. For instance, setting it to 50% means the task will run on half of the incoming traces.
Filters: Apply specific filters to ensure you're tagging the right data. For example, you can filter for LLM spans or use criteria like token count or input value.
Schedule Run: Choose how the task should operate, either running continuously on new data or as a one-time process on historical data. The options include:
Run once on historical data: Use this to backfill evaluation labels on past traces or to test the task. You can set a date range and specify the maximum number of spans.
Run continuously on new incoming data: This option sets the task to automatically label new data within 5 minutes of its arrival in the platform. The task will additionally label the past 24 hours of data.
Choose the LLM provider, model, and other parameters for your LLM evaluation.
Provider: Choose a provider for your LLM model, such as OpenAI or Vertex AI. Only models that have been configured on the integrations page will be available.
Model: From the drop-down list, select the specific model associated with your chosen provider.
Temperature: Set the temperature for the model, which controls the randomness of its outputs. Higher temperatures produce more varied and creative responses, while lower temperatures result in more deterministic outputs. For guidance on setting this value, refer to the model provider's API documentation.
Select one of our pre-built evaluators, or use Copilot to write one on your behalf! You can also write your own evaluation template or write Python code to evaluate your LLM outputs.
Select an Eval: From the drop-down menu, you can choose a pre-built evaluator. When selected, the Eval Column Name, Eval Template, and Output Rails fields will be automatically filled. These off-the-shelf evaluators have been thoroughly tested to ensure high performance at scale. If you need a more customized evaluation, you can modify the pre-populated template or create an entirely new one tailored to your application.
Eval Column Name: Provide a unique column name for the evaluator in plaintext. Ensure that this name is distinct from other evaluators across all tasks.
Eval Template: Write an LLM prompt template that guides the LLM in evaluating the data. Use bracket notation to reference specific dataframe fields, such as {attributes.llm.input_messages.0.message.content}, which follow the OpenInference semantic conventions for organizing LLM calls. The task uses this template to run the llm_classify function from Phoenix under the hood; see the sketch after this list.
Output Rails: Define an array of strings representing the possible output classes the model can predict. If there are two options, the first value will be mapped to 1 and the second to 0, which allows for aggregate scoring. If more than two output rails are provided, aggregate scores will not be available.
Generate Explanations: Set this to True if you want the LLM Judge to provide explanations for the labels it assigns. For example, it can explain why it labeled an LLM output as a "hallucination" or "factual."
Function Calling: Set this to True to enable function calling, if the feature is available for the selected LLM. With function calling, the LLM is instructed to return its output as a structured JSON object, making it easier to parse and handle. This setting corresponds to the use_function_calling_if_available parameter in the llm_classify API.
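For reference, the settings above map roughly onto a call to Phoenix's llm_classify. The sketch below is an approximation, not the task's actual implementation: the dataframe columns, model name, and template text are illustrative, and parameter names may vary slightly across Phoenix versions.

```python
# Approximate equivalent of what the task runs under the hood, using the open
# source Phoenix evals library. Column names follow OpenInference conventions.
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Illustrative spans; in the UI, the task pulls these from your project's traces.
df = pd.DataFrame(
    {
        "attributes.llm.input_messages.0.message.content": ["What is the capital of France?"],
        "attributes.llm.output_messages.0.message.content": ["The capital of France is Paris."],
    }
)

# Eval Template: bracket notation references dataframe fields.
TEMPLATE = """You are checking whether the answer is supported by the question context.
Question: {attributes.llm.input_messages.0.message.content}
Answer: {attributes.llm.output_messages.0.message.content}
Respond with a single word: "factual" or "hallucinated"."""

# Output Rails: with two rails, the first maps to 1 and the second to 0.
RAILS = ["factual", "hallucinated"]

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini", temperature=0.0),  # Provider / Model / Temperature
    template=TEMPLATE,
    rails=RAILS,
    provide_explanation=True,                 # Generate Explanations
    use_function_calling_if_available=True,   # Function Calling
)
print(results[["label", "explanation"]])
```

The returned dataframe contains one label (and, when explanations are enabled, one explanation) per span, which is what the task writes back to your traces as evaluation labels.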
Once your task is successfully created, a green pop-up notification will appear! Navigate to the tracing page to view the evaluation labels generated by your newly created task.
Need help creating an eval? Use ✨Copilot to easily generate your evaluation template. Simply click the Copilot button within the task modal, and describe the evaluation you want to create or its goals. Copilot will handle the rest—naming the evaluation, generating a prompt with the necessary variables, and including the appropriate guardrails.