Run evaluations in the UI
Create background tasks to run evaluations without code
This guide will show you how to set up an online evaluation in the Arize UI, which runs automatically on your data as your LLM application is used. You can set up LLM-as-a-judge evaluators or code evaluators, which run in a Python container against your data. The setup for code evaluators is quite similar to that for LLM evaluations.
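The exact interface for code evaluators is configured when you create the task; the snippet below is only a conceptual sketch, assuming a hypothetical evaluator that receives a span's flattened attributes as a dictionary and returns a label and score.

```python
# Conceptual sketch of a code evaluator (hypothetical interface, for illustration only).
# Assumes the task passes each span's attributes as a dict and expects a label and score back.

def evaluate_response_length(span_attributes: dict) -> dict:
    """Label a span as 'concise' or 'verbose' based on the length of the LLM output."""
    output_text = span_attributes.get("attributes.llm.output_messages.0.message.content", "")
    is_concise = len(output_text.split()) <= 150
    return {
        "label": "concise" if is_concise else "verbose",
        "score": 1 if is_concise else 0,
    }


if __name__ == "__main__":
    example_span = {"attributes.llm.output_messages.0.message.content": "The answer is 42."}
    print(evaluate_response_length(example_span))  # {'label': 'concise', 'score': 1}
```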
Navigate to the tasks page and click "new task".
Choose which traces you want to evaluate, and how often you want to run it. You can run it once to backfill eval labels on historical data, or you can run it continuously against new data within 5 minutes of its arrival in the platform.
You can edit the output rails to constrain the generated eval labels. If there are two options, the first value will be mapped to 1 and the second will be mapped to 0, which allows for aggregate scoring.
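For example, with rails of "correct" and "incorrect", every "correct" label contributes 1 and every "incorrect" contributes 0, so the aggregate score is simply the fraction of traces labeled "correct". A quick illustration of that mapping:

```python
# How two-value output rails translate into an aggregate score.
rails = ["correct", "incorrect"]          # first value -> 1, second value -> 0
label_to_score = {rails[0]: 1, rails[1]: 0}

labels = ["correct", "incorrect", "correct", "correct"]  # eval labels on four traces
scores = [label_to_score[label] for label in labels]

aggregate = sum(scores) / len(scores)
print(aggregate)  # 0.75 -> 75% of traces were labeled "correct"
```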
Once your task is successfully created, a green pop-up notification will appear! Navigate to the tracing page to view the evaluation labels generated by your newly created task.
You can apply specific filters to tag the right data, and also define a sampling rate. For instance, setting it to 50% means the task will run on half of the incoming traces.
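Conceptually, the sampling rate acts as a random gate on incoming traces; the sketch below is illustrative only, not Arize's actual sampling logic.

```python
import random

def should_evaluate(sampling_rate: float) -> bool:
    """Return True for roughly `sampling_rate` of incoming traces."""
    return random.random() < sampling_rate

# With a 50% sampling rate, about half of the traces are picked up by the task.
evaluated = sum(should_evaluate(0.5) for _ in range(10_000))
print(evaluated)  # ~5000
```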
Choose the LLM provider, model, and other parameters for your LLM evaluation. Configure your models in Arize to make them available for selection here.
Select one of our pre-built evaluators, or use Copilot to write one on your behalf! You can also write your own evaluation template or write Python code to evaluate your LLM outputs. You can use our pre-tested evaluation templates as well.
If you are writing a custom eval template, use bracket notation to reference specific dataframe fields, such as {attributes.llm.input_messages.0.message.content}, which follow the OpenInference semantic conventions. Under the hood, the task will use this template to run an evaluation function from Phoenix.
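To make the bracket notation concrete, the sketch below fills a custom template from a set of flattened span attributes. It is purely illustrative of how the placeholders resolve against OpenInference-style attribute paths, not the substitution code the task actually runs.

```python
import re

# A custom eval template referencing flattened OpenInference-style attribute paths.
TEMPLATE = """You are evaluating whether a response answers the question.

Question: {attributes.llm.input_messages.0.message.content}
Response: {attributes.llm.output_messages.0.message.content}

Respond with exactly one word: "correct" or "incorrect"."""

# Flattened span attributes, keyed by the same dotted paths used in the template.
span_attributes = {
    "attributes.llm.input_messages.0.message.content": "What is the capital of France?",
    "attributes.llm.output_messages.0.message.content": "The capital of France is Paris.",
}

def render_template(template: str, attributes: dict) -> str:
    """Replace each {dotted.path} placeholder with the matching attribute value."""
    return re.sub(
        r"\{([^{}]+)\}",
        lambda m: str(attributes.get(m.group(1), m.group(0))),
        template,
    )

print(render_template(TEMPLATE, span_attributes))
```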
Need help creating an eval? Use ✨ Copilot to easily generate your evaluation template. Simply click the Copilot button within the task modal and describe the evaluation you want to create or its goals. Copilot will handle the rest: naming the evaluation, generating a prompt with the necessary variables, and including the appropriate guardrails.