Offline Evaluations
Evaluations are essential to understanding how well your model is performing in real-world scenarios, allowing you to identify strengths, weaknesses, and areas for improvement.
Offline evaluations are run as code and then sent back to Arize using the Arize Python SDK.
This guide assumes you already have traces in Arize and are looking to run an evaluation to measure your application's performance.
Once you have traces in Arize, you can visit the LLM Tracing tab to see your traces and export them for use in code. By clicking the export button, you can get the boilerplate code to copy and paste into your evaluator.
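The exported boilerplate typically resembles the sketch below, which assumes the ArizeExportClient from the Arize Python SDK; the space ID, model ID, and time range placeholders are illustrative and come from the export dialog.

```python
from datetime import datetime, timedelta

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

# Minimal sketch of the exported boilerplate. The placeholder values below are
# assumptions -- copy the real values from the export dialog in the Arize UI.
export_client = ArizeExportClient(api_key="YOUR_ARIZE_API_KEY")

traces_dataframe = export_client.export_model_to_df(
    space_id="YOUR_SPACE_ID",
    model_id="YOUR_MODEL_ID",          # the model/project your traces were logged under
    environment=Environments.TRACING,  # export spans rather than production predictions
    start_time=datetime.now() - timedelta(days=7),
    end_time=datetime.now(),
)
```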
Ensure your OpenAI API key is set up correctly for your OpenAI model.
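For example, you might load the key from an environment variable (the variable name here is the usual OpenAI convention):

```python
import os
from getpass import getpass

# Assumes the conventional OPENAI_API_KEY environment variable; prompt for it if unset.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```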
You can use the code below to check which attributes are in the traces in your dataframe.
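A quick way to do this, assuming the export produced a pandas DataFrame named traces_dataframe, is to list its columns:

```python
# List every column on the exported spans so you can see which fields
# hold your prompt inputs and LLM outputs.
print(traces_dataframe.columns.tolist())

# Optionally narrow it down to the span attribute columns only.
print([col for col in traces_dataframe.columns if col.startswith("attributes.")])
```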
Use the code below to set the input and output variables needed for the prompt above.
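The sketch below assumes OpenInference-style attribute columns (attributes.input.value and attributes.output.value); substitute the attribute names you found above if your provider uses different ones.

```python
# Map the trace attributes onto the {input} and {output} template variables.
# The attribute names below are common OpenInference ones; swap in the columns
# you found for your provider if they differ.
traces_dataframe["input"] = traces_dataframe["attributes.input.value"]
traces_dataframe["output"] = traces_dataframe["attributes.output.value"]
```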
The evals_dataframe requires four columns, which should be auto-generated for you based on the evaluation you ran using Phoenix. The <eval_name> must be alphanumeric and cannot contain hyphens or spaces.
eval.<eval_name>.label
eval.<eval_name>.score
eval.<eval_name>.explanation
context.span_id
An example evaluation data dictionary would look like:
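The snippet below is a minimal sketch, assuming an evaluation named sentiment and an illustrative span ID; your real values come from the evaluation run and the exported traces.

```python
import pandas as pd

# Hypothetical values for illustration only -- the span ID must match a real
# span from your exported traces, and "sentiment" is the example <eval_name>.
evals_data = {
    "context.span_id": ["74bdfb83c92fce34"],
    "eval.sentiment.label": ["positive"],
    "eval.sentiment.score": [1.0],
    "eval.sentiment.explanation": ["The response is upbeat and encouraging."],
}
evals_dataframe = pd.DataFrame(evals_data)
```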
Here is sample code to run your evaluation and log it in real-time in the Arize platform.
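The sketch below strings the pieces together. It assumes the phoenix.evals llm_classify helper, the positivity/negativity template described in this guide, and the arize.pandas.logger Client; method and argument names can vary between SDK versions, so treat it as a starting point rather than a drop-in script.

```python
from arize.pandas.logger import Client
from phoenix.evals import OpenAIModel, llm_classify

# 1. Run the LLM-as-a-judge evaluation over the exported traces.
#    SENTIMENT_EVAL_TEMPLATE is the {input}/{output} template shown in this guide.
evals_dataframe = llm_classify(
    dataframe=traces_dataframe,
    template=SENTIMENT_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),  # assumption: any supported judge model works
    rails=["positive", "negative"],
    provide_explanation=True,
)

# 2. Shape the results into the eval.<eval_name>.* columns Arize expects.
#    The 1/0 score mapping is an illustrative choice, not a requirement.
evals_dataframe["eval.sentiment.label"] = evals_dataframe["label"]
evals_dataframe["eval.sentiment.score"] = evals_dataframe["label"].map({"positive": 1, "negative": 0})
evals_dataframe["eval.sentiment.explanation"] = evals_dataframe["explanation"]
evals_dataframe["context.span_id"] = traces_dataframe["context.span_id"].values
evals_dataframe = evals_dataframe[
    ["context.span_id", "eval.sentiment.label", "eval.sentiment.score", "eval.sentiment.explanation"]
]

# 3. Log the evaluations back to the same model/project in Arize.
arize_client = Client(space_id="YOUR_SPACE_ID", api_key="YOUR_ARIZE_API_KEY")
arize_client.log_evaluations(dataframe=evals_dataframe, model_id="YOUR_MODEL_ID")
```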
To add evaluations, you can set up an online evaluation task to run automatically, or you can follow the steps below to generate evaluations and log them to Arize:
Create a prompt template for the LLM to judge the quality of your responses. You can utilize any of the pre-built evaluation templates, or you can create your own. Below is an example that judges the positivity or negativity of the LLM output.
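The template below is an illustrative sketch, not one of the pre-built templates; adjust the wording and allowed labels to your use case.

```python
# An illustrative sentiment template. The {input} and {output} placeholders
# are filled from columns of the traces dataframe.
SENTIMENT_EVAL_TEMPLATE = """You are evaluating the tone of an AI assistant's response.

[Question]: {input}
[Response]: {output}

Determine whether the response above is positive or negative in tone.
Your answer must be a single word, either "positive" or "negative".
"""
```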
Notice the variables in brackets for {input} and {output} above. You will need to set those variables appropriately for the dataframe so you can run your custom template. We use OpenInference as a set of conventions (complementary to OpenTelemetry) to trace AI applications. This means that, depending on the provider you are using, the attributes of the trace will be different.
Use the llm_classify function to run the evaluation using your custom template. You will be using the dataframe from the traces you generated above.
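A minimal sketch of that call, assuming the phoenix.evals package and an OpenAI judge model; the rails and explanation options are optional but useful for reviewing results.

```python
from phoenix.evals import OpenAIModel, llm_classify

# Classify each row of the traces dataframe with the custom template.
evals_dataframe = llm_classify(
    dataframe=traces_dataframe,            # the traces exported from Arize
    template=SENTIMENT_EVAL_TEMPLATE,      # the custom template defined above
    model=OpenAIModel(model="gpt-4o"),     # assumption: any supported judge model works
    rails=["positive", "negative"],        # constrain the label to the expected values
    provide_explanation=True,              # also return a written justification
)

# Spot-check the judge's labels and explanations before logging them.
print(evals_dataframe[["label", "explanation"]].head())
```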
If you'd like more information, see our detailed evaluation guides. You can also use our pre-tested evaluators for hallucination, toxicity, retrieval, and more.
Use the log_evaluations function in our Python SDK to attach evaluations you've run to traces. The code below assumes that you have already completed an evaluation run and have the evals_dataframe object. It also assumes you have a traces_dataframe object to get the span_id values you need to attach the evals.
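Here is a sketch of that logging step, assuming the arize.pandas.logger Client and its log_evaluations method; credential and argument names can differ slightly between SDK versions, so check the SDK reference for your installed version.

```python
import os

from arize.pandas.logger import Client

# Assumption: Arize credentials are available as environment variables.
arize_client = Client(
    space_id=os.environ["ARIZE_SPACE_ID"],
    api_key=os.environ["ARIZE_API_KEY"],
)

# Attach the span IDs from the exported traces so each evaluation row
# points at the span it was computed from.
evals_dataframe["context.span_id"] = traces_dataframe["context.span_id"].values

# Send the evaluations back to the same model/project the traces came from.
response = arize_client.log_evaluations(
    dataframe=evals_dataframe,
    model_id="YOUR_MODEL_ID",  # placeholder -- use your model/project name in Arize
)
```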