Ragas
Phoenix and Ragas work hand-in-hand
Ragas is a library that provides robust evaluation metrics for LLM applications, making it easy to assess quality. When integrated with Phoenix, it enriches your experiments with metrics like goal accuracy and tool call accuracy, helping you evaluate performance more effectively and track improvements over time.
This guide will walk you through the process of creating and evaluating agents using Ragas and Arize Phoenix. We'll cover the following steps:
Build a customer support agent with the OpenAI Agents SDK
Trace agent activity to monitor interactions
Generate a benchmark dataset for performance analysis
Evaluate agent performance using Ragas
We will walk through the key steps in the documentation below. Check out the full tutorial here:
Here we've set up a basic agent that can solve math problems. We have a function tool that can solve math equations, and an agent that can use this tool. We'll use the Runner class to run the agent and get the final output.
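As a rough sketch, that setup might look something like the code below, assuming the openai-agents package. The tool body, agent name, and instructions are illustrative placeholders rather than the tutorial's exact code.

```python
from agents import Agent, Runner, function_tool

@function_tool
def solve_equation(equation: str) -> str:
    """Solve a simple arithmetic expression and return the result as a string."""
    # Illustrative only: eval keeps the sketch short; the tutorial may use a sturdier solver.
    return str(eval(equation))

agent = Agent(
    name="Math Solver",
    instructions="Use the solve_equation tool to answer math questions.",
    tools=[solve_equation],
)

# Run the agent synchronously and read its final answer.
result = Runner.run_sync(agent, "What is 15 * 12?")
print(result.final_output)
```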
Agents can go awry for a variety of reasons. We can use Ragas to evaluate whether the agent responded correctly. Two Ragas measurements help with this:
Tool Call Accuracy - Did our agent choose the right tool with the right arguments?
Agent Goal Accuracy - Did our agent accomplish the stated goal and get to the right outcome?
We'll import both metrics from Ragas and use multi_turn_ascore(sample) to get the results. The AgentGoalAccuracyWithReference metric compares the agent's final output to the reference to see whether the goal was accomplished. The ToolCallAccuracy metric compares the agent's tool calls to the reference tool calls to see whether the right tools were called with the right arguments.
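As a minimal sketch, scoring a single conversation with these metrics might look like the following. The hand-built sample and the gpt-4o-mini judge are assumptions for illustration; in practice the sample comes from a real agent trace.

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage
from ragas.metrics import AgentGoalAccuracyWithReference, ToolCallAccuracy

# Goal accuracy needs an LLM judge; tool call accuracy is computed by comparing tool calls.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

goal_accuracy = AgentGoalAccuracyWithReference()
goal_accuracy.llm = evaluator_llm
tool_accuracy = ToolCallAccuracy()

# A hand-built sample for illustration.
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What is 15 * 12?"),
        AIMessage(
            content="Let me calculate that.",
            tool_calls=[ToolCall(name="solve_equation", args={"equation": "15 * 12"})],
        ),
        ToolMessage(content="180"),
        AIMessage(content="15 * 12 is 180."),
    ],
    reference="15 * 12 is 180.",
    reference_tool_calls=[ToolCall(name="solve_equation", args={"equation": "15 * 12"})],
)

async def main():
    print("agent goal accuracy:", await goal_accuracy.multi_turn_ascore(sample))
    print("tool call accuracy:", await tool_accuracy.multi_turn_ascore(sample))

asyncio.run(main())
```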
In the notebook, we also define the helper function conversation_to_ragas_sample, which converts the agent messages into a format that Ragas can use.
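The exact implementation lives in the notebook; a simplified sketch of what such a conversion could look like is below, assuming chat-style message dicts and the solve_equation tool from the earlier sketch.

```python
import json

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage

def conversation_to_ragas_sample(messages, reference_equation=None, reference_answer=None):
    """Simplified sketch: map chat-style message dicts onto a Ragas MultiTurnSample."""
    ragas_messages = []
    for msg in messages:
        role = msg.get("role")
        if role == "user":
            ragas_messages.append(HumanMessage(content=msg["content"]))
        elif role == "assistant":
            tool_calls = [
                ToolCall(
                    name=tc["function"]["name"],
                    args=json.loads(tc["function"]["arguments"]),
                )
                for tc in msg.get("tool_calls") or []
            ]
            ragas_messages.append(AIMessage(content=msg.get("content") or "", tool_calls=tool_calls))
        elif role == "tool":
            ragas_messages.append(ToolMessage(content=msg["content"]))

    return MultiTurnSample(
        user_input=ragas_messages,
        reference=reference_answer,
        reference_tool_calls=(
            [ToolCall(name="solve_equation", args={"equation": reference_equation})]
            if reference_equation
            else None
        ),
    )
```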
The following code snippets define our task function and evaluators.
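As a hedged sketch of what these could look like, the task below runs the agent on a dataset row and the evaluator scores the result with Ragas. The dataset columns (question, final_answer) and function names are hypothetical and reuse the sketches above.

```python
from agents import Runner

def run_agent_task(input):
    """Task: run the math agent on one dataset row (hypothetical "question" column)."""
    result = Runner.run_sync(agent, input["question"])
    return {
        "final_output": result.final_output,
        # Message history, later converted for Ragas with conversation_to_ragas_sample.
        "messages": result.to_input_list(),
    }

async def agent_goal_accuracy(output, expected):
    """Evaluator: did the agent reach the expected outcome ("final_answer" column)?"""
    sample = conversation_to_ragas_sample(
        output["messages"], reference_answer=expected["final_answer"]
    )
    scorer = AgentGoalAccuracyWithReference()
    scorer.llm = evaluator_llm  # the LLM judge wrapped earlier
    return await scorer.multi_turn_ascore(sample)
```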
Once we've generated a dataset of questions, we can use our experiments feature to track changes across models, prompts, and parameters for the agent.
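For example, a small benchmark dataset could be uploaded to Phoenix like this; the dataset name and the question/final_answer columns are the same hypothetical ones used above.

```python
import pandas as pd
import phoenix as px

# A tiny hand-written benchmark; in practice this would be generated or curated.
df = pd.DataFrame(
    {
        "question": ["What is 15 * 12?", "What is 100 / 4?"],
        "final_answer": ["180", "25"],
    }
)

dataset = px.Client().upload_dataset(
    dataset_name="math-agent-benchmark",
    dataframe=df,
    input_keys=["question"],
    output_keys=["final_answer"],
)
```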
Finally, we run our experiment and view the results in Phoenix.
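A sketch of kicking off the run, reusing the dataset, task, and evaluator sketches above (the experiment name is arbitrary):

```python
from phoenix.experiments import run_experiment

experiment = run_experiment(
    dataset,
    task=run_agent_task,
    evaluators=[agent_goal_accuracy],
    experiment_name="math-agent-baseline",
)
```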