Ragas is a library that provides robust evaluation metrics for LLM applications, making it easy to assess quality. When integrated with Arize, it enriches your experiments with metrics like agent goal accuracy and tool call accuracy, helping you evaluate performance more effectively and track improvements over time.
This guide will walk you through the process of creating and evaluating agents using Ragas and Arize. We'll cover the following steps:
Build a customer support agent with the OpenAI Agents SDK
Trace agent activity to monitor interactions
Generate a benchmark dataset for performance analysis
Evaluate agent performance using Ragas
We will walk through the key steps in the documentation below. Check out the full tutorial here:
Creating the Agent
Here we've set up a basic agent that can solve math problems. We define a function tool that evaluates math equations and an agent that can call it. We'll use the Runner class to run the agent and get the final output.
from agents import Runner, function_tool
@function_tool
def solve_equation(equation: str) -> str:
    """Use Python to evaluate the math equation, instead of thinking about it yourself.

    Args:
        equation: the expression string to pass into Python's eval()
    """
    return str(eval(equation))
from agents import Agent
agent = Agent(
    name="Math Solver",
    instructions="You solve math problems by evaluating them with python and returning the result",
    tools=[solve_equation],
)
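Before wiring up evaluation, we can sanity-check the agent on a single question (the input below is just an illustration):

from agents import Runner

# Run the agent once and print its final answer.
# In a notebook, where an event loop is already running, use:
#   result = await Runner.run(agent, "What is 25 * 4 + 10?")
result = Runner.run_sync(agent, "What is 25 * 4 + 10?")
print(result.final_output)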
Evaluating the Agent
Agents can go awry for a variety of reasons. We can use Ragas to evaluate whether the agent responded correctly. Two Ragas measurements help with this:
Tool Call Accuracy - Did our agent choose the right tool with the right arguments?
Agent Goal Accuracy - Did our agent accomplish the stated goal and get to the right outcome?
We'll import both metrics from Ragas and use multi_turn_ascore(sample) to score each conversation. The AgentGoalAccuracyWithReference metric compares the agent's final output against a reference answer to judge whether the goal was accomplished. The ToolCallAccuracy metric compares the agent's tool calls against the reference tool calls to judge whether the right tool was called with the right arguments.
In the notebook, we also define the helper function conversation_to_ragas_sample which converts the agent messages into a format that Ragas can use.
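Its output is a Ragas MultiTurnSample. A hand-built sample for this agent looks roughly like the sketch below; the question, answer, and references are purely illustrative.

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall, ToolMessage

# One conversation: the user asks a question, the agent calls solve_equation,
# then answers with the result.
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What is 15 + 27?"),
        AIMessage(
            content="Let me evaluate that.",
            tool_calls=[ToolCall(name="solve_equation", args={"equation": "15 + 27"})],
        ),
        ToolMessage(content="42"),
        AIMessage(content="15 + 27 = 42"),
    ],
    reference="The answer is 42.",  # used by AgentGoalAccuracyWithReference
    reference_tool_calls=[  # used by ToolCallAccuracy
        ToolCall(name="solve_equation", args={"equation": "15 + 27"})
    ],
)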
The following code snippets define our task function and evaluators.
import asyncio
from agents import Runner
async def solve_math_problem(input):
    # Arize passes each dataset row as a dict; pull out the question text.
    if isinstance(input, dict):
        input = next(iter(input.values()))
    result = await Runner.run(agent, input)
    return {"final_output": result.final_output, "messages": result.to_input_list()}
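The evaluators are thin wrappers around the two Ragas metrics. The sketch below is illustrative rather than a verbatim copy of the notebook: it assumes conversation_to_ragas_sample returns a MultiTurnSample with reference and reference_tool_calls filled in, uses gpt-4o as the judge model, and assumes your Arize SDK version accepts async evaluator functions that receive the task output through an output parameter; check the experiments documentation for the exact evaluator signature.

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AgentGoalAccuracyWithReference, ToolCallAccuracy

# LLM judge for goal accuracy; ToolCallAccuracy needs no LLM.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
goal_accuracy_metric = AgentGoalAccuracyWithReference(llm=evaluator_llm)
tool_accuracy_metric = ToolCallAccuracy()

async def agent_goal_accuracy(output):
    # conversation_to_ragas_sample (defined in the notebook) converts the
    # agent's message list into a Ragas MultiTurnSample.
    sample = conversation_to_ragas_sample(output["messages"])
    return await goal_accuracy_metric.multi_turn_ascore(sample)

async def tool_call_accuracy(output):
    sample = conversation_to_ragas_sample(output["messages"])
    return await tool_accuracy_metric.multi_turn_ascore(sample)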
Once we've generated a dataset of questions, we can use our experiments feature to track changes across models, prompts, and parameters for the agent.
import os

import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

client = ArizeDatasetsClient(
    api_key=os.environ.get("ARIZE_API_KEY"),
    developer_key=developer_key,  # your Arize developer key
)
dataset_df = pd.DataFrame({
    "id": [f"id_{i}" for i in range(len(conversations))],
    "question": [conv["question"] for conv in conversations],
    "attributes.input.value": [conv["question"] for conv in conversations],
    "attributes.output.value": [conv["final_output"] for conv in conversations],
})
dataset = client.create_dataset(
    space_id=os.environ.get("SPACE_ID"),
    dataset_name="math-questions",
    data=dataset_df,
    dataset_type=GENERATIVE,
)
Finally, we run our experiment and view the results in Arize.
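A run looks roughly like the sketch below, reusing the task and evaluator functions defined above; the experiment name and the exact run_experiment keyword arguments are assumptions, so check the Arize experiments documentation for the signature in your SDK version.

# Run the task over every row in the dataset and score each run with the
# Ragas-backed evaluators.
experiment = client.run_experiment(
    space_id=os.environ.get("SPACE_ID"),
    dataset_name="math-questions",
    task=solve_math_problem,
    evaluators=[agent_goal_accuracy, tool_call_accuracy],
    experiment_name="agent-ragas-eval",
)

Once the run completes, the experiment and its Ragas scores appear alongside the dataset in the Arize UI.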