Agent Evaluation
Evaluating any AI application is a challenge. Evaluating an agent is even more difficult. Agents present a unique set of evaluation pitfalls to navigate. For one, agents can take inefficient paths and still reach the right solution. How do you know whether they took an optimal path? For another, bad responses upstream can lead to strange responses downstream. How do you pinpoint where a problem originated?
This page will walk you through a framework for navigating these pitfalls.
An agent is characterized by what it knows about the world, the set of actions it can perform, and the path it takes to accomplish its task. To evaluate an agent, we must evaluate each of these components.
We've built evaluation templates for each of these components.
You can evaluate the individual skills and responses using standard LLM evaluation strategies, such as LLM-as-a-judge evaluations, comparisons against ground truth, or code-based checks.
Read on for a breakdown of each component.
Routers are one of the most common components of agents. While not every agent has a specific router node, function, or step, all agents have some method that chooses the next step to take. Routers and routing logic can be powered by intent classifiers, rules-based code, or, most often, LLMs that use function calling.
To evaluate a router or router logic, you need to check:
Whether the router chose the correct next step to take, or function to call.
Whether the router extracted the correct parameters to pass on to that next step.
Whether the router properly handles edge cases, such as missing context, missing parameters, or cases where multiple functions should be called concurrently.
Take a travel agent router, for example.
User Input: Help me find a flight from SF on 5/15
Router function call: flight-search(date="5/15", departure_city="SF", destination_city="")
Function choice: ✅
Parameter extraction: ❌
The router chose the correct function, but its parameter extraction mishandled the missing destination, one of the edge cases flagged above.
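To make this concrete, here is a minimal, code-based sketch of a router eval. The ToolCall dataclass and eval_router helper are hypothetical, not part of any particular library; in practice you would compare the router's emitted tool call against a labeled example.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    params: dict = field(default_factory=dict)


def eval_router(actual: ToolCall, expected: ToolCall) -> dict:
    """Score function choice and parameter extraction separately."""
    correct_function = actual.name == expected.name
    # Every expected parameter must be present with the expected value...
    expected_params_match = all(
        actual.params.get(key) == value for key, value in expected.params.items()
    )
    # ...and the router should not emit empty placeholder values.
    no_empty_params = all(value not in ("", None) for value in actual.params.values())
    return {
        "function_choice": correct_function,
        "parameter_extraction": correct_function and expected_params_match and no_empty_params,
    }


# The travel-agent example from above: right function, mishandled parameters.
actual = ToolCall(
    "flight-search",
    {"date": "5/15", "departure_city": "SF", "destination_city": ""},
)
expected = ToolCall("flight-search", {"date": "5/15", "departure_city": "SF"})
print(eval_router(actual, expected))
# {'function_choice': True, 'parameter_extraction': False}
```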
For more complex agents, it may be necessary to first have the agent plan out its intended path ahead of time. This approach can help avoid unnecessary tool calls, or endless repeating loops as the agent bounces between the same steps.
For agents that use this approach, a common evaluation metric is the quality of the plan generated by the agent. This "quality" metric can either take the form of a single overall evaluation, or a set of smaller ones, but either way, should answer:
Does the plan include only skills that are valid?
Are the available skills sufficient to accomplish this task?
Will the plan accomplish this task, given the available skills?
Is this the shortest plan that accomplishes this task?
Given the more qualitative nature of these evaluations, they are usually performed by an LLM Judge.
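A simple way to implement this is an LLM-as-a-judge prompt that asks these questions directly. The sketch below is illustrative: call_llm stands in for whatever model client you use, and the prompt wording and parsing are assumptions, not a fixed template.

```python
# LLM-as-a-judge sketch for plan quality. The prompt mirrors the questions
# above; `call_llm` is a placeholder for your model client.

PLAN_JUDGE_PROMPT = """\
You are evaluating the plan an AI agent generated for a user's task.

Task: {task}
Available skills: {skills}
Generated plan: {plan}

Answer each question with YES or NO, one per line:
1. Does the plan include only skills that are valid (listed above)?
2. Are the available skills sufficient to accomplish the task?
3. Will the plan accomplish the task given the available skills?
4. Is this the shortest plan that accomplishes the task?
"""


def evaluate_plan(task: str, skills: list[str], plan: list[str], call_llm) -> dict:
    """Run the judge prompt and parse the YES/NO answers into per-question scores."""
    prompt = PLAN_JUDGE_PROMPT.format(
        task=task, skills=", ".join(skills), plan=" -> ".join(plan)
    )
    answers = [line.strip().upper() for line in call_llm(prompt).splitlines() if line.strip()]
    labels = ["valid_skills_only", "skills_sufficient", "plan_accomplishes_task", "shortest_plan"]
    return {label: answer.endswith("YES") for label, answer in zip(labels, answers)}
```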
Skills are the individual logic blocks, workflows, or chains that an agent can call on, for example a RAG retriever skill, or a skill to call a specific API. Skills may be written and defined by the agent's designer; however, increasingly, skills may be outside services connected to via protocols like Anthropic's MCP.
You can evaluate skills using standard LLM or code evaluations. Since you are evaluating the router separately, you can evaluate skills "in a vacuum": assume the skill was chosen correctly and its parameters were properly defined, and focus on whether the skill itself performed correctly.
Common skill evals depend on the skill: for example, relevance of retrieved documents for a RAG retriever skill, or correctness of a generated answer or API response. Skills can be evaluated by LLM judges, by comparing against ground truth, or in code, depending on the skill.
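As an example, a code-based skill eval run "in a vacuum" might look like the sketch below. The currency_convert_skill, the test cases, and the scoring function are all hypothetical.

```python
# Evaluate a single skill in isolation: feed it fixed, known-good inputs
# (as if the router chose it correctly) and score only the skill's own output.

def eval_skill(skill, test_cases, score_fn) -> float:
    """Run a skill against labeled test cases and return the fraction that pass."""
    results = []
    for case in test_cases:
        output = skill(**case["input"])        # assume routing and parameters were correct
        results.append(score_fn(output, case["expected"]))
    return sum(results) / len(results)


# Example: a code-based correctness check against ground truth.
test_cases = [
    {"input": {"amount": 100, "from_ccy": "USD", "to_ccy": "EUR"}, "expected": 92.0},
]
# accuracy = eval_skill(currency_convert_skill, test_cases,
#                       lambda out, exp: abs(out - exp) < 0.5)
```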
Agent memory is used to store state between the different components of an agent. You may store retrieved context, config variables, or any other information in agent memory. However, the most common information stored in agent memory is a log of the previous steps the agent has taken, typically formatted as LLM messages.
These messages form the best data to evaluate the agent's path.
The main questions that path evaluations try to answer are:
Did the agent go off the rails and onto the wrong pathway?
Did it get stuck in an infinite loop?
Did it choose the correct sequence of steps within the overall pathway for a given action?
One type of path evaluation measures agent convergence: a numerical score equal to the length of the optimal path divided by the average path length for similar queries. The closer the score is to 1, the better.
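A minimal sketch of computing convergence from logged paths follows. Treating the shortest observed run as the optimal path is a simplifying assumption; you could instead label the optimal path by hand.

```python
# Convergence score for a set of similar queries: length of the optimal path
# divided by the average path length actually taken. Paths here are just
# lists of step names pulled from agent memory.

def convergence_score(paths: list[list[str]]) -> float:
    """Optimal path length / average path length; closer to 1 is better."""
    lengths = [len(path) for path in paths]
    optimal = min(lengths)                     # shortest observed run stands in for the optimal path
    average = sum(lengths) / len(lengths)
    return optimal / average


runs = [
    ["route", "search_flights", "respond"],
    ["route", "search_flights", "search_flights", "respond"],
    ["route", "lookup_weather", "search_flights", "respond"],
]
print(convergence_score(runs))  # 3 / ((3 + 4 + 4) / 3) ≈ 0.82
```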
Reflection allows you to evaluate your agents at runtime to enhance their quality. Before declaring a task complete, a plan devised, or an answer generated, ask the agent to reflect on the outcome. If the task isn't accomplished to the standard you want, retry.
See our Agent Reflection evaluation template for a more specific example.
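As a sketch, a runtime reflection loop might look like the following. Here run_agent, call_llm, the prompt wording, and the COMPLETE/RETRY convention are all illustrative assumptions rather than a prescribed implementation.

```python
# Runtime reflection: before returning, ask a judge whether the task is
# actually complete, and retry with the critique as feedback if not.

REFLECTION_PROMPT = """\
Task: {task}
Agent's answer: {answer}

Was the task fully accomplished to a high standard? Reply COMPLETE or RETRY,
followed by a one-sentence reason.
"""


def run_with_reflection(task: str, run_agent, call_llm, max_attempts: int = 3) -> str:
    feedback = ""
    answer = ""
    for _ in range(max_attempts):
        answer = run_agent(task, feedback)      # feedback from the prior reflection, if any
        verdict = call_llm(REFLECTION_PROMPT.format(task=task, answer=answer))
        if verdict.strip().upper().startswith("COMPLETE"):
            return answer
        feedback = verdict                      # retry, steering with the critique
    return answer                               # give up after max_attempts
```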
Through a combination of the evaluations above, you can get a far more accurate picture of how your agent is performing.