Trace and Evaluate Agents
This guide shows you how to create and evaluate agents with Arize to improve performance. We'll go through the following steps: tracing your agent, evaluating your agent, and next steps for going deeper.
High Level Concepts
In this example, we are building a customer support agent that takes an input question from a customer and decides what to do: search for order information or answer the question directly.
Let's break this down further: We are taking the user input and passing it to a router / planner prompt template. This template then decides which function call or agent skill to use. Often after calling a function, it goes back to the router template to decide on the next step, which could involve calling another agent skill.
Trace your agent
We provide auto-instrumentation for function calling and structured outputs across almost every major LLM provider.
We also support tracing for common frameworks using auto-instrumentation such as LangGraph, LlamaIndex Workflows, CrewAI, and AutoGen.
To trace a simple agent with function calling, you can use arize-otel, our convenience package for setting up OTEL tracing along with openinference auto-instrumentation, which maps LLM metadata to a standardized set of trace and span attributes.
Here's some sample code for logging all OpenAI calls to Arize.
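Below is a minimal setup sketch, assuming the `register` helper from arize-otel and the OpenInference OpenAI instrumentor; the space ID, API key, and project name are placeholders you would replace with your own values.

```python
# Register an OTEL tracer provider with Arize, then enable the
# OpenInference auto-instrumentor so OpenAI calls are traced automatically.
from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register(
    space_id="YOUR_SPACE_ID",        # placeholder: found in your Arize space settings
    api_key="YOUR_API_KEY",          # placeholder: your Arize API key
    project_name="customer-support-agent",
)

# Every OpenAI call made after this line is captured as a span and sent to Arize.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```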
Let's create the foundation for our customer support agent. We define two functions below: `product_search` and `track_package`.
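Here is an illustrative sketch of those two functions and their tool schemas. The function bodies are stubs standing in for real product-catalog and shipping lookups, and the schema fields are assumptions for this example.

```python
# Illustrative stubs for the two agent skills. In a real application these
# would query a product catalog and a shipping provider.
def product_search(query: str, category: str = "all") -> str:
    """Search the product catalog for items matching a query."""
    return f"Found 3 results for '{query}' in category '{category}'."

def track_package(order_id: str) -> str:
    """Look up the shipping status for a given order ID."""
    return f"Order {order_id} is out for delivery."

# Tool schemas passed to the OpenAI chat completions API so the model
# can decide which skill to call.
tools = [
    {
        "type": "function",
        "function": {
            "name": "product_search",
            "description": "Search the product catalog for items matching the customer's request.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search terms derived from the question."},
                    "category": {"type": "string", "description": "Optional product category, e.g. 'kitchen'."},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "track_package",
            "description": "Look up the shipping status of an existing order.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string", "description": "The customer's order ID."},
                },
                "required": ["order_id"],
            },
        },
    },
]
```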
We define a function below called `run_prompt`, which uses the OpenAI chat completions call with our functions and returns the tool calls. Notice that we set `tool_choice` to "required", so a function call will always be returned.
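A sketch of `run_prompt` is shown below; the model name and system prompt are assumptions you can swap for your own.

```python
# run_prompt sends the customer's question to the chat completions API with
# our two tools and tool_choice="required", so the model must pick a tool.
from openai import OpenAI

client = OpenAI()

def run_prompt(question: str):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; use whichever you prefer
        messages=[
            {"role": "system", "content": "You are a customer support router. Pick the best tool for the request."},
            {"role": "user", "content": question},
        ],
        tools=tools,
        tool_choice="required",
    )
    # Return the tool calls chosen by the router.
    return response.choices[0].message.tool_calls
```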
Let's test it and see if it returns the right function! If we ask a question about specific products, we'll get a response that will call the product search function.
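For example, calling the sketch above with a product question should route to `product_search`:

```python
tool_calls = run_prompt(
    "Could you tell me about your selection of eco-friendly gadgets "
    "that might be suitable for a modern kitchen setup?"
)
print(tool_calls[0].function.name)       # expected: product_search
print(tool_calls[0].function.arguments)  # e.g. {"query": "eco-friendly gadgets", "category": "kitchen"}
```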
This results in a trace that looks like the following:
Evaluate your agent
Once we have generated a set of test cases, we can create evaluators to measure performance. This way, we don't have to manually inspect every single trace to see if the LLM is doing the right thing.
Here, we define an evaluation template to judge whether the router called a function correctly. In the accompanying colab, we add two more evaluators: one for whether it selected the right function, and one for whether it filled in the arguments correctly.
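The sketch below shows what such a template can look like, assuming an LLM-as-a-judge classifier with "correct" / "incorrect" labels; the exact wording in the colab may differ.

```python
# Router evaluation template: the {question} and {response} placeholders are
# filled from the dataframe columns at evaluation time.
ROUTER_EVAL_TEMPLATE = """
You are evaluating a customer support router.

[Question]: {question}
[Tool Call]: {response}

Given the tools product_search and track_package, decide whether the router
selected a reasonable function (and arguments) for the question.
Respond with a single word: "correct" or "incorrect".
"""

ROUTER_EVAL_RAILS = ["correct", "incorrect"]
```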
We can evaluate our outputs using Phoenix's `llm_classify` function. Below, `response_df` is a placeholder dataframe that would be generated using the datasets code above.
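A minimal sketch of running the router eval, assuming `response_df` contains `question` and `response` columns matching the template placeholders:

```python
from phoenix.evals import OpenAIModel, llm_classify

router_eval_df = llm_classify(
    dataframe=response_df,              # may be named `data` in newer phoenix-evals releases
    model=OpenAIModel(model="gpt-4o"),  # assumed judge model
    template=ROUTER_EVAL_TEMPLATE,
    rails=ROUTER_EVAL_RAILS,
    provide_explanation=True,           # adds an explanation column alongside the label
)
```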
If you inspect each dataframe generated by `llm_classify`, you'll get a table of evaluation results. Below is a formatted example row from `router_eval_df` merged with the question and response.
question: Could you tell me about your selection of eco-friendly gadgets that might be suitable for a modern kitchen setup?

response: ChatCompletionMessageToolCall(name='product_search', type='function', arguments={'query': 'eco-friendly gadgets', 'category': 'kitchen'})

label: correct

explanation: The user's question is about finding eco-friendly gadgets suitable for a modern kitchen setup. The function call made is 'product_search' with the query 'eco-friendly gadgets' and category 'kitchen', which is appropriate for searching products based on the user's criteria. Therefore, the function call is correct.
Next steps
We covered very simple examples of tracing and evaluating an agent that uses function calling to route user requests and take actions in your application. As you build more capabilities into your agent, you'll need more advanced tooling to measure and improve performance.
You can do all of the following in Arize:
Manually create tool spans which log your function calling inputs, latency, and outputs (more info on tool spans, colab example).
Evaluate your agent across multiple levels, not just the router prompt (more info on evaluations).
Create experiments to track changes across models, prompts, and parameters (more info on experiments).
You can also follow our end-to-end colab example, which covers all of the above with code.