Trace and Evaluate Agents

Follow an end-to-end example of tracing and evaluating an agent

This guide shows you how to create and evaluate agents with Arize to improve performance. We'll walk through the high-level concepts, trace a simple function-calling agent, and then evaluate its router.

High Level Concepts

In this example, we are building a customer support agent that takes a question from a customer and decides what to do: search for order information or answer the question directly.

Let's break this down further: the user input is passed to a router / planner prompt template, which decides which function call or agent skill to use. After a function is called, control often returns to the router template to decide on the next step, which may involve calling another agent skill.
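To make that loop concrete, here is a minimal, hypothetical sketch of the pattern; route_request, execute_skill, and the skill names are illustrative stand-ins, not part of Arize or any specific framework.

# Hypothetical sketch of the router loop described above. A real router
# would call an LLM with the router / planner prompt template; here the
# decision logic is a simple stand-in.
def route_request(question, history):
    if "order" in question.lower() and not history:
        return {"skill": "search_order_info", "args": {"question": question}}
    return {"skill": "answer_directly", "args": {"question": question}}

def execute_skill(skill, args):
    # Placeholder for a function call / agent skill.
    return f"ran {skill} with {args}"

def run_agent(question):
    history = []
    while True:
        decision = route_request(question, history)
        if decision["skill"] == "answer_directly":
            return f"Answer for: {question}"
        # Execute the chosen skill, then return to the router for the next step.
        history.append(execute_skill(decision["skill"], decision["args"]))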

Trace your agent

We provide auto-instrumentation for function calling and structured outputs across almost every major LLM provider.

We also support tracing for common frameworks using auto-instrumentation such as LangGraph, LlamaIndex Workflows, CrewAI, and AutoGen.

To trace a simple agent with function calling, you can use arize-otel, our convenience package for setting up OTEL tracing along with openinference auto-instrumentation, which maps LLM metadata to a standardized set of trace and span attributes.

Here's sample code that logs all OpenAI calls to Arize.

from getpass import getpass

# Import open-telemetry dependencies
from arize.otel import register

# Setup OTEL via our convenience function
tracer_provider = register(
    space_id=getpass("Enter your space ID"),
    api_key=getpass("Enter your API Key"),
    project_name="agents-cookbook",
)
# Import the automatic instrumentor from OpenInference
from openinference.instrumentation.openai import OpenAIInstrumentor

# Finish automatic instrumentation
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Let's create the foundation for our customer support agent. We define two tools below: product_search and track_package.

tools = [
    {
        "type": "function",
        "function": {
            "name": "product_search",
            "description": "Search for products based on criteria.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query string.",
                    },
                    "category": {
                        "type": "string",
                        "description": "The category to filter the search.",
                    },
                    "min_price": {
                        "type": "number",
                        "description": "The minimum price of the products to search.",
                        "default": 0,
                    },
                    "max_price": {
                        "type": "number",
                        "description": "The maximum price of the products to search.",
                    },
                    "page": {
                        "type": "integer",
                        "description": "The page number for pagination.",
                        "default": 1,
                    },
                    "page_size": {
                        "type": "integer",
                        "description": "The number of results per page.",
                        "default": 20,
                    },
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "track_package",
            "description": "Track the status of a package based on the tracking number.",
            "parameters": {
                "type": "object",
                "properties": {
                    "tracking_number": {
                        "type": "integer",
                        "description": "The tracking number of the package.",
                    }
                },
                "required": ["tracking_number"],
            }
        }
    }
]

Below we define a function called run_prompt, which makes an OpenAI chat completion call with the tools above and returns any tool calls. Notice that we set tool_choice to "required", so the model will always return a function call.

import openai

client = openai.Client()


def run_prompt(input):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        tools=tools,
        tool_choice="required",
        messages=[
            {
                "role": "system",
                "content": " ",
            },
            {
                "role": "user",
                "content": input,
            },
        ],
    )

    if (
        hasattr(response.choices[0].message, "tool_calls")
        and response.choices[0].message.tool_calls is not None
        and len(response.choices[0].message.tool_calls) > 0
    ):
        return response.choices[0].message.tool_calls
    else:
        return []

Let's test it and see if it returns the right function! If we ask a question about specific products, we'll get a response that will call the product search function.

run_prompt("I'm interested in energy-efficient appliances, but I'm not sure which ones are best for a small home office. Can you help?")
[ChatCompletionMessageToolCall(
    id='call_uAdifautsiR8VHnLRoGivjuk',
    function=Function(
        arguments='{"query":"energy-efficient appliances","category":"home office","page":1,"page_size":5}',
        name='product_search'
    ),
    type='function'
)]

This call is captured as a trace in Arize, with the tool call recorded on the LLM span.
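In a full agent, the next step would be to execute the returned tool call and feed the result back to the model or router. Below is a minimal sketch of dispatching the tool calls returned by run_prompt; the product_search and track_package implementations here are placeholders for your real services.

import json

# Placeholder tool implementations; a real agent would call your
# product catalog and shipping APIs here.
def product_search(query, **kwargs):
    return [{"name": "LED desk lamp", "price": 29.99}]

def track_package(tracking_number):
    return {"tracking_number": tracking_number, "status": "in transit"}

TOOL_REGISTRY = {"product_search": product_search, "track_package": track_package}

def execute_tool_calls(tool_calls):
    results = []
    for tool_call in tool_calls:
        fn = TOOL_REGISTRY[tool_call.function.name]
        arguments = json.loads(tool_call.function.arguments)
        results.append(fn(**arguments))
    return results

execute_tool_calls(run_prompt("Where is my package? The tracking number is 84629."))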

Evaluate your agent

Once we have generated a set of test cases, we can create evaluators to measure performance. This way, we don't have to manually inspect every single trace to see if the LLM is doing the right thing.
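For example, you might generate a small dataframe of test questions and the agent's tool-call responses by running run_prompt over a list of questions; the questions below are illustrative, and the resulting response_df is what we evaluate next.

import pandas as pd

# Illustrative test questions; in practice you would pull these from a
# dataset of real or synthetic customer queries.
test_questions = [
    "Do you have any ergonomic keyboards under $100?",
    "Can you check on my package? The tracking number is 84629.",
]

rows = []
for question in test_questions:
    tool_calls = run_prompt(question)
    response = str(tool_calls[0].function) if tool_calls else ""
    rows.append({"question": question, "response": response})

# Columns match the {question} and {response} variables in the eval template below.
response_df = pd.DataFrame(rows)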

Here, we define an evaluation template to judge whether the router correctly decided to call a function. In our colab, we also add two more evaluators: one that checks whether the right function was selected, and one that checks whether the arguments were filled in correctly.

ROUTER_EVAL_TEMPLATE = """You are comparing a response to a question, and verifying whether that response should have made a function call instead of responding directly. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the Question above to the response. You must determine whether the response
decided to call the correct function.

Your response must be a single word, either "correct" or "incorrect", and should not contain any text or characters aside from that word. "incorrect" means that the agent should have made a function call instead of responding directly and did not, or the function call chosen was the incorrect one. "correct" means the selected function would correctly and fully answer the user's question.

Here is more information on each function:
- product_search: Search for products based on criteria.
- track_package: Track the status of a package based on the tracking number."""

We can evaluate our outputs using Phoenix's llm_classify function. The response_df below is a dataframe of questions and agent responses, like the one sketched above.

from phoenix.evals import OpenAIModel, llm_classify

rails = ["incorrect", "correct"]

router_eval_df = llm_classify(
    dataframe=response_df,
    template=ROUTER_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=rails,
    provide_explanation=True,
    concurrency=4,
)

If you inspect the dataframe returned by llm_classify, you'll get a table of evaluation results. Below is a formatted example from router_eval_df, merged with the corresponding question and response.

Question: Could you tell me about your selection of eco-friendly gadgets that might be suitable for a modern kitchen setup?

Response: ChatCompletionMessageToolCall(name = 'product_search', type = 'function', arguments = {'query': 'eco-friendly gadgets', 'category': 'kitchen'})

Label: correct

Explanation: The user's question is about finding eco-friendly gadgets suitable for a modern kitchen setup. The function call made is 'product_search' with the query 'eco-friendly gadgets' and category 'kitchen', which is appropriate for searching products based on the user's criteria. Therefore, the function call is correct.
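To summarize the results programmatically, you can join the evaluation labels back onto the questions and responses and compute a simple accuracy figure; a minimal sketch, assuming response_df and router_eval_df share the same index.

# Join the evaluation labels and explanations back onto the questions and
# responses (llm_classify preserves the input dataframe's index).
merged_df = response_df.join(router_eval_df[["label", "explanation"]])

# Fraction of test cases where the router chose an appropriate function.
router_accuracy = (merged_df["label"] == "correct").mean()
print(f"Router accuracy: {router_accuracy:.0%}")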

Next steps

We covered a very simple example of tracing and evaluating an agent that uses function calling to route user requests and take actions in your application. As you build more capabilities into your agent, you'll need more advanced tooling to measure and improve performance.

You can do all of the following in Arize:

  1. Manually create tool spans which log your function calling inputs, latency, and outputs (more info on tool spans, colab example); see the sketch after this list.

  2. Evaluate your agent across multiple levels, not just the router prompt (more info on evaluations).

  3. Create experiments to track changes across models, prompts, and parameters (more info on experiments).
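As a rough illustration of item 1, you can open a span around a tool implementation yourself and attach OpenInference-style span kind, input, and output attributes; the lookup logic below is a placeholder.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def traced_product_search(query, **kwargs):
    # Manually created tool span. The attribute names follow the
    # OpenInference semantic conventions for span kind, input, and output.
    with tracer.start_as_current_span("product_search") as span:
        span.set_attribute("openinference.span.kind", "TOOL")
        span.set_attribute("input.value", query)
        results = [{"name": "LED desk lamp", "price": 29.99}]  # placeholder lookup
        span.set_attribute("output.value", str(results))
        return results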

You can also follow our end-to-end colab example, which covers all of the above with code.

