Agent Tool Calling

This article covers evaluating how well your agent selects a tool to use, extracts the right parameters from the user query, and generates the tool call code.

Agents are often heavy users of tool calling, which can also serve as a proxy for workflow selection or dialog tree management. Given a set of tools and an input, which one should be chosen?
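
For example, the agent router might choose from a small set of tool definitions like the ones below. The tool names and parameters here are hypothetical, shown only to illustrate what a tool set and a generated tool call look like:

# Hypothetical tool definitions the agent router can choose from,
# written in a JSON-schema style similar to common function-calling APIs.
tool_definitions = [
    {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    {
        "name": "get_stock_price",
        "description": "Look up the latest price for a stock ticker.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
]

# For the question "What's the weather in Tokyo right now?", a well-behaved
# agent should generate a call like this, which is what gets evaluated:
tool_call = {"name": "get_weather", "arguments": {"city": "Tokyo"}}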

Depending on your use case, you may want to expand your evaluation to include other cases, such as the following (a sketch of a small test set covering a few of these cases appears after the list):

  • Missing context, short context, and long context

  • No functions should be called, one function should be called, or multiple functions should be called

  • Functions are available, but they are the wrong ones

  • Vague or opaque parameters in the query vs. very specific parameters in the query

  • Single turn vs. multi-turn conversation pathways
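
As a sketch, a small test set covering a few of these cases might look like the following. The questions, tool calls, and expected labels are invented examples that reuse the hypothetical get_weather and get_stock_price tools above; the expected_label column is optional ground truth you can use to check the judge itself:

import pandas as pd

# Each row pairs a user question with the tool call the agent actually produced.
test_cases = pd.DataFrame(
    [
        # correct tool, parameter taken directly from the question
        {"question": "What's the weather in Paris?",
         "tool_call": '{"name": "get_weather", "arguments": {"city": "Paris"}}',
         "expected_label": "correct"},
        # no tool should have been called at all
        {"question": "Thanks, that's all I needed!",
         "tool_call": '{"name": "get_weather", "arguments": {"city": "Paris"}}',
         "expected_label": "incorrect"},
        # vague parameters in the query; the agent guessed a value
        {"question": "How's it looking outside near the big tower?",
         "tool_call": '{"name": "get_weather", "arguments": {"city": "Paris"}}',
         "expected_label": "incorrect"},
        # the wrong tool was chosen for the question
        {"question": "What is AAPL trading at?",
         "tool_call": '{"name": "get_weather", "arguments": {"city": "Cupertino"}}',
         "expected_label": "incorrect"},
    ]
)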

Prompt Template

In this prompt template, we are testing the single-turn, no-context, one-function-call case for an agent router and evaluating the whole tool call.

You can also find more narrowly scoped evaluation tasks for agent tool selection and agent parameter extraction individually.

You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_call}
    [END DATA]

Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.

"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated question.

[TOOL DEFINITIONS START]
{tool_definitions}
[TOOL DEFINITIONS END]

How to Run:

import pandas as pd

from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# the JSON-serialized tool definitions the agent chooses from
json_tools = "<INSERT TOOL DEFINITIONS AS JSON>"

df = pd.DataFrame(
    {"question": ["<INSERT QUESTION>"], "tool_call": ["<INSERT TOOL CALL>"]}
)

# the rails object will be used to snap responses to "correct" 
# or "incorrect"
rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# Loop through the specified dataframe and run each row 
# through the specified model and prompt. llm_classify
# will run requests concurrently to improve performance.
tool_call_evaluations = llm_classify(
    dataframe=df,
    template=TOOL_CALLING_PROMPT_TEMPLATE.template.replace("{tool_definitions}", json_tools),
    model=model,
    rails=rails,
    provide_explanation=True
)
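
llm_classify returns a dataframe with one row per input, including a "label" column and, because provide_explanation=True, an "explanation" column. A minimal sketch of joining those results back onto the inputs and summarizing them (column names follow the example above):

# Attach the judge's labels and explanations to the original rows.
results = df.copy()
results["label"] = tool_call_evaluations["label"].values
results["explanation"] = tool_call_evaluations["explanation"].values

# Fraction of tool calls the judge marked "correct".
accuracy = (results["label"] == "correct").mean()
print(f"Tool-calling accuracy: {accuracy:.2%}")

# Inspect the rows marked "incorrect" along with the judge's reasoning.
print(results.loc[results["label"] == "incorrect", ["question", "tool_call", "explanation"]])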
