Agent Function Calling Eval

The Agent Function Call eval can be used to determine how well a model selects a tool to use, extracts the right parameters from the user query, and generates the tool call code.

Function Calling Eval Template

TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_call}
    [END DATA]

Your response must be single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated question.

    [Tool Definitions]: {tool_definitions}
"""

Running an Agent Eval using the Function Calling Template

from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# the rails object will be used to snap responses to "correct" 
# or "incorrect"
rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# Loop through the specified dataframe and run each row 
# through the specified model and prompt. llm_classify
# will run requests concurrently to improve performance.
tool_call_evaluations = llm_classify(
    dataframe=df,
    template=TOOL_CALLING_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True
)

Parameters:

df - a dataframe of cases to evaluate. The dataframe must have these columns to match the default template:

Parameter Extraction Only

This template instead evaluates only the parameter extraction step of a router:

You are comparing a function call response to a question and trying to determine if the generated call has extracted the exact right parameters from the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the parameters in the generated function against the JSON provided below.
The parameters extracted from the question must match the JSON below exactly.
Your response must be single word, either "correct", "incorrect", or "not-applicable",
and should not contain any text or characters aside from that word.

"correct" means the function call parameters match the JSON below and provides only relevant information.
"incorrect" means that the parameters in the function do not match the JSON schema below exactly, or the generated function does not correctly answer the user's question. You should also respond with "incorrect" if the response makes up information that is not in the JSON schema.
"not-applicable" means that response was not a function call.

Here is more information on each function:
{function_defintions}

PreviousSQL Generation Eval NextAgent Path Convergence

Last updated 1 month ago

Was this helpful?

Function Calling Eval Template

TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_call}
    [END DATA]

Your response must be single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated question.

    [Tool Definitions]: {tool_definitions}
"""

We are continually iterating our templates, view the most up-to-date template .

Running an Agent Eval using the Function Calling Template

from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# the rails object will be used to snap responses to "correct" 
# or "incorrect"
rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# Loop through the specified dataframe and run each row 
# through the specified model and prompt. llm_classify
# will run requests concurrently to improve performance.
tool_call_evaluations = llm_classify(
    dataframe=df,
    template=TOOL_CALLING_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True
)

Parameters:

df - a dataframe of cases to evaluate. The dataframe must have these columns to match the default template:

question - the query made to the model. If you've to evaluate, this will the llm.input_messages column in your exported data.
tool_call - information on the tool called and parameters included. If you've to evaluate, this will be the llm.function_call column in your exported data.

Parameter Extraction Only

This template instead evaluates only the parameter extraction step of a router:

You are comparing a function call response to a question and trying to determine if the generated call has extracted the exact right parameters from the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the parameters in the generated function against the JSON provided below.
The parameters extracted from the question must match the JSON below exactly.
Your response must be single word, either "correct", "incorrect", or "not-applicable",
and should not contain any text or characters aside from that word.

"correct" means the function call parameters match the JSON below and provides only relevant information.
"incorrect" means that the parameters in the function do not match the JSON schema below exactly, or the generated function does not correctly answer the user's question. You should also respond with "incorrect" if the response makes up information that is not in the JSON schema.
"not-applicable" means that response was not a function call.

Here is more information on each function:
{function_defintions}