The Agent Function Call eval can be used to determine how well a model selects a tool to use, extracts the right parameters from the user query, and generates the tool call code.
Eval Prompt:
```python
TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_call}
    [END DATA]

Your response must be single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.
"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated question.

    [Tool Definitions]: {tool_definitions}
"""
```
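To see what the judge model actually receives, you can render the template for a single case. This is a minimal sketch; the question, tool call, and tool definitions below are hypothetical placeholder values, not output from a real export:

```python
from phoenix.evals import TOOL_CALLING_PROMPT_TEMPLATE

# Hypothetical values for illustration only.
example_tool_definitions = '[{"name": "get_weather", "parameters": {"city": "string"}}]'
example_question = "What is the weather in Paris today?"
example_tool_call = 'get_weather(city="Paris")'

# Tool definitions are substituted up front (as in the example code below);
# question and tool_call are normally filled in from each dataframe row.
rendered = (
    TOOL_CALLING_PROMPT_TEMPLATE.template
    .replace("{tool_definitions}", example_tool_definitions)
    .replace("{question}", example_question)
    .replace("{tool_call}", example_tool_call)
)
print(rendered)
```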
We are continually iterating on our templates; view the most up-to-date template on GitHub.
Example Code:
```python
from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# The rails object will be used to snap responses to "correct"
# or "incorrect".
rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# Loop through the specified dataframe and run each row
# through the specified model and prompt. llm_classify
# will run requests concurrently to improve performance.
tool_call_evaluations = llm_classify(
    dataframe=df,
    template=TOOL_CALLING_PROMPT_TEMPLATE.template.replace(
        "{tool_definitions}", json_tools
    ),
    model=model,
    rails=rails,
    provide_explanation=True,
)
```
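The snippet above assumes `df` and `json_tools` are already defined. Here is a minimal sketch of constructing both, using a hypothetical tool definition and a single hand-written case; in practice, `json_tools` should hold the JSON signatures of the tools your agent can actually call:

```python
import json

import pandas as pd

# Hypothetical tool definitions; substitute the JSON signatures
# of your own agent's tools.
json_tools = json.dumps([
    {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {"city": {"type": "string"}},
    }
])

# One hand-written case; each row needs "question" and "tool_call"
# columns to match the default template.
df = pd.DataFrame(
    {
        "question": ["What is the weather in Paris today?"],
        "tool_call": ['get_weather(city="Paris")'],
    }
)
```

`llm_classify` returns a dataframe aligned with the input rows; each row gets a label snapped to the rails, plus an `explanation` column when `provide_explanation=True`.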
Parameters:
- `df` - a dataframe of cases to evaluate. The dataframe must have these columns to match the default template:
  - `tool_call` - information on the tool called and parameters included. If you've exported spans from Phoenix to evaluate, this will be the `llm.function_call` column in your exported data.
  - `question` - the query made to the model. If you've exported spans from Phoenix to evaluate, this will be the `llm.input_messages` column in your exported data.
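If you're starting from spans exported from Phoenix rather than a hand-built dataframe, a minimal sketch of mapping the exported columns onto the names the default template expects (here `exported_spans_df` is a hypothetical dataframe of exported spans; adjust the column names to match your actual export):

```python
# Hypothetical: exported_spans_df holds spans exported from Phoenix.
# Rename the exported columns to the variable names used in the template.
df = exported_spans_df.rename(
    columns={
        "llm.input_messages": "question",
        "llm.function_call": "tool_call",
    }
)[["question", "tool_call"]]
```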