ReAct Prompting

ReAct (Reasoning + Acting) is a prompting technique that enables LLMs to think step-by-step before taking action. Unlike traditional prompting, where a model directly provides an answer, ReAct prompts guide the model to reason through a problem first, then decide which tools or actions are necessary to reach the best solution.

ReAct is ideal for situations that require multi-step problem-solving with external tools. It also improves transparency by clearly showing the reasoning behind each tool choice, making it easier to understand and refine the model's actions.

In this tutorial, you will:

  • Learn how to craft prompts, tools, and evaluators in Phoenix

  • Refine your prompts to understand the power of ReAct prompting

  • Leverage Phoenix and LLM as a Judge techniques to evaluate accuracy at each step, gaining insight into the model's thought process

  • Learn how to apply ReAct prompting in real-world scenarios for improved task execution and problem-solving

⚠️ You'll need an OpenAI API key for this tutorial.

Let’s get started! 🚀

Set up Dependencies and Keys

!pip install -qqq "arize-phoenix>=8.0.0" datasets openinference-instrumentation-openai

Next, you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also connect to a self-hosted Phoenix instance if you'd prefer.

import os
from getpass import getpass

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
import nest_asyncio
import pandas as pd
from openai import OpenAI
from openai.types.chat.completion_create_params import CompletionCreateParamsBase
from openinference.instrumentation.openai import OpenAIInstrumentor

import phoenix as px
from phoenix.client import Client as PhoenixClient
from phoenix.client.types import PromptVersion
from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    OpenAIModel,
    llm_classify,
)
from phoenix.experiments import run_experiment
from phoenix.otel import register

nest_asyncio.apply()

Instrument Application

tracer_provider = register(
    project_name="ReAct-examples", endpoint="https://app.phoenix.arize.com/v1/traces"
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
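
(Optional) As a quick sanity check, not part of the original notebook: once the instrumentor is registered, any OpenAI call made from this notebook should appear as a trace under the ReAct-examples project in Phoenix.

# Hypothetical smoke test: this single call should show up as a trace in the Phoenix UI.
from openai import OpenAI

OpenAI().chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello from the ReAct tutorial!"}],
)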

Load Dataset Into Phoenix

This dataset contains 20 customer service questions that a customer might ask a store's chatbot. As we dive into ReAct prompting, we'll use these questions to guide the LLM in selecting the appropriate tools.

Here, we also import the Phoenix Client, which enables us to create and modify prompts directly within the notebook while seamlessly syncing changes to the Phoenix UI.

After running this cell, the dataset will appear under the Datasets tab in Phoenix.

from datasets import load_dataset

ds = load_dataset("syeddula/customer_questions")["train"]
ds = ds.to_pandas()
ds.head()
import uuid

unique_id = uuid.uuid4()

# Upload the dataset to Phoenix
dataset = px.Client().upload_dataset(
    dataframe=ds,
    input_keys=["Questions"],
    dataset_name=f"customer-questions-{unique_id}",
)

Define Tools

Next, let's define the tools available for the LLM to use. We have five tools at our disposal, each serving a specific purpose: Product Comparison, Product Details, Discounts, Customer Support, and Track Package.

Depending on the customer's question, the LLM will determine the optimal sequence of tools to use.

tools = [
    {
        "type": "function",
        "function": {
            "name": "product_comparison",
            "description": "Compare features of two products.",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_a_id": {
                        "type": "string",
                        "description": "The unique identifier of Product A.",
                    },
                    "product_b_id": {
                        "type": "string",
                        "description": "The unique identifier of Product B.",
                    },
                },
                "required": ["product_a_id", "product_b_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "product_details",
            "description": "Get detailed features on one product.",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_id": {
                        "type": "string",
                        "description": "The unique identifier of the Product.",
                    }
                },
                "required": ["product_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "apply_discount_code",
            "description": "Checks for discounts and promotions. Applies a discount code to an order.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "integer",
                        "description": "The unique identifier of the order.",
                    },
                    "discount_code": {
                        "type": "string",
                        "description": "The discount code to apply.",
                    },
                },
                "required": ["order_id", "discount_code"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "customer_support",
            "description": "Get contact information for customer support regarding an issue.",
            "parameters": {
                "type": "object",
                "properties": {
                    "issue_type": {
                        "type": "string",
                        "description": "The type of issue (e.g., billing, technical support).",
                    }
                },
                "required": ["issue_type"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "track_package",
            "description": "Track the status of a package based on the tracking number.",
            "parameters": {
                "type": "object",
                "properties": {
                    "tracking_number": {
                        "type": "integer",
                        "description": "The tracking number of the package.",
                    }
                },
                "required": ["tracking_number"],
            },
        },
    },
]

Initial Prompt

Let's start by defining a simple prompt that instructs the system to utilize the available tools to answer the questions. The choice of which tools to use, and how to apply them, is left to the model's discretion based on the context of each customer query.

params = CompletionCreateParamsBase(
    model="gpt-4",
    temperature=0.5,
    tools=tools,
    tool_choice="auto",
    messages=[
        {
            "role": "system",
            "content": """You are a helpful customer service agent.
            Your task is to determine the best tools to use to answer a customer's question.
            Output the tools and pick 3 tools at maximum.
            """,
        },
        {"role": "user", "content": "{{questions}}"},
    ],
)

prompt_identifier = "customer-support"

prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Customer Support",
    version=PromptVersion.from_openai(params),
)

At this stage, the initial prompt is available in Phoenix under the Prompts tab. Any modifications made to the prompt moving forward will be tracked under Versions, allowing you to monitor and compare changes over time.

Prompts in Phoenix store more than just text—they also include key details such as the prompt template, model configurations, and response format, ensuring a structured and consistent approach to generating outputs.
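
Because the prompt is versioned in Phoenix, you can also pull it back down later instead of redefining it in the notebook. The snippet below is a sketch, assuming your phoenix client version exposes the prompts.get retrieval method; the example question is made up.

# Sketch (not from the original notebook): fetch the stored prompt back from Phoenix
# by its identifier and format it into OpenAI-compatible invocation parameters.
stored_prompt = PhoenixClient().prompts.get(prompt_identifier="customer-support")
invocation_params = stored_prompt.format(variables={"questions": "Where is my order?"})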

Next, we define the Tool Calling Prompt Template. In this step, we use LLM as a Judge to evaluate the output. LLM as a Judge is a technique where one LLM assesses the performance of another LLM.

This prompt is provided to the judge model, which takes in both the user's query and the tools the system selected. The judge then reasons about how effectively the chosen tools address the query and provides an explanation for its evaluation.

TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would reasonably help answer the question.
The tool calls have been generated by a separate agent, chosen from the list of
tools provided below. Your job is to decide whether that agent's response was relevant to solving the customer's question.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_calls}
    [END DATA]

Your response must be one of the following:
1. **"correct"** – The chosen tool(s) would sufficiently answer the question.
2. **"mostly_correct"** – The tool(s) are helpful, but a better selection could have been made (at most 1 missing or unnecessary tool).
3. **"incorrect"** – The tool(s) would not meaningfully help answer the question.

Explain why you made your choice.

    [Tool Definitions]:
    product_comparison: Compare features of two products.
    product_details: Get detailed features on one product.
    apply_discount_code: Applies a discount code to an order.
    customer_support: Get contact information for customer support regarding an issue.
    track_package: Track the status of a package based on the tracking number.
"""

In the following cells, we will define a task for the experiment. Then, in the evaluate_response function, we define our LLM as a Judge evaluator. Finally, we run our experiment.

def prompt_task(input):
    client = OpenAI()
    # Format the stored prompt with the question from the dataset row and call the model
    resp = client.chat.completions.create(
        **prompt.format(variables={"questions": input["Questions"]})
    )
    return resp


def evaluate_response(input, output):
    # LLM as a Judge: classify whether the selected tools reasonably answer the question
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"question": input["Questions"], "tool_calls": output}]),
        template=TOOL_CALLING_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-3.5-turbo"),
        rails=list(TOOL_CALLING_PROMPT_RAILS_MAP.values()),
        provide_explanation=True,
    )
    # Score 1 unless the judge labeled the tool choice "incorrect"
    score = response_classifications.apply(lambda x: 0 if x["label"] == "incorrect" else 1, axis=1)
    return score
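
Before running the full experiment, it can help to spot-check a single question and inspect which tools the model picked. The snippet below is illustrative and not part of the original notebook; the question is made up.

# Illustrative spot check: run the prompt task on one question and print the chosen tools.
sample = prompt_task({"Questions": "Can you compare product A123 with product B456?"})
for tool_call in sample.choices[0].message.tool_calls or []:
    print(tool_call.function.name, tool_call.function.arguments)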

Experiment

initial_experiment = run_experiment(
    dataset,
    task=prompt_task,
    evaluators=[evaluate_response],
    experiment_description="Customer Support Prompt",
    experiment_name="initial-prompt",
    experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)

After running our experiment and evaluation, we can dive deeper into the results. By clicking into the experiment, we can explore the tools that the LLM selected for the specific input. Next, if we click on the trace for the evaluation, we can see the reasoning behind the score assigned by LLM as a Judge for the output.

ReAct Prompt

Next, we iterate on our system prompt using ReAct Prompting techniques. We emphasize that the model should think through the problem step-by-step, break it down logically, and then determine which tools to use and in what order. The model is instructed to output the relevant tools along with their corresponding parameters.

This approach differs from our initial prompt because it encourages reasoning before action, guiding the model to select the best tools and parameters based on the specific context of the query, rather than simply using predefined actions.

params = CompletionCreateParamsBase(
    model="gpt-4",
    temperature=0.5,
    tools=tools,
    tool_choice="required",
    messages=[
        {
            "role": "system",
            "content": """
              You are a helpful customer service agent. Carefully analyze the customer’s question to fully understand their request.
              Step 1: Think step-by-step. Identify the key pieces of information needed to answer the question. Consider any dependencies between these pieces of information.
              Step 2: Decide which tools to use. Choose up to 3 tools that will best retrieve the required information. If multiple tools are needed, determine the correct order to call them.
              Step 3: Output the chosen tools and any relevant parameters.

            """,
        },
        {"role": "user", "content": "{{questions}}"},
    ],
)

prompt_identifier = "customer-support"

prompt = PhoenixClient().prompts.create(
    name=prompt_identifier,
    prompt_description="Customer Support ReAct Prompt",
    version=PromptVersion.from_openai(params),
)

In the Prompts tab, you will see the updated prompt. As you iterate, you can build a version history.

Just like above, we define our task, construct the evaluator, and run the experiment.

def prompt_task(input):
    client = OpenAI()
    resp = client.chat.completions.create(
        **prompt.format(variables={"questions": input["Questions"]})
    )
    return resp


def evaluate_response(input, output):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"question": input["Questions"], "tool_calls": output}]),
        template=TOOL_CALLING_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-3.5-turbo"),
        rails=list(TOOL_CALLING_PROMPT_RAILS_MAP.values()),
        provide_explanation=True,
    )
    score = response_classifications.apply(lambda x: 0 if x["label"] == "incorrect" else 1, axis=1)
    return score

Experiment

react_experiment = run_experiment(
    dataset,
    task=prompt_task,
    evaluators=[evaluate_response],
    experiment_description="Customer Support Prompt",
    experiment_name="improved-prompt",
    experiment_metadata={"prompt": "prompt_id=" + prompt.id},
)

With our updated ReAct prompt, we can observe that the LLM as a Judge Evaluator rated more outputs as correct. By clicking into the traces, we can gain insights into the reasons behind this improvement. By prompting our LLM to be more thoughtful and purposeful, we can see the reasoning and acting aspects of ReAct.

You can explore the evaluator's outputs to better understand the improvements in detail.

Keep in mind that results may vary due to randomness and the model's non-deterministic behavior.
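
If you want more repeatable comparisons between prompt versions, one option (a sketch, not part of the original tutorial) is to lower the sampling temperature and pin a seed when defining the prompt parameters; OpenAI treats the seed as a best-effort determinism hint.

# Sketch: a lower-variance variant of the ReAct prompt parameters.
# seed is a best-effort determinism hint supported by the OpenAI chat completions API.
deterministic_params = CompletionCreateParamsBase(
    model="gpt-4",
    temperature=0,
    seed=42,
    tools=tools,
    tool_choice="required",
    messages=params["messages"],  # reuse the ReAct system/user messages defined above
)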

To refine and test these prompts against other datasets, experiment with alternative techniques like Chain of Thought (CoT) prompting to assess how they complement or contrast with ReAct in your specific use cases. With Phoenix, you can seamlessly integrate this process into your workflow using both the TypeScript and Python Clients.

From here, you can check out more examples on Phoenix, and if you haven't already, please give us a star on GitHub!