Imagine you're deploying a service for your media company's summarization model that condenses daily news into concise summaries to be displayed online. One challenge of using LLMs for summarization is that even the best models tend to be verbose.
In this tutorial, you will construct a dataset and run experiments to engineer a prompt template that produces concise yet accurate summaries. You will:
Upload a dataset of examples containing articles and human-written reference summaries to Phoenix
Define an experiment task that summarizes a news article
Devise evaluators for length and ROUGE score
Run experiments to iterate on your prompt template and to compare the summaries produced by different LLMs
⚠️ This tutorial requires an OpenAI API key and, optionally, an Anthropic API key.
from typing import Any, Dict

import nest_asyncio
import pandas as pd

nest_asyncio.apply()  # needed for concurrent evals in notebook environments
pd.set_option("display.max_colwidth", None)  # display full cells of dataframes
Launch Phoenix
Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI.
import phoenix as px

px.launch_app()
Instrument Your Application
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

endpoint = "http://127.0.0.1:6006/v1/traces"
tracer_provider = trace_sdk.TracerProvider()
tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint)))
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
Create Your Dataset
Download your data from Hugging Face and inspect a random sample of ten rows. This dataset contains news articles and human-written summaries that we will use as a reference against which to compare our LLM-generated summaries.
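The download cell isn't reproduced here, so the sketch below is an assumption: it loads the CNN/DailyMail dataset from Hugging Face (its highlights column matches the reference summaries renamed elsewhere in this tutorial) and samples ten rows with arbitrary parameters.

from datasets import load_dataset

# Assumption: CNN/DailyMail; "highlights" holds the human-written reference summaries.
hf_ds = load_dataset("abisee/cnn_dailymail", "3.0.0")
df = (
    hf_ds["test"]
    .to_pandas()
    .sample(n=10, random_state=0)  # a random sample of ten rows
    .set_index("id")
    .rename(columns={"highlights": "summary"})
)
df[["article", "summary"]]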
Upload the data as a dataset in Phoenix and follow the link in the cell output to inspect the individual examples of the dataset. Later in the notebook, you will run experiments over this dataset in order to iteratively improve your summarization application.
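The upload cell isn't shown above; here is a minimal sketch using the Phoenix client, assuming the dataframe from the previous step and a hypothetical dataset name (keyword arguments may differ slightly across Phoenix versions).

import phoenix as px

dataset = px.Client().upload_dataset(
    dataset_name="news-article-summaries",  # hypothetical name
    dataframe=df,
    input_keys=["article"],   # read by the experiment task
    output_keys=["summary"],  # the reference summary used by the ROUGE evaluators
)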
A task is a callable that maps the input of a dataset example to an output by invoking a chain, query engine, or LLM. An experiment maps a task across all the examples in a dataset and optionally executes evaluators to grade the task outputs.
You'll start by defining your task, which in this case, invokes OpenAI. First, set your OpenAI API key if it is not already present as an environment variable.
import osfrom getpass import getpassif os.environ.get("OPENAI_API_KEY")isNone: os.environ["OPENAI_API_KEY"]=getpass("🔑 Enter your OpenAI API key: ")
Next, define a function to format a prompt template and invoke an OpenAI model on an example.
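The cell defining this function isn't reproduced above; the sketch below is one plausible implementation, assuming the async OpenAI client (the task is awaited in the next cell) and that each dataset example exposes its fields via example.input.

from openai import AsyncOpenAI
from phoenix.experiments.types import Example

openai_client = AsyncOpenAI()


async def summarize_article_openai(example: Example, prompt_template: str, model: str) -> str:
    # Fill the prompt template with the article from the dataset example.
    formatted_prompt = prompt_template.format(article=example.input["article"])
    response = await openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": formatted_prompt}],
    )
    return response.choices[0].message.content or ""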
From this function, you can use functools.partial to derive your first task, which is a callable that takes in an example and returns an output. Test out your task by invoking it on a test example from your dataset.
import textwrap
from functools import partial

template = """Summarize the article in two to four sentences:

ARTICLE
=======
{article}

SUMMARY
=======
"""

gpt_4o = "gpt-4o-2024-05-13"
task = partial(summarize_article_openai, prompt_template=template, model=gpt_4o)
test_example = dataset.examples[0]
print(textwrap.fill(await task(test_example), width=100))
Define Your Evaluators
Evaluators take the output of a task (in this case, a string) and grade it, often with the help of an LLM. Here, you will create ROUGE score evaluators that compare the LLM-generated summaries with the human-written reference summaries you uploaded as part of your dataset. There are several variants of ROUGE, but we'll use ROUGE-1 for simplicity (a code sketch of the evaluators follows these definitions):
ROUGE-1 precision is the proportion of tokens in the generated summary that are overlapping tokens, i.e., tokens present in both the generated and reference summaries (number of overlapping tokens / number of tokens in the generated summary)
ROUGE-1 recall is the proportion of tokens in the reference summary that are overlapping tokens (number of overlapping tokens / number of tokens in the reference summary)
ROUGE-1 F1 score is the harmonic mean of precision and recall, providing a single number that balances these two scores.
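The evaluator definitions aren't shown above, so here is a minimal sketch of the EVALUATORS referenced by the experiments below. It assumes the rouge and tiktoken libraries and relies on Phoenix binding evaluator arguments by name (output is the generated summary, expected is the example's reference output).

from typing import Any, Dict

import tiktoken
from rouge import Rouge


def _rouge_1(hypothesis: str, reference: str) -> Dict[str, Any]:
    # get_scores returns e.g. [{"rouge-1": {"r": ..., "p": ..., "f": ...}, ...}]
    return Rouge().get_scores(hyps=hypothesis, refs=reference)[0]["rouge-1"]


def rouge_1_precision(output: str, expected: Dict[str, Any]) -> float:
    return _rouge_1(output, expected["summary"])["p"]


def rouge_1_recall(output: str, expected: Dict[str, Any]) -> float:
    return _rouge_1(output, expected["summary"])["r"]


def rouge_1_f1(output: str, expected: Dict[str, Any]) -> float:
    return _rouge_1(output, expected["summary"])["f"]


def num_tokens(output: str) -> int:
    # Length evaluator: token count of the generated summary (tokenizer choice is not critical).
    return len(tiktoken.encoding_for_model(gpt_4o).encode(output))


EVALUATORS = (num_tokens, rouge_1_f1, rouge_1_precision, rouge_1_recall)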
Run Experiments and Iterate on Your Prompt Template
Run your first experiment and follow the link in the cell output to inspect the task outputs (generated summaries) and evaluations.
from phoenix.experiments import run_experiment

experiment_results = run_experiment(
    dataset,
    task,
    experiment_name="initial-template",
    experiment_description="first experiment using a simple prompt template",
    experiment_metadata={"vendor": "openai", "model": gpt_4o},
    evaluators=EVALUATORS,
)
Our initial prompt template contained little guidance. It resulted in a ROUGE-1 F1 score just above 0.3 (this will vary from run to run). Inspecting the task outputs of the experiment, you'll also notice that the generated summaries are far more verbose than the reference summaries. This results in high ROUGE-1 recall and low ROUGE-1 precision. Let's see if we can improve our prompt to make our summaries more concise and to balance those recall and precision scores while maintaining or improving F1. We'll start by explicitly instructing the LLM to produce a concise summary.
template ="""Summarize the article in two to four sentences. Be concise and include only the most important information.ARTICLE======={article}SUMMARY======="""task =partial(summarize_article_openai, prompt_template=template, model=gpt_4o)experiment_results =run_experiment( dataset, task, experiment_name="concise-template", experiment_description="explicitly instuct the llm to be concise", experiment_metadata={"vendor": "openai", "model": gpt_4o}, evaluators=EVALUATORS,)
Inspecting the experiment results, you'll notice that the average num_tokens has indeed decreased, but the generated summaries are still far more verbose than the reference summaries.
Instead of just instructing the LLM to produce concise summaries, let's use a few-shot prompt to show it examples of articles and good summaries. The cell below includes a few articles and reference summaries in an updated prompt template.
# examples to include (not included in the uploaded dataset)
train_df = (
    hf_ds["train"]
    .to_pandas()
    .sample(n=5, random_state=42)
    .head()
    .rename(columns={"highlights": "summary"})
)

example_template = """ARTICLE
=======
{article}

SUMMARY
=======
{summary}
"""

examples = "\n".join(
    [
        example_template.format(article=row["article"], summary=row["summary"])
        for _, row in train_df.iterrows()
    ]
)

template = """Summarize the article in two to four sentences. Be concise and include only the most important information, as in the examples below.

EXAMPLES
========
{examples}

Now summarize the following article.

ARTICLE
=======
{article}

SUMMARY
=======
"""

template = template.format(
    examples=examples,
    article="{article}",
)
print(template)
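The cell that runs this experiment isn't reproduced above; here is a sketch mirroring the earlier experiments, with a hypothetical experiment name.

task = partial(summarize_article_openai, prompt_template=template, model=gpt_4o)
experiment_results = run_experiment(
    dataset,
    task,
    experiment_name="few-shot-template",  # hypothetical name
    experiment_description="add few-shot examples to the prompt",
    experiment_metadata={"vendor": "openai", "model": gpt_4o},
    evaluators=EVALUATORS,
)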
By including examples in the prompt, you'll notice a steep decline in the number of tokens per summary while maintaining F1.
Compare With Another Model (Optional)
⚠️ This section requires an Anthropic API key.
Now that you have a prompt template that is performing reasonably well, you can compare the performance of other models on this particular task. Anthropic's Claude is notable for producing concise and to-the-point output.
First, enter your Anthropic API key if it is not already present.
import osfrom getpass import getpassif os.environ.get("ANTHROPIC_API_KEY")isNone: os.environ["ANTHROPIC_API_KEY"]=getpass("🔑 Enter your Anthropic API key: ")
Next, define a new task that summarizes articles using the same prompt template as before. Then, run the experiment.
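The corresponding cells aren't shown above; the sketch below assumes the async Anthropic client, a Claude 3.5 Sonnet model ID, and a task function mirroring summarize_article_openai (names are hypothetical).

import anthropic

anthropic_client = anthropic.AsyncAnthropic()
claude_35_sonnet = "claude-3-5-sonnet-20240620"  # assumed model ID


async def summarize_article_anthropic(example: Example, prompt_template: str, model: str) -> str:
    # Same prompt template as before, sent to an Anthropic model.
    formatted_prompt = prompt_template.format(article=example.input["article"])
    response = await anthropic_client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": formatted_prompt}],
    )
    return response.content[0].text


task = partial(summarize_article_anthropic, prompt_template=template, model=claude_35_sonnet)
experiment_results = run_experiment(
    dataset,
    task,
    experiment_name="anthropic-few-shot",  # hypothetical name
    experiment_description="compare claude with the few-shot prompt template",
    experiment_metadata={"vendor": "anthropic", "model": claude_35_sonnet},
    evaluators=EVALUATORS,
)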
If your experiment does not produce more concise summaries, inspect the individual results. You may notice that some summaries from Claude 3.5 Sonnet start with a preamble such as:
Here is a concise 3-sentence summary of the article...
See if you can tweak the prompt and re-run the experiment to exclude this preamble from Claude's output. Doing so should result in the most concise summaries yet.
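One possible tweak (an assumption, not the tutorial's prescribed solution) is to add an explicit instruction against preambles and re-run the experiment:

# Hypothetical tweak: ask for the summary text only, with no preamble.
template_no_preamble = template.replace(
    "Now summarize the following article.",
    "Now summarize the following article. Respond with only the summary itself, with no preamble.",
)

task = partial(summarize_article_anthropic, prompt_template=template_no_preamble, model=claude_35_sonnet)
experiment_results = run_experiment(
    dataset,
    task,
    experiment_name="anthropic-no-preamble",  # hypothetical name
    experiment_description="instruct claude to omit the preamble",
    experiment_metadata={"vendor": "anthropic", "model": claude_35_sonnet},
    evaluators=EVALUATORS,
)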
Synopsis and Next Steps
Congrats! In this tutorial, you have:
Created a Phoenix dataset
Defined an experimental task and custom evaluators
Iteratively improved a prompt template to produce more concise summaries with balanced ROUGE-1 precision and recall
As next steps, you can continue to iterate on your prompt template. If you find that you are unable to improve your summaries with further prompt engineering, you can export your dataset from Phoenix and use the OpenAI fine-tuning API to train a bespoke model for your needs.