Summarization
Imagine you're deploying a summarization service for your media company that condenses daily news articles into concise summaries to be displayed online. One challenge of using LLMs for summarization is that even the best models tend to be verbose.
In this tutorial, you will construct a dataset and run experiments to engineer a prompt template that produces concise yet accurate summaries. You will:
Upload a dataset of examples containing articles and human-written reference summaries to Phoenix
Define an experiment task that summarizes a news article
Devise evaluators for length and ROUGE score
Run experiments to iterate on your prompt template and to compare the summaries produced by different LLMs
⚠️ This tutorial requires an OpenAI API key, and optionally, an Anthropic API key.
Let's get started!
Install Dependencies and Import Libraries
Install requirements and import libraries.
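If you're starting from a clean environment, the install-and-import cell might look something like the following sketch; the exact package list is an assumption and may differ from your notebook.

```python
# Packages used throughout this tutorial (the list is an assumption).
%pip install -q arize-phoenix openai anthropic datasets tiktoken rouge nest_asyncio openinference-instrumentation-openai

import nest_asyncio

# Allow Phoenix's async experiment runner to work inside a notebook event loop.
nest_asyncio.apply()
```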
Launch Phoenix
Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI.
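A minimal sketch of launching a local Phoenix instance from the notebook:

```python
import phoenix as px

# Start a local Phoenix server; the printed URL opens the Phoenix UI.
session = px.launch_app()
print(session.url)
```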
Instrument Your Application
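One way to do this is with the OpenInference OpenAI instrumentor, so that the LLM calls made during your experiments are traced in Phoenix. This is a sketch; your notebook may use a different instrumentation setup.

```python
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

# Route OpenAI spans to the local Phoenix instance launched above.
tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```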
Create Your Dataset
Download your data from HuggingFace and inspect a random sample of ten rows. This dataset contains news articles and human-written summaries that we will use as references against which to compare our LLM-generated summaries.
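A sketch of what the download step can look like, assuming a news summarization dataset such as CNN/DailyMail; the dataset identifier, sample size, and column names are assumptions.

```python
from datasets import load_dataset

# Load a news summarization dataset from HuggingFace (dataset id and split are assumptions).
hf_dataset = load_dataset("abisee/cnn_dailymail", "3.0.0", split="test")

# Keep a small sample and rename the reference-summary column for clarity.
df = (
    hf_dataset.to_pandas()
    .sample(n=100, random_state=0)[["article", "highlights"]]
    .rename(columns={"highlights": "summary"})
)
df.sample(10)  # inspect a random sample of ten rows
```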
Upload the data as a dataset in Phoenix and follow the link in the cell output to inspect the individual examples of the dataset. Later in the notebook, you will run experiments over this dataset in order to iteratively improve your summarization application.
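Uploading the dataframe as a Phoenix dataset can look like this sketch; the dataset name and column names follow the assumptions above.

```python
import phoenix as px

# The input is the article to summarize; the output is the human-written reference summary.
dataset = px.Client().upload_dataset(
    dataset_name="news-article-summaries",
    dataframe=df,
    input_keys=["article"],
    output_keys=["summary"],
)
```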
Define Your Experiment Task
A task is a callable that maps the input of a dataset example to an output by invoking a chain, query engine, or LLM. An experiment maps a task across all the examples in a dataset and optionally executes evaluators to grade the task outputs.
You'll start by defining your task, which in this case, invokes OpenAI. First, set your OpenAI API key if it is not already present as an environment variable.
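For example:

```python
import os
from getpass import getpass

# Prompt for the key only if it isn't already set in the environment.
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```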
Next, define a function to format a prompt template and invoke an OpenAI model on an example.
From this function, you can use functools.partial to derive your first task, which is a callable that takes in an example and returns an output. Test out your task by invoking it on the test example.
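A sketch of what this can look like; the prompt template wording, model name, and the way the example's input is accessed are assumptions.

```python
from functools import partial

from openai import OpenAI

client = OpenAI()

# An initial template that offers little guidance beyond asking for a summary (assumption).
INITIAL_TEMPLATE = "Summarize the following article:\n\n{article}"

def summarize_article_openai(example, prompt_template: str, model: str) -> str:
    # Format the template with the article from the dataset example's input.
    prompt = prompt_template.format(article=example.input["article"])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""

# Bind the template and model to get a task: a callable from example to output.
task = partial(summarize_article_openai, prompt_template=INITIAL_TEMPLATE, model="gpt-4o")
```

You can sanity-check the task by calling it on a single example from the dataset before running a full experiment.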
Define Your Evaluators
Evaluators take the output of a task (in this case, a string) and grade it, often with the help of an LLM. In your case, you will create ROUGE score evaluators to compare the LLM-generated summaries with the human reference summaries you uploaded as part of your dataset. There are several variants of ROUGE, but we'll use ROUGE-1 for simplicity:
ROUGE-1 precision is the proportion of tokens in the generated summary that also appear in the reference summary (number of overlapping tokens / number of tokens in the generated summary)
ROUGE-1 recall is the proportion of tokens in the reference summary that also appear in the generated summary (number of overlapping tokens / number of tokens in the reference summary)
ROUGE-1 F1 score is the harmonic mean of precision and recall, providing a single number that balances these two scores.
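For example (a toy illustration, not drawn from the dataset): if the reference summary is "the cat sat on the mat" and the generated summary is "the cat sat down on a mat", the overlapping tokens are "the", "cat", "sat", "on", and "mat", giving a ROUGE-1 precision of 5/7 ≈ 0.71, a recall of 5/6 ≈ 0.83, and an F1 score of about 0.77.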
Higher ROUGE scores mean that a generated summary is more similar to the corresponding reference summary. Scores near 0.5 are considered excellent, and a model fine-tuned on this particular dataset achieved a ROUGE score of ~0.44.
Since conciseness also matters, you'll define an additional evaluator that counts the number of tokens in each generated summary.
Note that you can use any third-party library you like while defining evaluators (in your case, rouge and tiktoken).
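Here's a sketch of what these evaluators can look like; the function names and the "summary" key on the expected output follow the dataset sketch above and are assumptions.

```python
import tiktoken
from rouge import Rouge

rouge_scorer = Rouge()
encoding = tiktoken.encoding_for_model("gpt-4o")

def _rouge_1(hypothesis: str, reference: str) -> dict:
    # The rouge package returns one dict per input pair; grab the ROUGE-1 scores.
    return rouge_scorer.get_scores(hypothesis, reference)[0]["rouge-1"]

def rouge_1_precision(output: str, expected: dict) -> float:
    return _rouge_1(output, expected["summary"])["p"]

def rouge_1_recall(output: str, expected: dict) -> float:
    return _rouge_1(output, expected["summary"])["r"]

def rouge_1_f1(output: str, expected: dict) -> float:
    return _rouge_1(output, expected["summary"])["f"]

def num_tokens(output: str) -> int:
    # Count tokens in the generated summary with tiktoken.
    return len(encoding.encode(output))

EVALUATORS = [rouge_1_precision, rouge_1_recall, rouge_1_f1, num_tokens]
```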
Run Experiments and Iterate on Your Prompt Template
Run your first experiment and follow the link in the cell output to inspect the task outputs (generated summaries) and evaluations.
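Following the sketches above, running the experiment might look like:

```python
from phoenix.experiments import run_experiment

experiment = run_experiment(
    dataset,
    task,
    evaluators=EVALUATORS,
    experiment_name="initial-template",
)
```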
Our initial prompt template contained little guidance. It resulted in a ROUGE-1 F1 score just above 0.3 (this will vary from run to run). Inspecting the task outputs of the experiment, you'll also notice that the generated summaries are far more verbose than the reference summaries. This results in high ROUGE-1 recall and low ROUGE-1 precision. Let's see if we can improve our prompt to make our summaries more concise and to balance out those recall and precision scores while maintaining or improving F1. We'll start by explicitly instructing the LLM to produce a concise summary.
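For example (the exact instruction wording is an assumption):

```python
CONCISE_TEMPLATE = (
    "Write a concise summary of the following article in one or two sentences, "
    "covering only the most important information.\n\n{article}"
)

concise_task = partial(summarize_article_openai, prompt_template=CONCISE_TEMPLATE, model="gpt-4o")
experiment = run_experiment(
    dataset,
    concise_task,
    evaluators=EVALUATORS,
    experiment_name="concise-instruction",
)
```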
Inspecting the experiment results, you'll notice that the average num_tokens has indeed decreased, but the generated summaries are still far more verbose than the reference summaries.
Instead of just instructing the LLM to produce concise summaries, let's use a few-shot prompt to show it examples of articles and good summaries. The cell below includes a few articles and reference summaries in an updated prompt template.
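A sketch of the shape such a template might take; the angle-bracket placeholders stand in for real article/summary pairs (for example, rows from the downloaded dataframe that are not part of the uploaded dataset), and the wording is an assumption.

```python
FEW_SHOT_TEMPLATE = """\
Write a concise summary of the final article, matching the style and length of the
example summaries below.

ARTICLE:
<example article 1>

SUMMARY:
<example summary 1>

ARTICLE:
<example article 2>

SUMMARY:
<example summary 2>

ARTICLE:
{article}

SUMMARY:"""

few_shot_task = partial(summarize_article_openai, prompt_template=FEW_SHOT_TEMPLATE, model="gpt-4o")
```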
Now run the experiment.
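Continuing the sketch:

```python
experiment = run_experiment(
    dataset,
    few_shot_task,
    evaluators=EVALUATORS,
    experiment_name="few-shot-examples",
)
```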
By including examples in the prompt, you'll notice a steep decline in the number of tokens per summary while maintaining F1.
Compare With Another Model (Optional)
⚠️ This section requires an Anthropic API key.
Now that you have a prompt template that is performing reasonably well, you can compare the performance of other models on this particular task. Anthropic's Claude is notable for producing concise and to-the-point output.
First, enter your Anthropic API key if it is not already present.
Next, define a new task that summarizes articles using the same prompt template as before. Then, run the experiment.
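A sketch of what the Claude task can look like; the model identifier and max_tokens value are assumptions.

```python
import anthropic

anthropic_client = anthropic.Anthropic()

def summarize_article_anthropic(example, prompt_template: str, model: str) -> str:
    # Same prompt template as before, but invoked against the Anthropic Messages API.
    prompt = prompt_template.format(article=example.input["article"])
    response = anthropic_client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

claude_task = partial(
    summarize_article_anthropic,
    prompt_template=FEW_SHOT_TEMPLATE,
    model="claude-3-5-sonnet-20240620",
)

experiment = run_experiment(
    dataset,
    claude_task,
    evaluators=EVALUATORS,
    experiment_name="claude-3-5-sonnet",
)
```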
If your experiment does not produce more concise summaries, inspect the individual results. You may notice that some summaries from Claude 3.5 Sonnet begin with a preamble announcing the summary rather than with the summary itself.
See if you can tweak the prompt and re-run the experiment to exclude this preamble from Claude's output. Doing so should result in the most concise summaries yet.
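One possible tweak, sketched below with assumed wording, is to add an explicit instruction to respond with the summary text only:

```python
# Prepend an instruction discouraging any preamble (wording is an assumption).
NO_PREAMBLE_TEMPLATE = (
    "Respond with only the summary itself; do not begin with a preamble such as "
    "'Here is a summary'.\n\n" + FEW_SHOT_TEMPLATE
)

no_preamble_task = partial(
    summarize_article_anthropic,
    prompt_template=NO_PREAMBLE_TEMPLATE,
    model="claude-3-5-sonnet-20240620",
)

experiment = run_experiment(
    dataset,
    no_preamble_task,
    evaluators=EVALUATORS,
    experiment_name="claude-no-preamble",
)
```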
Synopsis and Next Steps
Congrats! In this tutorial, you have:
Created a Phoenix dataset
Defined an experimental task and custom evaluators
Iteratively improved a prompt template to produce more concise summaries with balanced ROUGE-1 precision and recall
As next steps, you can continue to iterate on your prompt template. If you find that you are unable to improve your summaries with further prompt engineering, you can export your dataset from Phoenix and use the OpenAI fine-tuning API to train a bespoke model for your needs.