Summarization Eval

When To Use Summarization Eval Template

This Eval evaluates the quality of the summaries produced by a summarization task. The template variables, which correspond to columns of the dataframe being evaluated (see the example after the list below), are:

  • document: The document text to summarize

  • summary: The summary of the document

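For example, the dataframe passed to the Eval carries these two variables as columns. Below is a minimal sketch, assuming pandas and made-up example text:

import pandas as pd

# Hypothetical example row; the "document" and "summary" columns supply the
# {document} and {summary} template variables.
df_sample = pd.DataFrame(
    {
        "document": [
            "The city council met on Tuesday and voted 7-2 to approve the new "
            "transit budget, which expands bus service to the east side."
        ],
        "summary": [
            "The council approved a transit budget expanding east-side bus service."
        ],
    }
)
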
Summarization Eval Template

    You are comparing the summary text and its original document and trying to determine
    if the summary is good. Here is the data:
    [BEGIN DATA]
    ************
    [Summary]: {summary}
    ************
    [Original Document]: {document}
    [END DATA]
    Compare the Summary above to the Original Document and determine if the Summary is
    comprehensive, concise, coherent, and independent relative to the Original Document.
    Your response must be a string, either good or bad, and should not contain any text
    or characters aside from that. The string bad means that the Summary is not comprehensive, concise,
    coherent, and independent relative to the Original Document. The string good means the Summary
    is comprehensive, concise, coherent, and independent relative to the Original Document.

We are continually iterating on our templates; view the most up-to-date template on GitHub. Last updated on 10/12/2023.
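For reference, llm_classify (shown in the How To Run section below) fills the {summary} and {document} placeholders from the matching dataframe columns. Here is a rough sketch of rendering the prompt for a single row, assuming the template contains only the two placeholders shown above:

import phoenix.experimental.evals.templates.default_templates as templates

# Render the template for one hypothetical row to inspect the prompt the model receives.
row = df_sample.iloc[0]
prompt = templates.SUMMARIZATION_PROMPT_TEMPLATE_STR.format(
    summary=row["summary"],
    document=row["document"],
)
print(prompt)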

Benchmark Results

Results are reported for GPT-4, GPT-3.5, and Claude V2; precision, recall, and F1 appear in the table below.

How To Run the Eval

import phoenix.experimental.evals.templates.default_templates as templates
from phoenix.experimental.evals import (
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# The rails constrain the model output to the values defined by the template.
# They strip stray text such as ",,," or "..." and ensure the binary label the
# template expects is returned. For this template the rails are "good" and "bad".
rails = list(templates.SUMMARIZATION_PROMPT_RAILS_MAP.values())

# df_sample is a dataframe with "document" and "summary" columns, matching the
# template variables; download_benchmark_dataset can be used to fetch a
# benchmark dataset to evaluate against.
summarization_classifications = llm_classify(
    dataframe=df_sample,
    template=templates.SUMMARIZATION_PROMPT_TEMPLATE_STR,
    model=model,
    rails=rails,
)

The snippet above shows how to run the summarization Eval template over a dataframe of documents and summaries.
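llm_classify returns a dataframe aligned with the input rows; the predicted label for each row ("good" or "bad") is in its label column. As a hedged sketch, the benchmark metrics below could be reproduced with scikit-learn, assuming the benchmark dataframe also carries a ground-truth column (called "true_label" here, a hypothetical name):

from sklearn.metrics import classification_report

# Predicted labels from the Eval ("good" or "bad").
predicted = summarization_classifications["label"]

# "true_label" is a hypothetical ground-truth column; substitute the actual
# column name in your benchmark dataset.
print(
    classification_report(
        y_true=df_sample["true_label"],
        y_pred=predicted,
        labels=rails,
    )
)

This prints per-class precision, recall, and F1, which are the metrics reported in the table below.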

Eval Summary | GPT-4 | GPT-3.5 | GPT-3.5 Instruct | Palm 2 (Text Bison) | Claude V2 | Llama 7b (soon)
Precision    | 0.79  | 1       | 1                | 0.57                | 0.75      |
Recall       | 0.88  | 0.1     | 0.16             | 0.7                 | 0.61      |
F1           | 0.83  | 0.18    | 0.280            | 0.63                | 0.67      |
