NLP_Metrics

Install extra dependencies to compute LLM evaluation metrics

Install extra dependencies in the SDK:

!pip install arize[NLP_Metrics] 

Import metrics from arize.pandas.generative.nlp_metrics
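
All five metrics documented on this page are exposed by that module, so a typical import looks like:

from arize.pandas.generative.nlp_metrics import (
    bleu,
    sacre_bleu,
    google_bleu,
    rouge,
    meteor,
)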

Metrics

bleu

The BLEU score is typically used to evaluate the quality of text machine-translated from one natural language to another. BLEU calculates scores for individual translated segments, typically sentences, by comparing them against a set of high-quality reference translations. These segment scores are then averaged over the entire corpus to estimate the overall quality of the translation.
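
For reference, corpus-level BLEU combines a brevity penalty with a weighted geometric mean of modified n-gram precisions up to order N (the max_order argument below), typically with uniform weights $w_n = 1/N$:

$$
\text{BLEU} = \text{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
$$

where $p_n$ is the modified n-gram precision, $c$ is the candidate length, and $r$ is the effective reference length.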

| Argument | Expected Type | Description |
| --- | --- | --- |
| response_col | pd.Series | [Required] Pandas Series containing translations (as strings) to score |
| references_col | pd.Series | [Required] Pandas Series containing a reference, or a list of several references, per prediction |
| max_order | int | [Optional] Maximum n-gram order to use when computing the BLEU score. Defaults to 4 |
| smooth | bool | [Optional] Whether or not to apply Lin et al. 2004 smoothing. Defaults to False |

Code Example

bleu_scores = bleu(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    max_order=5,  # optional field
    smooth=True,  # optional field
)
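
The returned scores can then be attached to the dataframe for analysis or logging. The column name below is illustrative, and this sketch assumes the function returns one score per input row, aligned with the input Series:

df['bleu_score'] = bleu_scores  # assumes one score per row, aligned with df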

sacre_bleu

A hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official Workshop on Machine Translation (WMT) scores but works with plain text.

| Argument | Expected Type | Description |
| --- | --- | --- |
| response_col | pd.Series | [Required] Pandas Series containing translations (as strings) to score |
| references_col | pd.Series | [Required] Pandas Series containing a reference, or a list of several references, per prediction. Note: there must be the same number of references for each prediction (i.e., all sub-lists must be of the same length) |
| smooth_method | str | [Optional] The smoothing method to use. Defaults to 'exp'. Possible values: 'none' (no smoothing), 'floor' (increment zero counts), 'add-k' (increment num/denom by k for n > 1), 'exp' (exponential decay) |
| smooth_value | float | [Optional] The smoothing value. Defaults to None. If smooth_method='floor', smooth_value defaults to 0.1; if smooth_method='add-k', smooth_value defaults to 1 |
| lowercase | bool | [Optional] If True, lowercases the input, enabling case-insensitivity. Defaults to False |
| force | bool | [Optional] If True, insists that your tokenized input is actually detokenized. Defaults to False |
| use_effective_order | bool | [Optional] If True, stops including n-gram orders for which precision is 0. This should be True when computing sentence-level BLEU. Defaults to False |

Code Example

sacre_bleu_scores = sacre_bleu(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    smooth_method='floor',  # optional field
    smooth_value=0.1,  # optional field; 0.1 is also the default when smooth_method='floor'
    lowercase=True,  # optional field
)

google_bleu

The BLEU score was designed as a corpus-level measure and has some limitations when applied to single sentences. To address this for sentence-level rewards in reinforcement learning (RL) experiments, a variant called the GLEU score was introduced; the GLEU score is the minimum of recall and precision.
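
Concretely, in the standard formulation, matching n-grams (of orders min_len through max_len below) are counted between the candidate and the reference; precision divides that count by the number of n-grams in the candidate, recall divides it by the number of n-grams in the reference, and

$$\text{GLEU} = \min(\text{precision}, \text{recall})$$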

| Argument | Expected Type | Description |
| --- | --- | --- |
| response_col | pd.Series | [Required] Pandas Series containing translations (as strings) to score |
| references_col | pd.Series | [Required] Pandas Series containing a reference, or a list of several references, for each translation |
| min_len | int | [Optional] The minimum order of n-gram this function should extract. Defaults to 1 |
| max_len | int | [Optional] The maximum order of n-gram this function should extract. Defaults to 4 |

Code Example

google_bleu_scores = google_bleu(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    min_len=2,  # optional field
    max_len=5,  # optional field
)

rouge

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics, and the accompanying software package, commonly used to evaluate automatic summarization and machine translation in natural language processing. The metrics compare a machine-produced summary or translation against a reference, or set of references, produced by humans.
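
In its original formulation, ROUGE-N is an n-gram recall between the candidate and its references:

$$\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}$$

while ROUGE-L scores the longest common subsequence between candidate and reference; in practice, implementations often report an F-measure that combines the corresponding precision and recall.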

| Argument | Expected Type | Description |
| --- | --- | --- |
| response_col | pd.Series | [Required] Pandas Series containing predictions (as strings) to score |
| references_col | pd.Series | [Required] Pandas Series containing a reference, or a list of several references, per prediction |
| rouge_types | List[str] | [Optional] A list of ROUGE types to calculate. Defaults to ['rougeL']. Valid ROUGE types: 'rouge1' (unigram, 1-gram, based scoring), 'rouge2' (bigram, 2-gram, based scoring), 'rougeL' (longest common subsequence based scoring), 'rougeLsum' (splits the text on newlines, '\n', before scoring) |
| use_stemmer | bool | [Optional] If True, uses the Porter stemmer to strip word suffixes. Defaults to False |

Code Example

rouge_scores = rouge(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    rouge_types=['rouge1', 'rouge2', 'rougeL', 'rougeLsum'],  # optional field
)

meteor

An automatic metric typically used to evaluate machine translation. It is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations.

| Argument | Expected Type | Description |
| --- | --- | --- |
| response_col | pd.Series | [Required] Pandas Series containing predictions (as strings) to score |
| references_col | pd.Series | [Required] Pandas Series containing a reference, or a list of several references, per prediction |
| alpha | float | [Optional] Parameter for controlling the relative weights of precision and recall. Defaults to 0.9 |
| beta | float | [Optional] Parameter for controlling the shape of the penalty as a function of fragmentation. Defaults to 3 |
| gamma | float | [Optional] The relative weight assigned to the fragmentation penalty. Defaults to 0.5 |
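
For orientation, in the standard parameterized METEOR formulation these arguments enter the score roughly as follows, where P and R are unigram precision and recall and a "chunk" is a maximal run of contiguous matched unigrams:

$$F_{\text{mean}} = \frac{P \cdot R}{\alpha P + (1 - \alpha) R}, \qquad \text{Penalty} = \gamma \left(\frac{\#\text{chunks}}{\#\text{matches}}\right)^{\beta}, \qquad \text{METEOR} = F_{\text{mean}} \, (1 - \text{Penalty})$$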

Code Example

meteor_scores = meteor(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    alpha=0.8,  # optional field
    beta=4,  # optional field
    gamma=0.4,  # optional field
)
