NLP_Metrics
Install extra dependencies to compute LLM evaluation metrics
Note: a minimum Arize SDK version is required for NLP Metrics.
Install the extra dependencies in the SDK:
Then import the metrics from `arize.pandas.generative.nlp_metrics`.
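Assuming standard pip extra syntax and the extra name shown in this page's title (verify the exact spelling against the SDK's install documentation), the setup might look like:

```shell
# Install the Arize SDK with the NLP metrics extra
# (extra name assumed from this page's title)
pip install 'arize[NLP_Metrics]'
```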
bleu
BLEU score is typically used to evaluate the quality of machine-translated text from one natural language to another. BLEU calculates scores for individual translated segments, typically sentences, by comparing them to a set of high-quality reference translations. These scores are then averaged over the entire corpus to obtain an estimate of the overall quality of the translation.
- `response_col` (pd.Series): [Required] Pandas Series containing translations (as strings) to score
- `references_col` (pd.Series): [Required] Pandas Series containing a reference, or list of several references, per prediction
- `max_order` (int): [Optional] Maximum n-gram order to use when computing the BLEU score. Defaults to 4
- `smooth` (bool): [Optional] Whether to apply Lin et al. 2004 smoothing. Defaults to False
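To make the mechanics concrete, here is a minimal single-sentence sketch of BLEU using only the standard library: clipped n-gram precisions are combined geometrically and scaled by a brevity penalty. This is a conceptual illustration, not the SDK's implementation.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(response, references, max_order=4, smooth=False):
    """Single-sentence BLEU sketch (illustrative only)."""
    resp = response.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_order + 1):
        resp_counts = ngrams(resp, n)
        # Clip each n-gram count by its maximum count across the references
        max_ref = Counter()
        for ref in refs:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in resp_counts.items())
        total = max(sum(resp_counts.values()), 1)
        # Simplified add-one smoothing in the spirit of Lin et al. 2004
        p = (clipped + 1) / (total + 1) if smooth else clipped / total
        if p == 0:
            return 0.0
        log_precisions.append(log(p))
    # Brevity penalty: penalize responses shorter than the closest reference
    ref_len = min((abs(len(r) - len(resp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(resp) > ref_len else exp(1 - ref_len / max(len(resp), 1))
    return bp * exp(sum(log_precisions) / max_order)
```

A response identical to its reference scores 1.0; a response sharing no n-grams with any reference scores 0.0.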
sacre_bleu
A hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official Workshop on Machine Translation (WMT) scores but works with plain text.
- `response_col` (pd.Series): [Required] Pandas Series containing translations (as strings) to score
- `references_col` (pd.Series): [Required] Pandas Series containing a reference, or list of several references, per prediction. Note: there must be the same number of references for each prediction (i.e. all sub-lists must be of the same length)
- `smooth_method` (str): [Optional] The smoothing method to use. Defaults to 'exp'. Possible values are:
  - 'none': no smoothing
  - 'floor': increment zero counts
  - 'add-k': increment num/denom by k for n>1
  - 'exp': exponential decay
- `smooth_value` (float): [Optional] The smoothing value. Defaults to None; with smooth_method='floor', smooth_value defaults to 0.1, and with smooth_method='add-k', it defaults to 1
- `lowercase` (bool): [Optional] If True, lowercases the input, enabling case-insensitivity. Defaults to False
- `force` (bool): [Optional] If True, insists that your tokenized input is actually de-tokenized. Defaults to False
- `use_effective_order` (bool): [Optional] If True, stops including n-gram orders for which precision is 0. This should be True if sentence-level BLEU will be computed. Defaults to False
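To illustrate what the 'floor' and 'add-k' options do to a single n-gram precision, here is a simplified stdlib-only sketch; the actual sacreBLEU smoothing logic is more involved (e.g. 'exp' uses a running exponential decay), so treat this purely as intuition.

```python
def smooth_ngram_precision(num, den, n, method="none", value=None):
    """Simplified sketch of 'floor' and 'add-k' smoothing applied to one
    n-gram precision fraction num/den (illustrative, not sacreBLEU's code)."""
    if method == "floor":
        k = 0.1 if value is None else value  # documented default for 'floor'
        num = num if num > 0 else k          # increment zero counts
    elif method == "add-k" and n > 1:
        k = 1 if value is None else value    # documented default for 'add-k'
        num, den = num + k, den + k          # increment num/denom for n > 1
    return num / den
```

With no matches (num = 0), 'none' yields a hard zero that would zero out the whole geometric mean, while 'floor' and 'add-k' keep the precision small but positive.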
google_bleu
BLEU is typically used as a corpus-level measure and has limitations when applied to single sentences. To overcome this in RL experiments, a sentence-level variation called the GLEU (Google-BLEU) score was introduced. The GLEU score is the minimum of recall and precision.
- `response_col` (pd.Series): [Required] Pandas Series containing translations (as strings) to score
- `references_col` (pd.Series): [Required] Pandas Series containing a reference, or list of several references, for each translation
- `min_len` (int): [Optional] The minimum order of n-gram this function should extract. Defaults to 1
- `max_len` (int): [Optional] The maximum order of n-gram this function should extract. Defaults to 4
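The "minimum of recall and precision" idea is easy to see in a stdlib-only sketch for a single response/reference pair (the SDK operates on whole Series; this is illustrative only):

```python
from collections import Counter

def all_ngrams(tokens, min_len=1, max_len=4):
    """Count every n-gram of order min_len..max_len."""
    grams = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            grams[tuple(tokens[i:i + n])] += 1
    return grams

def toy_google_bleu(response, reference, min_len=1, max_len=4):
    """GLEU sketch for one pair: min of n-gram precision and recall."""
    resp = all_ngrams(response.split(), min_len, max_len)
    ref = all_ngrams(reference.split(), min_len, max_len)
    overlap = sum((resp & ref).values())  # matching n-grams, with counts
    precision = overlap / max(sum(resp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return min(precision, recall)
```

Taking the minimum means a response can neither pad itself with extra text (hurting precision) nor drop reference content (hurting recall) without being penalized.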
rouge
A software package and a set of metrics commonly used to evaluate machine translation and automatic summarization software in natural language processing. These metrics involve comparing a machine-produced summary or translation with a reference or set of references that have been human-produced.
- `response_col` (pd.Series): [Required] Pandas Series containing predictions (as strings) to score
- `references_col` (pd.Series): [Required] Pandas Series containing a reference, or list of several references, per prediction
- `rouge_types` (List[str]): [Optional] A list of ROUGE types to calculate. Defaults to ['rougeL']. Valid ROUGE types:
  - 'rouge1': unigram (1-gram) based scoring
  - 'rouge2': bigram (2-gram) based scoring
  - 'rougeL': longest common subsequence based scoring
  - 'rougeLsum': splits text using '\n'
- `use_stemmer` (bool): [Optional] If True, uses Porter stemmer to strip word suffixes. Defaults to False
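The 'rougeL' variant is built on the longest common subsequence (LCS) between prediction and reference. A stdlib-only sketch of a ROUGE-L F1 score for a single pair (illustrative, not the SDK's implementation):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def toy_rouge_l(response, reference):
    """ROUGE-L F1 sketch: harmonic mean of LCS-based precision and recall."""
    resp, ref = response.split(), reference.split()
    lcs = lcs_len(resp, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(resp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the LCS respects word order but allows gaps, "the cat sat" vs. "the big cat sat" still matches on the full subsequence "the cat sat".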
meteor
An automatic metric typically used to evaluate machine translation, which is based on a generalized concept of unigram matching between the machine-produced translation and the reference human-produced translations.
- `response_col` (pd.Series): [Required] Pandas Series containing predictions (as strings) to score
- `references_col` (pd.Series): [Required] Pandas Series containing a reference, or list of several references, per prediction
- `alpha` (float): [Optional] Parameter controlling the relative weights of precision and recall. Defaults to 0.9
- `beta` (float): [Optional] Parameter controlling the shape of the penalty as a function of fragmentation. Defaults to 3
- `gamma` (float): [Optional] The relative weight assigned to the fragmentation penalty. Defaults to 0.5
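To show how `alpha`, `beta`, and `gamma` interact, here is a sketch of the standard METEOR combination step, assuming the unigram matching (precision, recall, match count, and chunk count) has already been done; this mirrors the textbook formula, not necessarily the SDK's exact code:

```python
def meteor_combine(precision, recall, chunks, matches,
                   alpha=0.9, beta=3, gamma=0.5):
    """METEOR combination sketch: a weighted harmonic mean of precision
    and recall, discounted by a fragmentation penalty (illustrative)."""
    if matches == 0:
        return 0.0
    # alpha sets the precision/recall balance in the harmonic mean
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # fragmentation = chunks / matches; beta shapes and gamma weights the penalty
    penalty = gamma * (chunks / matches) ** beta
    return fmean * (1 - penalty)
```

A perfect match (precision = recall = 1) that forms a single contiguous chunk still loses a small fragmentation penalty, so the score approaches 1 but does not reach it as long as `gamma` is nonzero.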