NLP_Metrics
Install the extra dependencies needed to compute LLM evaluation metrics in the SDK:
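A minimal install sketch, assuming the extra is named `NLP_Metrics` (matching this page's title):

```bash
pip install 'arize[NLP_Metrics]'
```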
Then import the metrics you need from `arize.pandas.generative.nlp_metrics`:
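For example, the metrics covered below can be imported directly:

```python
from arize.pandas.generative.nlp_metrics import (
    bleu,
    google_bleu,
    meteor,
    rouge,
    sacre_bleu,
)
```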
Metrics
bleu
The BLEU score is typically used to evaluate the quality of text machine-translated from one natural language to another. BLEU scores individual translated segments, typically sentences, by comparing them to a set of high-quality reference translations, then averages these scores over the entire corpus to estimate the overall quality of the translation.
Code Example
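A minimal sketch, assuming `bleu` accepts a pandas Series of responses and a Series of reference lists via `response_col` and `references_col` (parameter and column names here are illustrative):

```python
import pandas as pd
from arize.pandas.generative.nlp_metrics import bleu

df = pd.DataFrame(
    {
        "response": ["the cat sat on the mat"],
        # Each row holds a list of one or more reference translations
        "references": [["the cat is on the mat"]],
    }
)

# One BLEU score per row, stored alongside the data
df["bleu_score"] = bleu(
    response_col=df["response"],
    references_col=df["references"],
)
```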
sacre_bleu
SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich's multi-bleu-detok.perl, it produces the official Workshop on Machine Translation (WMT) scores but works with plain text.
Code Example
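The same pattern applies, assuming `sacre_bleu` shares the `response_col`/`references_col` interface used above:

```python
import pandas as pd
from arize.pandas.generative.nlp_metrics import sacre_bleu

df = pd.DataFrame(
    {
        "response": ["the cat sat on the mat"],
        "references": [["the cat is on the mat"]],
    }
)

# SacreBLEU operates on plain, untokenized text
df["sacre_bleu_score"] = sacre_bleu(
    response_col=df["response"],
    references_col=df["references"],
)
```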
google_bleu
The BLEU score is designed as a corpus-level measure and has limitations when applied to single sentences. To overcome this in reinforcement learning (RL) experiments, a sentence-level variant called the GLEU (Google-BLEU) score was introduced. The GLEU score is the minimum of n-gram recall and precision between the candidate and its references.
Code Example
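A sketch under the same assumed interface; GLEU is well suited to per-row (sentence-level) scoring:

```python
import pandas as pd
from arize.pandas.generative.nlp_metrics import google_bleu

df = pd.DataFrame(
    {
        "response": ["the cat sat on the mat"],
        "references": [["the cat is on the mat"]],
    }
)

# Sentence-level GLEU: the minimum of n-gram recall and precision
df["google_bleu_score"] = google_bleu(
    response_col=df["response"],
    references_col=df["references"],
)
```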
rouge
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics, and an accompanying software package, commonly used to evaluate machine translation and automatic summarization in natural language processing. The metrics compare a machine-produced summary or translation against one or more human-produced references.
Code Example
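A sketch assuming `rouge` takes the same column arguments plus a `rouge_types` selector, and returns one series of scores per requested variant (both are assumptions):

```python
import pandas as pd
from arize.pandas.generative.nlp_metrics import rouge

df = pd.DataFrame(
    {
        "summary": ["the cat sat on the mat"],
        "reference_summary": [["the cat is on the mat"]],
    }
)

# rouge_types is assumed to select variants such as rouge1, rouge2, rougeL
rouge_scores = rouge(
    response_col=df["summary"],
    references_col=df["reference_summary"],
    rouge_types=["rougeL"],
)
df["rougeL_score"] = rouge_scores["rougeL"]
```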
meteor
METEOR is an automatic metric typically used to evaluate machine translation. It is based on a generalized concept of unigram matching between the machine-produced translation and the human-produced reference translations.
Code Example
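A final sketch under the same assumed `response_col`/`references_col` interface:

```python
import pandas as pd
from arize.pandas.generative.nlp_metrics import meteor

df = pd.DataFrame(
    {
        "response": ["the cat sat on the mat"],
        "references": [["the cat is on the mat"]],
    }
)

# METEOR scores unigram matches between the response and its references
df["meteor_score"] = meteor(
    response_col=df["response"],
    references_col=df["references"],
)
```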