LLM Evaluation

Install the extra dependencies required to compute LLM evaluation metrics in the SDK:

```python
!pip install arize[LLM_Evaluation]  # minimum required for LLM Evaluation
```

Then import the metrics from `arize.pandas.generative.llm_evaluation`.
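For example, the metric functions used on this page can be imported together (names taken from the examples below):

```python
from arize.pandas.generative.llm_evaluation import (
    bleu,
    sacre_bleu,
    google_bleu,
    rouge,
    meteor,
)
```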
BLEU (Bilingual Evaluation Understudy) is typically used to evaluate the quality of text machine-translated from one natural language to another. BLEU scores individual translated segments, typically sentences, by comparing them to a set of high-quality reference translations. These segment scores are then averaged over the entire corpus to estimate the overall quality of the translation.
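As a reference, the standard corpus-level BLEU formulation combines modified n-gram precisions up to order N (the `max_order` argument below) with a brevity penalty:

$$
\text{BLEU} = \min\!\left(1,\ e^{\,1 - r/c}\right) \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)
$$

where $p_n$ is the modified n-gram precision, $w_n = 1/N$ are uniform weights, $c$ is the total candidate length, and $r$ is the total reference length.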
| Argument | Expected Type | Description |
| --- | --- | --- |
| response_col | pd.Series | [Required] Pandas Series containing translations (as strings) to score |
| references_col | pd.Series | [Required] Pandas Series containing a reference, or list of several references, per prediction |
| max_order | int | [Optional] Maximum n-gram order to use when computing the BLEU score. Defaults to 4 |
| smooth | bool | [Optional] Whether or not to apply Lin et al. 2004 smoothing. Defaults to False |
```python
bleu_scores = bleu(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    max_order=5,  # optional field
    smooth=True,  # optional field
)
```
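For illustration, a minimal, hypothetical way to shape the input columns: `references_col` holds a list of one or more reference strings per row, as described in the table above (column names match the examples on this page).

```python
import pandas as pd

from arize.pandas.generative.llm_evaluation import bleu

# Hypothetical mini-dataset: each row holds a generated summary and a list of
# one or more reference summaries
df = pd.DataFrame(
    {
        "summary": [
            "the cat sat on the mat",
            "a quick brown fox jumps over the lazy dog",
        ],
        "reference_summary": [
            ["the cat is sitting on the mat"],               # a single reference
            [
                "a fast brown fox jumps over a lazy dog",     # several references
                "the quick brown fox jumped over the lazy dog",
            ],
        ],
    }
)

bleu_scores = bleu(
    response_col=df["summary"],
    references_col=df["reference_summary"],
)
```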
SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich's multi-bleu-detok.perl, it produces the official Workshop on Machine Translation (WMT) scores but works with plain text.
| Argument | Expected Type | Description |
| --- | --- | --- |
| response_col | pd.Series | [Required] Pandas Series containing translations (as strings) to score |
| references_col | pd.Series | [Required] Pandas Series containing a reference, or list of several references, per prediction. Note: there must be the same number of references for each prediction (i.e., all sub-lists must be of the same length) |
| smooth_method | str | [Optional] The smoothing method to use. Defaults to 'exp'. Possible values: 'none' (no smoothing), 'floor' (increment zero counts), 'add-k' (increment num/denom by k for n>1), 'exp' (exponential decay) |
| smooth_value | float | [Optional] The smoothing value. Defaults to None. With smooth_method='floor', smooth_value defaults to 0.1; with smooth_method='add-k', smooth_value defaults to 1 |
| lowercase | bool | [Optional] If True, lowercases the input, enabling case-insensitivity. Defaults to False |
| force | bool | [Optional] If True, insists that your tokenized input is actually detokenized. Defaults to False |
| use_effective_order | bool | [Optional] If True, stops including n-gram orders for which precision is 0. This should be True if sentence-level BLEU will be computed. Defaults to False |
```python
sacre_bleu_scores = sacre_bleu(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    smooth_method='floor',  # optional field
    smooth_value=0.1,       # defaults to 0.1 when smooth_method is 'floor'
    lowercase=True,         # optional field
)
```
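If each row should be scored as an individual sentence rather than as part of a corpus, the use_effective_order note in the table above applies; a minimal sketch:

```python
# Sketch: sentence-level SacreBLEU, following the use_effective_order note above
sentence_level_scores = sacre_bleu(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    use_effective_order=True,  # recommended when computing sentence-level BLEU
)
```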
BLEU is typically used as a corpus-level measure and has some limitations when applied to single sentences. To overcome this issue in reinforcement learning (RL) experiments, a variation called the GLEU (Google BLEU) score was introduced. The GLEU score is the minimum of n-gram recall and precision.
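As a sketch of the standard definition, GLEU counts the n-grams (of orders between `min_len` and `max_len`, see the table below) shared by the candidate and the reference:

$$
\text{GLEU} = \min\!\left(\frac{\text{matching n-grams}}{\text{n-grams in candidate}},\ \frac{\text{matching n-grams}}{\text{n-grams in reference}}\right) = \min(\text{precision},\ \text{recall})
$$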
| Argument | Expected Type | Description |
| --- | --- | --- |
| response_col | pd.Series | [Required] Pandas Series containing translations (as strings) to score |
| references_col | pd.Series | [Required] Pandas Series containing a reference, or list of several references, for each translation |
| min_len | int | [Optional] The minimum order of n-gram this function should extract. Defaults to 1 |
| max_len | int | [Optional] The maximum order of n-gram this function should extract. Defaults to 4 |
```python
google_bleu_scores = google_bleu(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    min_len=2,  # optional field
    max_len=5,  # optional field
)
```
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a software package and set of metrics commonly used to evaluate machine translation and automatic summarization in natural language processing. These metrics compare a machine-produced summary or translation against a reference, or set of references, that have been human-produced.
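As a reference, the n-gram variants below (rouge1, rouge2) follow the recall-oriented overlap definition from Lin (2004), while rougeL replaces n-gram counts with the longest common subsequence; common implementations also report precision and an F-measure alongside recall:

$$
\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}
$$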
| Argument | Expected Type | Description |
| --- | --- | --- |
| response_col | pd.Series | [Required] Pandas Series containing predictions (as strings) to score |
| references_col | pd.Series | [Required] Pandas Series containing a reference, or list of several references, per prediction |
| rouge_types | List[str] | [Optional] A list of ROUGE types to calculate. Defaults to ['rougeL']. Valid ROUGE types: 'rouge1' (unigram, 1-gram, based scoring), 'rouge2' (bigram, 2-gram, based scoring), 'rougeL' (longest common subsequence based scoring), 'rougeLsum' (splits text using '\n') |
| use_stemmer | bool | [Optional] If True, uses the Porter stemmer to strip word suffixes. Defaults to False |
```python
rouge_scores = rouge(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    rouge_types=['rouge1', 'rouge2', 'rougeL', 'rougeLsum'],  # optional field
)
```
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is an automatic metric typically used to evaluate machine translation. It is based on a generalized concept of unigram matching between the machine-produced translation and the human-produced reference translations.
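As a reference, the alpha, beta, and gamma arguments below map onto the standard parameterized METEOR score: with unigram precision P and recall R,

$$
F_{mean} = \frac{P \cdot R}{\alpha P + (1 - \alpha) R}, \qquad
\text{Penalty} = \gamma \left(\frac{\text{chunks}}{\text{matches}}\right)^{\beta}, \qquad
\text{METEOR} = F_{mean} \cdot (1 - \text{Penalty})
$$

where chunks is the number of contiguous matched fragments and matches is the number of matched unigrams.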
| Argument | Expected Type | Description |
| --- | --- | --- |
| response_col | pd.Series | [Required] Pandas Series containing predictions (as strings) to score |
| references_col | pd.Series | [Required] Pandas Series containing a reference, or list of several references, per prediction |
| alpha | float | [Optional] Parameter for controlling the relative weights of precision and recall. Defaults to 0.9 |
| beta | float | [Optional] Parameter for controlling the shape of the penalty as a function of fragmentation. Defaults to 3 |
| gamma | float | [Optional] The relative weight assigned to the fragmentation penalty. Defaults to 0.5 |
```python
meteor_scores = meteor(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    alpha=0.8,  # optional field
    beta=4,     # optional field
    gamma=0.4,  # optional field
)
```
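One possible way to tie the examples together, assuming each metric function returns per-row scores aligned with the input Series (names reused from the examples above):

```python
# Sketch: attach each metric's per-row scores back onto the evaluation DataFrame
# for side-by-side comparison. Assumes the results align with df's rows.
df['bleu'] = list(bleu_scores)
df['sacre_bleu'] = list(sacre_bleu_scores)
df['google_bleu'] = list(google_bleu_scores)
df['meteor'] = list(meteor_scores)

print(df[['summary', 'bleu', 'sacre_bleu', 'google_bleu', 'meteor']].head())
```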