# LLM Evaluation

Install the extra dependencies needed to compute LLM evaluation metrics. The `LLM_Evaluation` extra is the minimum required for LLM evaluation; install it in the SDK:

```
!pip install arize[LLM_Evaluation]
```

Import the metrics from `arize.pandas.generative.llm_evaluation`.
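A minimal import sketch, assuming the metric functions used in the examples below (`bleu`, `sacre_bleu`, `google_bleu`, `rouge`, `meteor`) are all exposed by this module:

```python
from arize.pandas.generative.llm_evaluation import (
    bleu,
    sacre_bleu,
    google_bleu,
    rouge,
    meteor,
)
```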

## BLEU

BLEU score is typically used to evaluate the quality of machine-translated text from one natural language to another. BLEU calculates scores for individual translated segments, typically sentences, by comparing them to a set of high-quality reference translations. These scores are then averaged over the entire corpus to obtain an estimate of the overall quality of the translation.

Argument | Expected Type | Description |
---|---|---|
`response_col` | pd.Series | [Required] Pandas Series containing translations (as strings) to score |
`references_col` | pd.Series | [Required] Pandas Series containing a reference, or list of several references, per prediction |
`max_order` | int | [Optional] Maximum n-gram order to use when computing BLEU score. Defaults to `4` |
`smooth` | bool | [Optional] Whether or not to apply Lin et al. 2004 smoothing. Defaults to `False` |

```python
bleu_scores = bleu(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    max_order=5,  # optional field
    smooth=True  # optional field
)
```
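For context, here is a minimal sketch of the expected input, built from a hypothetical DataFrame with the `summary` and `reference_summary` columns used above; the result is assumed to contain one score per row, in line with the per-segment description of BLEU:

```python
import pandas as pd

# Hypothetical data: one candidate summary and one reference per row.
df = pd.DataFrame({
    "summary": [
        "the cat sat on the mat",
        "a quick brown fox jumps over the dog",
    ],
    "reference_summary": [
        "the cat is on the mat",
        "the quick brown fox jumps over the lazy dog",
    ],
})

scores = bleu(response_col=df["summary"], references_col=df["reference_summary"])
df["bleu_score"] = scores  # assumes one score is returned per row
```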

## SacreBLEU

A hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich's multi-bleu-detok.perl, it produces the official Workshop on Machine Translation (WMT) scores but works with plain text.

Argument | Expected Type | Description |
---|---|---|
`response_col` | pd.Series | [Required] Pandas Series containing translations (as strings) to score |
`references_col` | pd.Series | [Required] Pandas Series containing a reference, or list of several references, per prediction. Note: there must be the same number of references for each prediction (i.e., all sub-lists must be of the same length) |
`smooth_method` | str | [Optional] The smoothing method to use. Defaults to `'exp'`. Possible values: `'none'` (no smoothing), `'floor'` (increment zero counts), `'add-k'` (increment num/denom by k for n>1), `'exp'` (exponential decay) |
`smooth_value` | float | [Optional] The smoothing value. Defaults to `None`. With `smooth_method='floor'`, `smooth_value` defaults to `0.1`; with `smooth_method='add-k'`, it defaults to `1` |
`lowercase` | bool | [Optional] If True, lowercases the input, enabling case-insensitivity. Defaults to `False` |
`force` | bool | [Optional] If True, insists that your tokenized input is actually de-tokenized. Defaults to `False` |
`use_effective_order` | bool | [Optional] If True, stops including n-gram orders for which precision is 0. This should be True if sentence-level BLEU will be computed. Defaults to `False` |

```python
sacre_bleu_scores = sacre_bleu(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    smooth_method='floor',  # optional field
    smooth_value=0.1,  # defaults to 0.1 since smooth_method is 'floor'
    lowercase=True  # optional field
)
```
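A minimal sketch of the multiple-reference case, using hypothetical data; note that every row supplies the same number of references, as the `references_col` note above requires:

```python
import pandas as pd

df_multi = pd.DataFrame({
    "summary": [
        "the cat sat on the mat",
        "he reads a book every night",
    ],
    # Two references per prediction -- every sub-list has the same length.
    "reference_summary": [
        ["the cat is on the mat", "there is a cat on the mat"],
        ["he reads one book each night", "every night he reads a book"],
    ],
})

scores = sacre_bleu(
    response_col=df_multi["summary"],
    references_col=df_multi["reference_summary"],
)
```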

## Google BLEU (GLEU)

BLEU score is typically used as a corpus-level measure and has some limitations when applied to single sentences. To overcome this in reinforcement learning (RL) experiments, a variation called the GLEU score was introduced. The GLEU score is the minimum of n-gram recall and precision.

Argument | Expected Type | Description |
---|---|---|
`response_col` | pd.Series | [Required] Pandas Series containing translations (as strings) to score |
`references_col` | pd.Series | [Required] Pandas Series containing a reference, or list of several references, for each translation |
`min_len` | int | [Optional] The minimum order of n-gram this function should extract. Defaults to `1` |
`max_len` | int | [Optional] The maximum order of n-gram this function should extract. Defaults to `4` |

```python
google_bleu_scores = google_bleu(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    min_len=2,  # optional field
    max_len=5  # optional field
)
```
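As an illustration of the definition above (not the SDK's implementation), here is a sketch that computes a sentence-level GLEU score as the minimum of n-gram precision and recall; the whitespace tokenizer and example strings are simplifications:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def gleu_sentence(hypothesis, reference, min_len=1, max_len=4):
    hyp_tokens, ref_tokens = hypothesis.split(), reference.split()
    hyp_counts, ref_counts = Counter(), Counter()
    for n in range(min_len, max_len + 1):
        hyp_counts.update(ngrams(hyp_tokens, n))
        ref_counts.update(ngrams(ref_tokens, n))
    # Matching n-grams are counted with clipping (Counter intersection).
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    # GLEU is the minimum of n-gram precision and recall.
    return min(precision, recall)

print(gleu_sentence("the cat sat on the mat", "the cat is on the mat"))  # 0.5
```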

## ROUGE

A software package and a set of metrics commonly used to evaluate machine translation and automatic summarization software in natural language processing. These metrics involve comparing a machine-produced summary or translation with a reference or set of references that have been human-produced.

Argument | Expected Type | Description |
---|---|---|
`response_col` | pd.Series | [Required] Pandas Series containing predictions (as strings) to score |
`references_col` | pd.Series | [Required] Pandas Series containing a reference, or list of several references, per prediction |
`rouge_types` | List[str] | [Optional] A list of rouge types to calculate. Defaults to `['rougeL']`. Valid rouge types: `rouge1` (unigram, 1-gram, based scoring), `rouge2` (bigram, 2-gram, based scoring), `rougeL` (longest common subsequence based scoring), `rougeLsum` (splits text using `'\n'`) |
`use_stemmer` | bool | [Optional] If True, uses the Porter stemmer to strip word suffixes. Defaults to `False` |

```python
rouge_scores = rouge(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    rouge_types=['rouge1', 'rouge2', 'rougeL', 'rougeLsum']  # optional field
)
```
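A minimal sketch of scoring with `rougeLsum`, which splits text on `'\n'` before scoring; the data below is hypothetical and shows multi-sentence summaries joined with newlines:

```python
import pandas as pd

# Hypothetical multi-sentence summaries; sentences are joined with '\n'
# because rougeLsum splits text on newlines before scoring.
df_sum = pd.DataFrame({
    "summary": ["The model trains quickly.\nIt also generalizes well."],
    "reference_summary": ["The model is fast to train.\nIt generalizes well."],
})

rouge_lsum_scores = rouge(
    response_col=df_sum["summary"],
    references_col=df_sum["reference_summary"],
    rouge_types=["rougeLsum"],
    use_stemmer=True,  # strip word suffixes with the Porter stemmer
)
```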

## METEOR

An automatic metric typically used to evaluate machine translation, which is based on a generalized concept of unigram matching between the machine-produced translation and the reference human-produced translations.

Argument | Expected Type | Description |
---|---|---|
`response_col` | pd.Series | [Required] Pandas Series containing predictions (as strings) to score |
`references_col` | pd.Series | [Required] Pandas Series containing a reference, or list of several references, per prediction |
`alpha` | float | [Optional] Parameter for controlling relative weights of precision and recall. Defaults to `0.9` |
`beta` | float | [Optional] Parameter for controlling the shape of the penalty as a function of fragmentation. Defaults to `3` |
`gamma` | float | [Optional] The relative weight assigned to the fragmentation penalty. Defaults to `0.5` |

```python
meteor_scores = meteor(
    response_col=df['summary'],
    references_col=df['reference_summary'],
    alpha=0.8,  # optional field
    beta=4,  # optional field
    gamma=0.4  # optional field
)
```
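To illustrate how `alpha`, `beta`, and `gamma` interact, here is a sketch of the standard METEOR formulation (as in NLTK's implementation, which this description mirrors); the precision, recall, and fragmentation values are hypothetical inputs, not outputs of the SDK:

```python
def meteor_from_components(precision, recall, frag, alpha=0.9, beta=3.0, gamma=0.5):
    """Combine unigram precision/recall and a fragmentation fraction
    into a METEOR score, following the standard formulation."""
    if precision == 0 or recall == 0:
        return 0.0
    # alpha controls the relative weight of precision vs. recall.
    f_mean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)
    # beta shapes the penalty curve; gamma weights the fragmentation penalty.
    penalty = gamma * frag ** beta
    return (1 - penalty) * f_mean

# Hypothetical component values for a single candidate/reference pair.
print(meteor_from_components(precision=0.8, recall=0.7, frag=0.4))
```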
