Online: Code Based Evals
This feature is in closed Beta
Arize supports code-based evaluations for experiments. These can be written in Python and run either in code (offline) or in the platform (online).
Please reference the code-based evaluators here.
The above is the user interface configuration for an online code-based Eval. Arize supports sampling and filtering on a per-Eval-task basis, configurable on each task.
|  | Online Code Evals | Offline Code Evals |
| --- | --- | --- |
| Code Executes | In Platform Server | Python Client Side |
| Data Available | Every Span Attribute and Eval | Every Span Attribute and Eval |
| Created | UI or Python SDK | Python Code |
| Run Statistics | Eval Task Execution Statistics | N/A |
| Tracing | N/A | Supported Relative to Experiment |
| Python Libraries | Full Support of Publicly Accessible Libraries | Full Support of Pip-Accessible Libraries |
| Execution-Time Libraries | Libraries Pre-Downloaded in Requirements | Any Library |
| Library Version | Any Version | Any Version |
| Internet Content | No Network Access in Python | Any |
The online code-based Eval runs server-side as data is ingested. It runs in an isolated container that is preloaded with the libraries listed in its requirements. Any version of any code-based Eval library can be specified, since each container is pre-loaded with the specified libraries.
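For example, a container's requirements can pin the evaluation dependencies to exact versions. The libraries and versions below are purely illustrative, not a required set:

```
langdetect==1.0.9
nltk==3.9.1
```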
Online code Evals are supported using the same approach as code-based Evals for offline use. You can copy code from an Eval into the user interface or push it through the API.
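Below is a minimal sketch of such an evaluator. The import path and the exact evaluate signature are assumptions (adapt them to your Arize SDK version), and langdetect stands in for any third-party language-detection model.

```python
# Minimal sketch of an online code-based Eval. The import path and the
# evaluate() signature are assumptions; adjust them to your Arize SDK version.
from langdetect import detect  # third-party language-detection model

from arize.experimental.datasets.experiments.evaluators.base import (  # assumed path
    BaseArizeEvaluator,
    EvaluationResult,
)


class LanguageDetectionEvaluator(BaseArizeEvaluator):
    """Scores whether a span's output text is written in English."""

    def evaluate(self, *, output: str, **kwargs) -> EvaluationResult:
        detected = detect(output)  # returns an ISO 639-1 code such as "en"
        is_english = detected == "en"
        return EvaluationResult(
            score=1.0 if is_english else 0.0,
            label=detected,
        )
```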
The above is a code-based Eval using the BaseArizeEvaluator class. The evaluate method uses a third-party model for language detection.
All of the Code Evaluator types are supported; an evaluator can return a score, a label, and an EvaluationResult.
Online tasks run on incoming data, and understanding what has run on which data can be complicated. Arize provides detailed information on which Evals were applied to which specific incoming data.
The above shows examples of data that was run, skipped, or processed.