Online: Code Based Evals

This feature is in closed Beta

Arize supports code-based evaluations for experiments. These can be written in Python and run either in code (offline) or in the platform (online).

Please reference the code-based evaluators here.

The above is the user interface configuration for an online code-based Eval. Arize supports sampling and filtering on a per-Eval-task basis, configurable on the task.

| Feature | Online Eval | Offline Eval |
| --- | --- | --- |
| Code Executes | In Platform Server | Python Client Side |
| Data Available | Every Span Attribute and Eval | Every Span Attribute and Eval |
| Created | UI or Python SDK | Python Code |
| Run Statistics | Eval Task Execution Statistics | N/A |
| Tracing | N/A | Supported Relative to Experiment |
| Python Libraries | Full Support of Publicly Accessible Libraries | Full Support of Pip Accessible Libraries |
| Execution Time Libraries | Libraries Pre-Downloaded in Requirements | Any Library |
| Library Version | Any Version | Any Version |
| Internet Content | No Network Access in Python | Any |

The online code-based Eval runs server side as data is ingested. It runs in an isolated container that is preloaded with the libraries listed in the task's requirements. Any version of any code-based Eval library can be specified, since every container is pre-loaded with those specific libraries.

Writing an Online Code Eval

Online code Evals are written using the same approach as code-based Evals for offline use. You can copy the code of an existing Eval into the user interface or push it through the API.

from typing import Any, Dict

# BaseArizeEvaluator and EvaluationResult are provided by the Arize SDK.
# detect and language_name are third-party language-detection helpers (for
# example, a fastText-based detector returning {"lang": ..., "score": ...} and
# an ISO 639-1 name lookup) that must be pre-loaded in the task requirements.

class LanguageDetectionEvaluator(BaseArizeEvaluator):
  def __init__(self):
    # The model "facebook/nllb-200-distilled-600M" does not use the same language codes as ISO 639-1.
    # Taken from https://github.com/openlanguagedata/flores?tab=readme-ov-file#language-coverage
    self._threshold = 0.7
    self._translation_model = "facebook/nllb-200-distilled-600M"

  def filter(self, dataset_row: Dict[str, Any]) -> bool:
    # Flag rows that have no LLM input messages to evaluate
    if dataset_row["attributes.llm.input_messages"] is None:
      return True
    return False

  def evaluate(self, dataset_row: Dict[str, Any]) -> EvaluationResult:
    user_message = dataset_row["attributes.llm.input_messages"]
    # Detect the language of the first line of the user message
    prediction = detect(user_message.splitlines()[0])
    pred_language_iso, pred_confidence = (
        prediction.get("lang"),
        prediction.get("score"),
    )
    explanation = f"pred_language_iso: {pred_language_iso}, pred_confidence: {pred_confidence}"
    return EvaluationResult(
        label=language_name(pred_language_iso),
        explanation=explanation,
    )

The above is a code-based Eval using the BaseArizeEvaluator class. The evaluate method uses a third-party model for language detection.
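
The same evaluator class can also be exercised client side before it is registered as an online task. Below is a minimal sketch, assuming a hand-built row keyed by the same span attributes used above and that EvaluationResult exposes label and explanation fields:

# Illustrative only: build a row shaped like the span attributes above and
# run the evaluator locally to sanity-check its output.
evaluator = LanguageDetectionEvaluator()
row = {"attributes.llm.input_messages": "Bonjour, pouvez-vous m'aider ?"}
result = evaluator.evaluate(row)
print(result.label, result.explanation)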

All code Evaluator return types are supported: an evaluator can return a score, a label, or an EvaluationResult.

class LengthEvaluator(BaseArizeEvaluator):
  def filter(self, dataset_row: Dict[str, Any]) -> bool:
    # Flag rows that have no LLM output to measure
    if dataset_row["attributes.llm.output"] is None:
      return True
    return False

  def evaluate(self, dataset_row: Dict[str, Any]) -> float:
    # Return the character length of the LLM output as a numeric score
    output_message = dataset_row["attributes.llm.output"]
    return float(len(output_message))
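
An evaluator can also combine a numeric score with a label and an explanation in a single EvaluationResult. Below is a minimal sketch, assuming EvaluationResult accepts a score field alongside label and explanation, and using an arbitrary length threshold:

class VerbosityEvaluator(BaseArizeEvaluator):
  def filter(self, dataset_row: Dict[str, Any]) -> bool:
    # Flag rows that have no LLM output to measure
    return dataset_row["attributes.llm.output"] is None

  def evaluate(self, dataset_row: Dict[str, Any]) -> EvaluationResult:
    output_message = dataset_row["attributes.llm.output"]
    length = len(output_message)
    # Arbitrary illustrative threshold for flagging very long responses
    label = "verbose" if length > 1000 else "concise"
    return EvaluationResult(
        score=float(length),
        label=label,
        explanation=f"output length: {length} characters",
    )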

Troubleshooting Online Task Runs

Online tasks run on incoming data, and understanding what has run on which data can be complicated. Arize provides detailed information on which tasks were applied to which specific incoming data.

The above shows examples of data that was run, skipped, or processed.
