Create an Experiment Evaluator

Evaluator as a Function

The evaluator passed to run_experiment can be a callable function. The function declares only the parameters it needs, chosen from the following optional parameters (see the sketch after the table):

| Parameter name | Description | Example |
| --- | --- | --- |
| input | experiment run input | def eval(input): ... |
| output | experiment run output | def eval(output): ... |
| dataset_row | the entire data row, with every column available as a dictionary key | def eval(dataset_row): ... |
| metadata | experiment metadata | def eval(metadata): ... |
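
As a minimal illustration, the sketch below defines two hypothetical evaluators, each declaring only the arguments it needs. The attributes.expected column name is an assumption for illustration; substitute a column that actually exists in your dataset.

def output_not_empty(output):
    # Uses only the task output: score 1 when the task produced a non-empty string.
    return 1 if output and str(output).strip() else 0

def exact_match(output, dataset_row):
    # Compares the task output against an expected value stored in the dataset row.
    # "attributes.expected" is a hypothetical column name.
    return int(str(output).strip() == str(dataset_row["attributes.expected"]).strip())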

Define Function Evaluator and Run Experiment

def edit_distance(dataset_row, output):
    # Levenshtein edit distance between the input string and the task output.
    str1 = dataset_row['attributes.str1']  # Input used in the task
    str2 = output  # Output from the task
    # dp[i][j] = edit distance between str1[:i] and str2[:j]
    dp = [[i + j if i * j == 0 else 0 for j in range(len(str2) + 1)] for i in range(len(str1) + 1)]
    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            dp[i][j] = dp[i - 1][j - 1] if str1[i - 1] == str2[j - 1] else 1 + min(
                dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]
            )
    return dp[-1][-1]
    
experiment1 = arize_client.run_experiment(
    space_id=space_id,
    dataset_name=dataset_name,
    task=prompt_gen_task,
    evaluators=[edit_distance],
    experiment_name="test",
)
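
The evaluators argument is a list, so more than one function evaluator can presumably be attached to the same run. A minimal sketch, reusing the hypothetical output_not_empty helper from above:

experiment2 = arize_client.run_experiment(
    space_id=space_id,
    dataset_name=dataset_name,
    task=prompt_gen_task,
    evaluators=[edit_distance, output_not_empty],  # each evaluator runs against every row
    experiment_name="test-multiple-evaluators",    # hypothetical experiment name
)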

Evaluator as a Class

You can also run an experiment with an evaluator that inherits from the Evaluator(ABC) base class in the Arize Python SDK. The evaluator's evaluate method receives the task output and the dataset row and returns an EvaluationResult dataclass.

Refer to the following example:


import json

import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Evaluator and EvaluationResult come from the Arize Python SDK; the import
# path below is an assumption and may differ across SDK versions.
from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    Evaluator,
)


class Hallucination(Evaluator):
    annotator_kind = "CODE"
    name = "factual_hallucination"

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        expected_output = dataset_row['attributes.output.value']
        prompt_vars = json.loads(dataset_row['attributes.llm.prompt_template.variables'])
        query_str = prompt_vars['query_str']
        context_str = prompt_vars['context_str']

        # Run the hallucination eval on the task output.
        df_in = pd.DataFrame({"output": output, "input": query_str, "reference": context_str}, index=[0])
        rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())
        expect_df = llm_classify(
            dataframe=df_in,
            template=HALLUCINATION_PROMPT_TEMPLATE,
            model=OpenAIModel(model="gpt-4-turbo-preview"),
            rails=rails,
            provide_explanation=True,
            run_sync=True
        )
        label = expect_df['label'][0]
        score = 1 if label == rails[1] else 0  # Score 1 when the label matches rails[1] ("factual"), 0 otherwise
        explanation = expect_df['explanation'][0]
        return EvaluationResult(score=score, label=label, explanation=explanation)

    async def async_evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        return self.evaluate(output, dataset_row, **kwargs)
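
A class-based evaluator is wired into run_experiment the same way as a function evaluator. A minimal sketch, assuming the same client, dataset, and task as above, passes an instance of the class:

experiment3 = arize_client.run_experiment(
    space_id=space_id,
    dataset_name=dataset_name,
    task=prompt_gen_task,
    evaluators=[Hallucination()],          # pass an instance, not the class itself
    experiment_name="hallucination-eval",  # hypothetical experiment name
)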
