Offline Evaluations

Follow along in our Colab guide for offline evaluations.

Evaluations are essential to understanding how well your model is performing in real-world scenarios, allowing you to identify strengths, weaknesses, and areas of improvement.

Offline evaluations are run as code and then sent back to Arize using log_evaluations_sync.

This guide assumes you have traces in Arize and are looking to run an evaluation to measure your application performance.

To add evaluations you can set up online evaluations as a task to run automatically, or you can follow the steps below to generate evaluations and log them to Arize:

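Install Phoenix, the Arize SDK, and OpenAI

pip install -q "arize-phoenix>=4.29.0"
# the arize package provides arize.exporter and arize.pandas.logger used below
pip install -q arize
pip install -q openai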

Import your spans in code

Once you have traces in Arize, you can visit the LLM Tracing tab to see your traces and export them in code. Clicking the export button gives you boilerplate code that you can copy and paste into your evaluation script.

# This will be prefilled by the export command.
# Note: the export API key can differ from the API key used to log data to Arize.
ARIZE_API_KEY = ''

# import statements required for getting your spans
import os
os.environ['ARIZE_API_KEY'] = ARIZE_API_KEY
from datetime import datetime
from arize.exporter import ArizeExportClient 
from arize.utils.types import Environments

# Exporting your dataset into a dataframe
client = ArizeExportClient()
primary_df = client.export_model_to_df(
    space_id='', # this will be prefilled by export
    model_id='', # this will be prefilled by export
    environment=Environments.TRACING, 
    start_time=datetime.fromisoformat(''), # this will be prefilled by export 
    end_time=datetime.fromisoformat(''), # this will be prefilled by export
)
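The start_time and end_time placeholders above are normally prefilled by the export button. If you build the window yourself, a minimal sketch using only the standard library could look like the following. The export_window helper and the example timestamp are hypothetical, and whether Arize expects timezone-aware values is not stated in this guide.

```python
from datetime import datetime, timedelta

def export_window(end_iso: str, days: int):
    """Return (start_time, end_time) covering `days` days up to `end_iso`."""
    end_time = datetime.fromisoformat(end_iso)
    start_time = end_time - timedelta(days=days)
    return start_time, end_time

# Hypothetical example: the seven days ending on 2024-01-08 (UTC)
start_time, end_time = export_window("2024-01-08T00:00:00+00:00", days=7)
print(start_time.isoformat())
print(end_time.isoformat())
```

The two values can then be passed as start_time and end_time to export_model_to_df.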

Run a custom evaluator using Phoenix

import os
from phoenix.evals import OpenAIModel, llm_classify

Ensure your OpenAI API key is set up correctly for your OpenAI model.

api_key = os.environ.get("OPENAI_API_KEY")
eval_model = OpenAIModel(
    model="gpt-4o", temperature=0, api_key=api_key
)

Create a prompt template for the LLM to judge the quality of your responses. You can utilize any of the Arize Evaluator Templates or you can create your own. Below is an example which judges the positivity or negativity of the LLM output.

MY_CUSTOM_TEMPLATE = '''
    You are evaluating the positivity or negativity of the responses to questions.
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Response]: {output}
    [END DATA]


    Please focus on the tone of the response.
    Your answer must be a single word, either "positive" or "negative"
    '''

Notice the variables in curly braces, {input} and {output}, above. Each one must match a column name in the dataframe so that the template can be filled in for every row. We use OpenInference, a set of conventions complementary to OpenTelemetry, to trace AI applications, so the span attributes available depend on the provider you are using.
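To see which placeholders a template expects before mapping dataframe columns, Python's standard string.Formatter can list them. This is a small sketch: template_variables and EXAMPLE_TEMPLATE are hypothetical names used for illustration and are not part of Phoenix.

```python
from string import Formatter

def template_variables(template: str) -> list:
    """Return the unique placeholder names in a str.format-style template, in order."""
    names = []
    for _, field_name, _, _ in Formatter().parse(template):
        # field_name is None for trailing literal text with no placeholder
        if field_name and field_name not in names:
            names.append(field_name)
    return names

# Shortened stand-in for the custom template above
EXAMPLE_TEMPLATE = "[Question]: {input}\n[Response]: {output}"
variables = template_variables(EXAMPLE_TEMPLATE)
print(variables)
```

Each name returned here needs a matching column in the dataframe you pass to the evaluator.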

You can use the code below to check which attributes are in the traces in your dataframe.

primary_df.columns

Use the code below to set the input and output variables needed for the prompt above.

primary_df["input"] = primary_df["attributes.input.value"]
primary_df["output"] = primary_df["attributes.output.value"]
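Because attribute names vary by provider, it can help to pick the first column that actually exists rather than hard-coding one. The sketch below works on plain column names; pick_column is a hypothetical helper, and the fallback name attributes.llm.input_messages is only an illustrative guess that may not appear in your export.

```python
def pick_column(columns, candidates):
    """Return the first candidate name present in columns, or raise KeyError."""
    available = set(columns)
    for name in candidates:
        if name in available:
            return name
    raise KeyError(f"none of {list(candidates)} found in columns")

# Stand-in for list(primary_df.columns)
example_columns = [
    "context.span_id",
    "attributes.input.value",
    "attributes.output.value",
]
input_col = pick_column(
    example_columns,
    ["attributes.input.value", "attributes.llm.input_messages"],  # second name is hypothetical
)
print(input_col)
```

With a real export, the result could be used as primary_df["input"] = primary_df[pick_column(primary_df.columns, [...])].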

Use the llm_classify function to run the evaluation with your custom template. Pass in the dataframe of traces you exported above.

evals_df = llm_classify(
    dataframe=primary_df,
    template=MY_CUSTOM_TEMPLATE,
    model=eval_model,
    rails=["positive", "negative"]
)
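Before logging, it is worth checking how the labels are distributed and whether any fall outside the rails. The sketch below assumes the result dataframe has a label column (verify with evals_df.columns) and uses a plain list as a stand-in; label_distribution is a hypothetical helper.

```python
from collections import Counter

def label_distribution(labels, rails):
    """Count labels per rail and collect any labels outside the allowed rails."""
    counts = Counter(labels)
    expected = {rail: counts.get(rail, 0) for rail in rails}
    unexpected = {label: n for label, n in counts.items() if label not in rails}
    return expected, unexpected

# Stand-in for evals_df["label"].tolist(); None represents a missing label
example_labels = ["positive", "negative", "positive", None]
expected, unexpected = label_distribution(example_labels, ["positive", "negative"])
print(expected)
print(unexpected)
```

A non-empty second result suggests some rows were not classified into one of the rails and may need to be rerun or filtered out.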

If you'd like more information, see our detailed guide on custom evaluators. You can also use our pre-tested evaluators for evaluating hallucination, toxicity, retrieval, etc.

Log evaluations back to Arize

Use the log_evaluations_sync function in our Python SDK to attach the evaluations you've run to traces. The code below assumes that you have already completed an evaluation run and have the resulting evals_df dataframe. It also assumes you still have the exported traces dataframe (primary_df above), which supplies the span_id needed to attach each evaluation to its span.

The evals dataframe requires the four columns listed below, which should be auto-generated for you based on the evaluation you ran using Phoenix. The <eval_name> must be alphanumeric and cannot contain hyphens or spaces.

  • eval.<eval_name>.label

  • eval.<eval_name>.score

  • eval.<eval_name>.explanation

  • context.span_id
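If your evaluation output uses plain column names such as label and explanation instead (an assumption about what you may see; check evals_df.columns), a small helper can rename them to the required pattern. This is a sketch: to_arize_eval_columns and the eval name tone are hypothetical.

```python
import pandas as pd

def to_arize_eval_columns(results: pd.DataFrame, eval_name: str) -> pd.DataFrame:
    """Copy label/score/explanation columns into eval.<eval_name>.<field> columns."""
    out = pd.DataFrame(index=results.index)
    for field in ("label", "score", "explanation"):
        # Only copy the fields that are actually present in the results
        if field in results.columns:
            out[f"eval.{eval_name}.{field}"] = results[field]
    return out

# Stand-in for an evaluation result with plain column names
example_results = pd.DataFrame(
    {"label": ["positive"], "explanation": ["Upbeat tone."]}
)
arize_evals = to_arize_eval_columns(example_results, "tone")
print(list(arize_evals.columns))
```

The index is preserved, so the span IDs can still be attached afterwards.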

An example evaluation data dictionary would look like:

import pandas as pd

evaluation_data = {
   'context.span_id': ['74bdfb83-a40e-4351-9f41-19349e272ae9'],  # Use your span_id
   'eval.myeval.label': ['accuracy'],  # Example label
   'eval.myeval.score': [0.95],        # Example score
   'eval.myeval.explanation': ["some explanation"]  # Example explanation
}
evaluation_df = pd.DataFrame(evaluation_data)
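The naming rules above can be checked before sending anything. The sketch below validates a list of column names with a regular expression; check_eval_columns is a hypothetical helper, not part of the Arize SDK, and it encodes only the rules stated in this guide.

```python
import re

# eval.<eval_name>.<field>, where <eval_name> is alphanumeric only
EVAL_COLUMN = re.compile(r"^eval\.([A-Za-z0-9]+)\.(label|score|explanation)$")

def check_eval_columns(columns):
    """Return a list of problems found in the column names; empty means OK."""
    problems = []
    if "context.span_id" not in columns:
        problems.append("missing context.span_id")
    for name in columns:
        if name.startswith("eval.") and not EVAL_COLUMN.match(name):
            problems.append(f"bad eval column name: {name}")
    return problems

ok = check_eval_columns(["context.span_id", "eval.myeval.label", "eval.myeval.score"])
bad = check_eval_columns(["eval.my-eval.label"])
print(ok)
print(bad)
```

For a dataframe, pass list(evaluation_df.columns). If the span ID is stored in the index rather than a column, include the index name in the list as well.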

Here is sample code to log your evaluation results to the Arize platform.

import os
from arize.pandas.logger import Client

API_KEY = os.environ.get("ARIZE_API_KEY")
SPACE_ID = os.environ.get("ARIZE_SPACE_ID")
DEVELOPER_KEY = os.environ.get('ARIZE_DEVELOPER_KEY')

# Initialize the Arize client with your space ID, API key, and developer key
arize_client = Client(
    space_id=SPACE_ID, 
    api_key=API_KEY,
    developer_key=DEVELOPER_KEY
)

# Set the evals_df to have the correct span ID to log it to Arize
evals_df = evals_df.set_index(primary_df["context.span_id"])

# send the eval to Arize
arize_client.log_evaluations_sync(evals_df, 'YOUR_PROJECT_NAME')

Copyright © 2023 Arize AI, Inc