Quickstart: Evaluation
This guide assumes you have traces in Arize and are looking to run an evaluation to measure your application's performance. If you want to learn more about how evaluation works, read our dedicated guide.
Here's how you add LLM evaluations:
Once you have traces in Arize, visit the LLM Tracing tab to see your traces and export them in code. Clicking the export button gives you boilerplate code that you can copy and paste into your evaluator.
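The exported boilerplate typically looks something like the sketch below; the space ID, model ID, and time range are placeholders that the export button fills in with your actual values:

```python
from datetime import datetime

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

client = ArizeExportClient()

# Pull the traces for your project into a pandas dataframe.
primary_df = client.export_model_to_df(
    space_id="YOUR_SPACE_ID",   # placeholder -- filled in by the export button
    model_id="YOUR_MODEL_ID",   # placeholder -- your tracing project
    environment=Environments.TRACING,
    start_time=datetime(2024, 1, 1),
    end_time=datetime(2024, 1, 7),
)
```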
Import the functions from our Phoenix library to run a custom evaluation using OpenAI.
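Assuming the Phoenix evals package is installed (for example via `pip install arize-phoenix-evals`), the imports are:

```python
from phoenix.evals import OpenAIModel, llm_classify
```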
Ensure you have your OpenAI API key set up correctly for your OpenAI model.
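For example, you can read the key from your environment, prompting for it if it isn't set:

```python
import os
from getpass import getpass

# The OpenAI client used by the evaluator reads OPENAI_API_KEY from the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```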
Create a prompt template for the LLM to judge the quality of your responses. Below is an example which judges the positivity or negativity of the LLM output.
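A template along these lines works; the wording and the "positive"/"negative" labels are just one example and can be adapted to whatever quality you want to judge:

```python
MY_CUSTOM_TEMPLATE = """
You are evaluating the positivity or negativity of the responses to questions.
[BEGIN DATA]
************
[Question]: {input}
************
[Response]: {output}
[END DATA]

Please focus on the tone of the response.
Your answer must be a single word, either "positive" or "negative".
"""
```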
Notice the variables in brackets for {input} and {output} above. You will need to set those variables appropriately for the dataframe so you can run your custom template. We use OpenInference as a set of conventions (complementary to OpenTelemetry) to trace AI applications, which means the attributes on your traces will differ depending on the provider you are using.

You can use the code below to check which attributes are present in the traces in your dataframe.
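Assuming the exported dataframe from the first step is named `primary_df`, you can list its columns:

```python
# Each column corresponds to a span attribute captured at instrumentation time.
print(primary_df.columns.tolist())
```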
Use the code below to set the input and output variables needed for the prompt above.
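A minimal sketch, assuming your traces use the standard OpenInference attribute names (`attributes.input.value` and `attributes.output.value`); substitute whichever columns the previous step shows for your provider:

```python
# Map the trace attributes onto the {input} and {output} template variables.
primary_df["input"] = primary_df["attributes.input.value"]
primary_df["output"] = primary_df["attributes.output.value"]
```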
Use the llm_classify function to run the evaluation using your custom template. You will be using the dataframe from the traces you exported above.
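Putting it together, a sketch of the call might look like the following; the model name, the `rails` values, and the `provide_explanation` flag are choices you can adjust (keep the rails in sync with the labels your template asks for):

```python
# Constrain the judge's answer to the labels used in the template.
rails = ["positive", "negative"]

# Depending on your phoenix version, this parameter may be `model_name` instead of `model`.
eval_model = OpenAIModel(model="gpt-4o-mini", temperature=0.0)

evals_df = llm_classify(
    dataframe=primary_df,
    template=MY_CUSTOM_TEMPLATE,
    model=eval_model,
    rails=rails,
    provide_explanation=True,  # also return the judge's reasoning for each row
)
```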
If you'd like more information, see our detailed guide on custom evaluators. You can also use our pre-built evaluators for hallucination, toxicity, retrieval, and more.
Export the evals you generated above to Arize using the log_evaluations function in our Python SDK. See our article on logging evaluations for more details.

Note: evaluations are currently logged within Arize every 24 hours, and we're working on making them as close to instant as possible. Reach out to support@arize.com if you're having trouble here.
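A minimal sketch of the logging step, assuming a recent version of the Arize Python SDK and an eval named `sentiment` (a hypothetical name); the exact dataframe format expected by `log_evaluations` can vary by SDK version, so confirm it against the article above:

```python
from arize.pandas.logger import Client

# Placeholders -- use your own Arize space ID, API key, and model/project ID.
arize_client = Client(space_id="YOUR_SPACE_ID", api_key="YOUR_ARIZE_API_KEY")

# Arize matches each eval row to a span by its span ID and expects eval columns
# named eval.<name>.label / eval.<name>.explanation.
evals_df = evals_df.rename(
    columns={
        "label": "eval.sentiment.label",
        "explanation": "eval.sentiment.explanation",
    }
)
evals_df = evals_df.set_index(primary_df["context.span_id"])

arize_client.log_evaluations(dataframe=evals_df, model_id="YOUR_MODEL_ID")
```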