Arize allows you to customize your evaluation using your own criteria and prompt templates. You can do so by using our evals functions in our Phoenix library.
The llm_classify function is designed for categorical output, which can be binary or multi-class. It ensures the output is clean and is exactly one of the classes you define, or UNPARSABLE.
A binary template looks like the following, with only two values, "irrelevant" and "relevant", expected from the LLM output:
```python
CATEGORICAL_TEMPLATE = '''You are comparing a reference text to a question and trying to determine if the reference
text contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {query}
    ************
    [Reference text]: {reference}
    [END DATA]
Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be single word, either "relevant" or "irrelevant",
and should not contain any text or characters aside from that word.
"irrelevant" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question.'''
```
The categorical template defines the expected output of the LLM, and the rails define the classes the output will be held to:
irrelevant
relevant
```python
# The rails is used to hold the output to specific values based on the template
# It will remove text such as ",,," or "..."
# Will ensure the binary value expected from the template is returned
# MultiClass would be rails = ["irrelevant", "relevant", "semi-relevant"]
rails = ["irrelevant", "relevant"]
relevance_classifications = llm_classify(
    dataframe=df,
    template=CATEGORICAL_TEMPLATE,
    model=model,
    rails=rails,
)
```
llm_classify uses a snap_to_rails function that searches the LLM output string for the classes in the classification list. It handles cases where no class is present, where both classes are present, or where one class is a substring of the other, such as "irrelevant" and "relevant".
```python
# Rails examples

# Removes extra information and maps to class
llm_output_string = "The answer is relevant...!"   # -> "relevant"

# Removes "." and capitalization from LLM output and maps to class
llm_output_string = "Irrelevant."                  # -> "irrelevant"

# No class in response
llm_output_string = "I am not sure!"               # -> "UNPARSABLE"

# Both classes in response
llm_output_string = "The answer is relevant i think, or maybe irrelevant...!"  # -> "UNPARSABLE"
```
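To make the snapping behavior concrete, here is an illustrative sketch of the kind of normalization snap_to_rails performs. This is not the library's actual implementation; the function name and the whole-word matching rule are assumptions for illustration only.

```python
import re

def snap_to_rails_sketch(output: str, rails: list) -> str:
    # Illustrative only -- not Phoenix's actual snap_to_rails implementation.
    # Find rail classes that appear as whole words, so "relevant" does not
    # accidentally match inside "irrelevant".
    text = output.lower()
    matches = [r for r in rails if re.search(rf"\b{re.escape(r)}\b", text)]
    # Snap only when exactly one class is found; otherwise the output is unparsable.
    return matches[0] if len(matches) == 1 else "UNPARSABLE"

rails = ["irrelevant", "relevant"]
print(snap_to_rails_sketch("The answer is relevant...!", rails))  # relevant
print(snap_to_rails_sketch("Irrelevant.", rails))                 # irrelevant
print(snap_to_rails_sketch("I am not sure!", rails))              # UNPARSABLE
```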
A common use case is mapping the class to a 1 or 0 numeric value.
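For example, assuming the dataframe returned by llm_classify exposes the predicted class in a "label" column, the mapping could look like this:

```python
# Hedged sketch: map the predicted class label to a numeric 1/0 score.
# Assumes relevance_classifications has a "label" column holding the rail value.
relevance_classifications["score"] = relevance_classifications["label"].map(
    {"relevant": 1, "irrelevant": 0}  # UNPARSABLE rows become NaN
)
```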
LLM Generated Scores
The Phoenix library also supports numeric score Evals if you would like to use them. A template for a score Eval looks like the following:
SCORE_TEMPLATE = """
You are a helpful AI bot that checks for grammatical, spelling and typing errors
in a document context. You are going to return a continous score for the
document based on the percent of grammatical and typing errors. The score should be
between 10 and 1. A score of 1 will be no grammatical errors in any word,
a score of 2 will be 20% of words have errors, a 5 score will be 50% errors,
a score of 7 is 70%, and a 10 score will be all words in the context have a
grammatical errors.
The following is the document context.
#CONTEXT
{context}
#ENDCONTEXT
#QUESTION
Please return a score between 10 and 1.
You will return no other text or language besides the score. Only return the score.
Please return in a format that is "the score is: 10" or "the score is: 1"
"""
We use the more generic llm_generate function, which can handle almost any complex Eval that doesn't fit into the categorical type.
```python
import re

def find_score(output):
    # Regular expression pattern
    # It looks for 'score is', followed by any characters (.*?), and then a float or integer
    pattern = r"score is.*?([+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)"
    match = re.search(pattern, output, re.IGNORECASE)
    if match:
        # Extract and return the number
        return float(match.group(1))
    else:
        return None

def numeric_score_eval(output, row_index):
    # This is the function that will be called for each row of the dataframe
    # The corresponding dataframe row is available if needed
    row = df.iloc[row_index]
    score = find_score(output)
    return {"score": score}

test_results = llm_generate(
    dataframe=df,
    template=SCORE_TEMPLATE,
    model=model,
    verbose=True,
    # Callback function that will be called for each row of the dataframe
    output_parser=numeric_score_eval,
    # These two flags will add the prompt / response to the returned dataframe
    include_prompt=True,
    include_response=True,
)
```
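Once the run completes, the values returned by the output_parser callback should appear as columns on test_results; for instance, the "score" column produced above can be summarized directly (rows the regex could not parse come back as None/NaN):

```python
# Quick sanity check on the numeric scores returned by the parser above.
print(test_results["score"].describe())
print("unparsed rows:", test_results["score"].isna().sum())
```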
Custom code
If you would like to evaluate your application using your own code, you can do so by generating your own outputs and sending the data to Arize using the log_evaluations function in our Python SDK. See the example code below. The evals dataframe needs a column called context.span_id so that Arize knows which traces to attach the evaluations to. You can get the span IDs by using the Arize Data API.
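As a rough sketch, the evals dataframe could be assembled like this before logging. Only the context.span_id column is documented above; the evaluation column names shown here (eval.relevance.label, eval.relevance.score) are hypothetical placeholders, so match them to whatever your Arize setup expects:

```python
import pandas as pd

evals_df = pd.DataFrame(
    {
        # Span IDs retrieved via the Arize Data API
        "context.span_id": ["<span-id-1>", "<span-id-2>"],
        # Hypothetical evaluation columns produced by your own eval code
        "eval.relevance.label": ["relevant", "irrelevant"],
        "eval.relevance.score": [1, 0],
    }
)
```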
```python
from arize.pandas.logger import Client

arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)

# Use Arize client to log evaluations
response = arize_client.log_evaluations(
    dataframe=evals_df,
    model_id=model_id,
    model_version=model_version,
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print("✅ You have successfully logged evaluations to Arize")
```