Evaluate experiment with code

How to write the functions to evaluate your task outputs in experiments


If you're starting out, check out the datasets & experiments overview page.

Here's the simplest version of an evaluation function:

def is_true(output):
    # output is the task output
    return output == True

You can define a simple function to read the output of a task and check it.

Evaluation Inputs

The evaluator function can take the following optional arguments:

  • dataset_row: the entire row of the data, including every column as a dictionary key (e.g. def eval(dataset_row): ...)
  • input: the experiment run input, which is mapped to attributes.input.value (e.g. def eval(input): ...)
  • output: the experiment run output (e.g. def eval(output): ...)
  • dataset_output: the expected output, if available, mapped to attributes.output.value (e.g. def eval(dataset_output): ...)
  • metadata: the dataset_row metadata, which is mapped to attributes.metadata (e.g. def eval(metadata): ...)
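An evaluator only needs to declare the parameters it uses. As a minimal sketch (assuming your dataset has an expected-output column so that dataset_output is populated), an evaluator can combine the task output with the expected output:

def exact_match(output, dataset_output):
    # Compare the task output against the expected output from the dataset
    return output == dataset_output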

Evaluation Outputs

We support several types of evaluation outputs. Label must be a string. Score must range from 0.0 to 1.0. Explanation must be a string.

  • boolean, e.g. True: appears in Arize as label = 'True', score = 1.0
  • float, e.g. 1.0: appears in Arize as score = 1.0
  • string, e.g. "reasonable": appears in Arize as label = 'reasonable'
  • tuple, e.g. (1.0, "my explanation notes"): appears in Arize as score = 1.0, explanation = 'my explanation notes'
  • tuple, e.g. ("True", 1.0, "my explanation"): appears in Arize as label = 'True', score = 1.0, explanation = 'my explanation'
  • EvaluationResult, e.g. EvaluationResult(score=1, label='reasonable', explanation='explanation', metadata={}): appears in Arize as score = 1.0, label = 'reasonable', explanation = 'explanation', metadata = {}

  • To use EvaluationResult, import it with: from arize.experimental.datasets.experiments.types import EvaluationResult
  • One of label or score must be supplied (you can't have an evaluation with no result).
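As an illustration of these output types, here is a minimal sketch of one evaluator returning a (label, score, explanation) tuple and an equivalent one returning an EvaluationResult directly (the length-based labeling logic is just a placeholder):

from arize.experimental.datasets.experiments.types import EvaluationResult

def length_check(output):
    # Returns a (label, score, explanation) tuple
    label = "reasonable" if len(str(output)) < 500 else "too_long"
    score = 1.0 if label == "reasonable" else 0.0
    return (label, score, f"output length is {len(str(output))} characters")

def length_check_result(output):
    # Equivalent evaluator returning an EvaluationResult explicitly
    label = "reasonable" if len(str(output)) < 500 else "too_long"
    return EvaluationResult(
        label=label,
        score=1.0 if label == "reasonable" else 0.0,
        explanation=f"output length is {len(str(output))} characters",
    )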

Here's another example which compares the output to a value in the dataset_row.

"""Example dataset
dataframe = pd.DataFrame({
    "expected": [2]
})
"""

def is_equal(dataset_row, output):
    expected = dataset_row.get("expected")
    return expected == output

arize_client.run_experiment(
    space_id="",
    dataset_name="",
    task="",
    evaluators=[is_equal],
    experiment_name=""
)

Create an LLM Evaluator

LLM evaluators utilize LLMs as judges to assess the success of your experiment. These evaluators can either use a prebuilt LLM evaluation template or be customized to suit your specific needs.

Arize supports a large number of LLM evaluators out of the box with llm_classify:

Arize Templates

Here's an example of an LLM evaluator that checks for hallucinations in the model output:

import os

import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify
from phoenix.experiments.types import EvaluationResult

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

HALLUCINATION_PROMPT_TEMPLATE = """
In this task, you will be presented with a query, a reference text and an answer. The answer is
generated to the question based on the reference text. The answer may contain false information. You
must use the reference text to determine if the answer to the question contains false information,
if the answer is a hallucination of facts. Your objective is to determine whether the answer text
contains factual information and is not a hallucination. A 'hallucination' refers to
an answer that is not based on the reference text or assumes information that is not available in
the reference text. 

Your response should be a single word: either "factual" or "hallucinated", and
it should not include any other text or characters. "hallucinated" indicates that the answer
provides factually inaccurate information to the query based on the reference text. "factual"
indicates that the answer to the question is correct relative to the reference text, and does not
contain made up information. Please read the query and reference text carefully before determining
your response.

    # Query: {query}
    # Reference text: {reference}
    # Answer: {response}

Is the answer above factual or hallucinated based on the query and reference text?
"""

def hallucination_eval(output, dataset_row):
    # Get the original query and reference text from the dataset_row
    query = dataset_row.get("query")
    reference = dataset_row.get("reference")

    # Create a DataFrame to pass into llm_classify
    df_in = pd.DataFrame(
        {"query": query, "reference": reference, "response": output}, index=[0]
    )
    
    # Run the LLM classification
    eval_df = llm_classify(
        dataframe=df_in,
        template=HALLUCINATION_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
        rails=["factual", "hallucinated"],
        provide_explanation=True,
    )
    
    # Map the eval df to EvaluationResult
    label = eval_df["label"][0]
    score = 1 if label == "factual" else 0
    explanation = eval_df["explanation"][0]
    
    # Return the evaluation result
    return EvaluationResult(label=label, score=score, explanation=explanation)

In this example, the hallucination_eval function evaluates whether the output of an experiment contains hallucinations by checking it against the query and reference text from the dataset using an LLM. The llm_classify function runs the eval, and the evaluator returns an EvaluationResult that includes a score, label, and explanation.

Once you define your evaluator, you can use it in your experiment run like this:

experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[hallucination_eval],
    experiment_name=experiment_name,
)

You can customize LLM evaluators to suit your experiment's needs: update the template with your instructions and the rails with your desired output labels.
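For instance, here is a hedged sketch of a fully custom LLM evaluator built on the same llm_classify pattern; the conciseness template, rails, and model name below are illustrative, not an Arize-provided template:

import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify
from phoenix.experiments.types import EvaluationResult

# Illustrative custom template: classify whether a response is concise
CONCISENESS_PROMPT_TEMPLATE = """
You will be given a response. Respond with a single word: "concise" if the
response answers directly without unnecessary detail, otherwise "verbose".

    # Response: {response}
"""

def conciseness_eval(output):
    # Build a single-row DataFrame for llm_classify
    df_in = pd.DataFrame({"response": output}, index=[0])
    eval_df = llm_classify(
        dataframe=df_in,
        template=CONCISENESS_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4o-mini"),
        rails=["concise", "verbose"],  # custom rails matching the template's output words
        provide_explanation=True,
    )
    label = eval_df["label"][0]
    return EvaluationResult(
        label=label,
        score=1.0 if label == "concise" else 0.0,
        explanation=eval_df["explanation"][0],
    )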

Create a code evaluator

Code Evaluators

Code evaluators are functions designed to assess the outputs of your experiments. They allow you to define specific criteria for success, which can be as simple or complex as your application requires. Code evaluators are especially useful when you need to apply tailored logic or rules to validate the output of your model.

Custom Code Evaluators

Creating a custom code evaluator is as simple as writing a Python function. By default, this function will take the output of an experiment run as its single argument. Your custom evaluator can return either a boolean or a numeric value, which will then be recorded as the evaluation score.

For example, let’s say our experiment is testing a task that should output a numeric value between 1 and 100. We can create a simple evaluator function to check if the output falls within this range:

def in_bounds(output):
    return 1 <= output <= 100

By passing the in_bounds function to run_experiment, evaluations will automatically be generated for each experiment run, indicating whether the output is within the allowed range. This allows you to quickly assess the validity of your experiment’s outputs based on custom criteria.

experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[in_bounds],
    experiment_name=experiment_name,
)

Prebuilt Phoenix Code Evaluators

Alternatively, you can leverage one of our prebuilt evaluators from the phoenix.experiments.evaluators module.

JSONParsable checks whether the output of an experiment run is a JSON-parsable string. It's useful when you want to ensure that the generated output can be correctly formatted and parsed as JSON, which is critical for applications that rely on structured data formats.

from phoenix.experiments import JSONParsable

# This defines a code evaluator that checks if the output is JSON-parsable
json_parsable_evaluator = JSONParsable()

MatchesRegex checks whether the output matches a specified regex pattern. It's ideal for validating outputs that need to conform to a specific format, such as phone numbers, email addresses, or other structured data that can be described with a regular expression.

from phoenix.experiments import MatchesRegex

# This defines a code evaluator that checks if the output contains 
# a valid phone number format
phone_number_evaluator = MatchesRegex(
    pattern=r"\d{3}-\d{3}-\d{4}",
    name="valid-phone-number"
)

ContainsKeyword checks if a specific keyword is present in the output of an experiment run. It's helpful for validating that certain key phrases or terms appear in the output, which might be essential for tasks like content generation or response validation.

from phoenix.experiments import ContainsKeyword

# This defines a code evaluator that checks for the presence of 
# the keyword "success"
contains_keyword = ContainsKeyword(keyword="success")

ContainsAnyKeyword checks if any of a list of keywords is present in the output. It's useful when you want to validate that at least one of several important terms or phrases appears in the generated output, providing flexibility in how success is defined.

from phoenix.experiments import ContainsAnyKeyword

# This defines a code evaluator that checks if any of the keywords 
# "error", "failed", or "warning" are present
contains_any_keyword = ContainsAnyKeyword(keywords=["error", "failed", "warning"])

ContainsAllKeywords ensures that all specified keywords are present in the output of an experiment run. It's useful for cases where the presence of multiple key phrases is essential, such as ensuring all required elements of a response or content piece are included.

from phoenix.experiments import ContainsAllKeywords

# This defines a code evaluator that checks if all of the keywords 
# "foo" and "bar" are present
contains_all_keywords = ContainsAllKeywords(keywords=["foo", "bar"])
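Prebuilt evaluators are passed to run_experiment the same way as custom functions. A minimal sketch combining several of the evaluators defined above (SPACE_ID, DATASET_ID, run_task, and experiment_name follow the earlier examples):

experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=DATASET_ID,
    task=run_task,
    evaluators=[
        json_parsable_evaluator,  # output must be JSON-parsable
        phone_number_evaluator,   # output must contain a phone-number pattern
        contains_any_keyword,     # output must mention "error", "failed", or "warning"
    ],
    experiment_name=experiment_name,
)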

Advanced: Evaluator as a Class

This is an alternative you can use if you'd prefer object-oriented programming over functional programming. You create an evaluator that inherits from the Evaluator(ABC) base class in the Arize Python SDK; the evaluator takes in a single dataset row as input and returns an EvaluationResult dataclass.

Eval Class Inputs

The evaluate method supports the following arguments:

  • input: the experiment run input (e.g. def evaluate(self, input, **kwargs): ...)
  • output: the experiment run output (e.g. def evaluate(self, output, **kwargs): ...)
  • dataset_row: the entire row of the data, including every column as a dictionary key (e.g. def evaluate(self, dataset_row, **kwargs): ...)
  • metadata: the dataset_row metadata (e.g. def evaluate(self, metadata, **kwargs): ...)

from arize.experimental.datasets.experiments.evaluators.base import Evaluator, EvaluationResult

class ExampleAll(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        print("Evaluator Using All Inputs")

class ExampleDatasetRow(Evaluator):
    def evaluate(self, dataset_row, **kwargs) -> EvaluationResult:
        print("Evaluator Using dataset_row")

class ExampleInput(Evaluator):
    def evaluate(self, input, **kwargs) -> EvaluationResult:
        print("Evaluator Using Input")

class ExampleOutput(Evaluator):
    def evaluate(self, output, **kwargs) -> EvaluationResult:
        print("Evaluator Using Output")

EvaluationResult Outputs

The evaluate method can return a float score, a string label, a tuple (as described under Evaluation Outputs above), or an EvaluationResult instance:

  • EvaluationResult: score, label, and explanation
  • float: score output
  • string: label string output

class ExampleResult(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        print("Evaluator Using All Inputs")
        # score, label, and explanation are computed by your evaluation logic
        return EvaluationResult(score=score, label=label, explanation=explanation)

class ExampleScore(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        print("Evaluator Using A float")
        return 1.0

class ExampleLabel(Evaluator):
    def evaluate(self, input, output, dataset_row, metadata, **kwargs) -> EvaluationResult:
        print("Evaluator label")
        return "good"

Code Evaluator as Class

from arize.experimental.datasets.experiments.evaluators.base import Evaluator, EvaluationResult

class MatchesExpected(Evaluator):
    annotator_kind = "CODE"
    name = "matches_expected"

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        expected_output = dataset_row.get("expected")
        matches = expected_output == output
        return EvaluationResult(score=float(matches), label=str(matches))

    async def async_evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        return self.evaluate(output, dataset_row, **kwargs)

You can run this class using the following:

arize_client.run_experiment(
    space_id="",
    dataset_name="", 
    task="", 
    evaluators=[MatchesExpected()], 
    experiment_name=""
)

LLM Evaluator as Class Example

Here's an example of an LLM evaluator, written as a class, that checks for hallucinations in the model output:

import os

import pandas as pd
from arize.experimental.datasets.experiments.evaluators.base import Evaluator
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)
from phoenix.experiments.types import EvaluationResult

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

class HallucinationEvaluator(Evaluator):
    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        print("Evaluating outputs")
        expected_output = dataset_row["attributes.llm.output_messages"]

        # Create a DataFrame with the actual and expected outputs
        df_in = pd.DataFrame(
            {"selected_output": output, "expected_output": expected_output}, index=[0]
        )
        # Run the LLM classification
        expect_df = llm_classify(
            dataframe=df_in,
            template=HALLUCINATION_PROMPT_TEMPLATE,
            model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
            rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
            provide_explanation=True,
        )
        label = expect_df["label"][0]
        score = 1 if label == "factual" else 0  # score 1 when the answer is factual
        explanation = expect_df["explanation"][0]

        # Return the evaluation result
        return EvaluationResult(score=score, label=label, explanation=explanation)

In this example, the HallucinationEvaluator class evaluates whether the output of an experiment contains hallucinations by comparing it to the expected output using an LLM. The llm_classify function runs the eval, and the evaluator returns an EvaluationResult that includes a score, label, and explanation.

To use the EvaluationResult class, use the following import statement:

from arize.experimental.datasets.experiments.types import EvaluationResult

To run the experiment, you can load the evaluator into run_experiment as shown above.

Need help writing a custom evaluator template? Use ✨Copilot to write one for you.
