Evals With Explanations

It can be hard to understand why an LLM responds in a specific way. The explanation feature of Phoenix allows you to get an Eval output and an explanation from the LLM at the same time. We have found this incredibly useful for debugging LLM Evals.

from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

# df is a pandas DataFrame whose columns match the variables in the template
# (for example, a benchmark dataset loaded with download_benchmark_dataset).

# The rails constrain the output to the specific values defined by the template.
# They strip extraneous text such as ",,," or "..." and ensure the binary value
# expected by the template is returned.
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,  # also return an explanation for each label
)
# relevance_classifications is a DataFrame with columns 'label' and 'explanation'

The provide_explanation flag above can be set with any of the built-in templates or your own custom templates. The example above is from a relevance Eval.
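
For instance, the same flag can be passed alongside a custom template. The sketch below is illustrative, not part of the library: the template text, the "correct"/"incorrect" rails, and the query/response columns are assumptions made for demonstration.

import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify

# A hypothetical custom classification template; the {query} and {response}
# variables must match column names in the dataframe being evaluated.
CUSTOM_TEMPLATE = """
You are evaluating whether the response correctly answers the query.
[BEGIN DATA]
Query: {query}
Response: {response}
[END DATA]
Answer with a single word: "correct" or "incorrect".
"""

rails = ["correct", "incorrect"]

# Toy dataframe for demonstration purposes only
df = pd.DataFrame(
    {
        "query": ["What is the capital of France?"],
        "response": ["The capital of France is Paris."],
    }
)

model = OpenAIModel(model_name="gpt-4", temperature=0.0)

correctness_classifications = llm_classify(
    dataframe=df,
    template=CUSTOM_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,  # adds an 'explanation' column next to 'label'
)

# Inspect the label and the model's explanation side by side
print(correctness_classifications[["label", "explanation"]])

As with the built-in templates, the returned DataFrame contains a 'label' column and, because provide_explanation=True, an 'explanation' column for each row.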

Google Colaboratory: see the "Classifications with Explanations" section.