Cleanlab

Ensuring the reliability and accuracy of LLM-generated responses is a critical challenge for production AI systems. Poor-quality training data, ambiguous labels, and untrustworthy outputs can degrade model performance and lead to unreliable results.

Cleanlab’s Trustworthy Language Model (TLM) is a tool that estimates the trustworthiness of an LLM response. It provides a confidence score that helps detect hallucinations, ambiguous responses, and potential misinterpretations, enabling teams to flag unreliable outputs and improve the robustness of their AI systems.
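
For a concrete sense of what that score looks like, here is a minimal sketch of scoring a single prompt directly with the cleanlab-tlm package (the example prompt is illustrative, and a Cleanlab TLM API key is assumed to be set in your environment):

from cleanlab_tlm import TLM

tlm = TLM()

# .prompt() returns the model's answer together with a trustworthiness score
result = tlm.prompt("What is the boiling point of water at sea level, in Celsius?")
print(result["response"])               # the generated answer
print(result["trustworthiness_score"])  # confidence in that answer, from 0 to 1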

This guide demonstrates how to integrate Cleanlab’s Trustworthy Language Model (TLM) with Phoenix to systematically identify and improve low-quality LLM responses. By leveraging TLM for automated data quality assessment and Phoenix for response analysis, you can build more robust and trustworthy AI applications.

Specifically, this tutorial will walk through:

  • Evaluating LLM-generated responses for trustworthiness.

  • Using Cleanlab TLM to score and flag untrustworthy responses.

  • Leveraging Phoenix for tracing and visualizing response evaluations.

Key Implementation Steps for Generating Evals with TLM

  1. Install Dependencies, Set Up API Keys, Obtain LLM Responses + Trace in Phoenix
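
A minimal sketch of this setup step, assuming OpenAI as the traced provider and the OpenInference OpenAI instrumentor (the project name, model, and example request are placeholders to adapt to your application):

# pip install arize-phoenix cleanlab-tlm openai openinference-instrumentation-openai
import os

from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

os.environ["OPENAI_API_KEY"] = "sk-..."     # your OpenAI key
os.environ["CLEANLAB_TLM_API_KEY"] = "..."  # your Cleanlab TLM key (used later by cleanlab_tlm)

# Send OpenTelemetry spans to Phoenix and auto-instrument the OpenAI client
tracer_provider = register(project_name="your-project-name")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Any chat completion made from here on is traced in Phoenix
client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is the capital of Australia?"},
    ],
)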

  2. Download Trace Dataset

import phoenix as px

spans_df = px.Client().get_spans_dataframe(project_name="your-project-name")  # replace with your project name
spans_df.head()
  3. Prep Data from the Trace Dataset

import json

# Create a new DataFrame with input and output columns
eval_df = spans_df[["context.span_id", "attributes.input.value", "attributes.output.value"]].copy()
eval_df.set_index("context.span_id", inplace=True)

# Combine system and user prompts from the traces
def get_prompt(input_value):
    if isinstance(input_value, str):
        input_value = json.loads(input_value)
    system_prompt = input_value["messages"][0]["content"]
    user_prompt = input_value["messages"][1]["content"]
    return system_prompt + "\n" + user_prompt

# Get the responses from the traces
def get_response(output_value):
    if isinstance(output_value, str):
        output_value = json.loads(output_value)
    return output_value["choices"][0]["message"]["content"]

# Create a list of prompts and associated responses
prompts = [get_prompt(input_value) for input_value in eval_df["attributes.input.value"]]
responses = [get_response(output_value) for output_value in eval_df["attributes.output.value"]]

eval_df["prompt"] = prompts
eval_df["response"] = responses
  4. Set Up TLM and Evaluate Each Pair

from cleanlab_tlm import TLM

tlm = TLM(options={"log": ["explanation"]})

# Evaluate each of the prompt, response pairs using TLM
evaluations = tlm.get_trustworthiness_score(prompts, responses)

# Extract the trustworthiness scores and explanations from the evaluations
trust_scores = [entry["trustworthiness_score"] for entry in evaluations]
explanations = [entry["log"]["explanation"] for entry in evaluations]

# Add the trust scores and explanations to the DataFrame
eval_df["score"] = trust_scores
eval_df["explanation"] = explanations
  5. Upload Evals to Phoenix

from phoenix.trace import SpanEvaluations

eval_df["score"] = eval_df["score"].astype(float)
eval_df["explanation"] = eval_df["explanation"].astype(str)

px.Client().log_evaluations(SpanEvaluations(eval_name="Trustworthiness", dataframe=eval_df))
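
Once logged, the trustworthiness scores and explanations appear alongside their spans in Phoenix, where low-scoring responses can be filtered and reviewed. As a quick sanity check before opening the UI, you can also surface the least trustworthy responses straight from the DataFrame (illustrative only):

# Show the lowest-scoring (least trustworthy) responses for manual review
low_trust = eval_df.sort_values("score").head(10)
print(low_trust[["prompt", "response", "score", "explanation"]])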

Check out the full tutorial here:

Cleanlab TLM (Google Colab)