Human Annotations

Use human feedback to curate datasets for testing

Annotations are custom labels that can be added to traces in LLM applications. AI engineers can use annotations to manually label data, curate datasets for experimentation, and use human feedback instead of code or LLM evals.

Annotations are great for finding examples where LLM evals and humans disagree so they can be reviewed further. Oftentimes, subject matter experts (e.g., a doctor or lawyer) are needed to determine the correctness of an answer. Customers also log feedback directly using labeling queues or our annotations API.

Key features

  1. Add annotations in the UI

  2. Create Annotations via API

  3. Labeling Queues

Add annotations in the UI

Annotations are labels that can be applied at the span level for LLM use cases. Each annotation is defined by a config (a label or a score). Once configured, an annotation is available for any future annotation on the model.

Unstructured text annotations (notes) can also be continuously added.

Viewing Annotations on a Trace

Users can save and view annotations on a trace and also filter on them.

Create Annotations via API

Annotations can also be logged via our Python SDK. Use the log_annotations function in our Python SDK to attach human feedback, corrections, or other annotations to specific spans (traces). The code below assumes that you have annotation data available in a pandas dataframe (annotations_df) and the context.span_id of the span you want to annotate.

Note: Annotations can be applied to spans up to 14 days prior to the current day. To apply annotations beyond this lookback window, please reach out to support@arize.com.

Logging the annotation

Important Prerequisite: Before logging annotations using the SDK, you must first configure the annotation definition within the Arize application UI.

  1. Navigate to a trace within your project in the Arize platform.

  2. Click the "Annotate" button to open the annotation panel.

  3. Click "Add Annotation".

  4. Define the <annotation_name> exactly as you will use it in your SDK code.

  5. Select the appropriate Type (Label for categorical strings, Score for numerical values).

  6. If using Label, define the specific allowed label strings that the SDK can send for this annotation name. If using Score, you can optionally define score ranges.

Note: Only annotations matching a pre-configured (1) name and (2) type/labels in the UI can be successfully logged via the SDK.
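
For example, if you configured an annotation named quality in the UI with Type Label and allowed labels "good" and "bad" (a hypothetical config), the matching dataframe column must be annotation.quality.label, and its values must be one of the allowed labels. A minimal sketch:

import pandas as pd

annotation_data = {
    "context.span_id": ["YOUR_SPAN_ID"],   # placeholder: replace with a real span ID
    "annotation.quality.label": ["good"],  # must be one of the labels allowed by the UI config
}
annotations_df = pd.DataFrame(annotation_data)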

Here is how you can log your annotations in real time to the Arize platform with the Python SDK:

Import Packages and Set Up the Arize Client

import os
import pandas as pd
from arize.pandas.logger import Client

API_KEY = os.environ.get("ARIZE_API_KEY") # You can get this from the UI
SPACE_ID = os.environ.get("ARIZE_SPACE_ID") # You can get this from the UI
DEVELOPER_KEY = os.environ.get("ARIZE_DEVELOPER_KEY") # Needed for sync functions
PROJECT_NAME = "YOUR_PROJECT_NAME" # Replace with your project name

print(f"\n🚀 Initializing Arize client for space '{SPACE_ID}'...")
try:
    arize_client = Client(
        space_id=SPACE_ID,
        api_key=API_KEY,
        developer_key=DEVELOPER_KEY
    )
    print("✅ Arize client initialized.")
except Exception as e:
    print(f"❌ Error initializing client: {e}")
    exit()

Create Sample Data (replace with your actual data)

TARGET_SPAN_ID = "3461a49d-e0c3-469a-837b-d83f4a606543" # Replace with your span ID
annotation_data = {
    "context.span_id": [TARGET_SPAN_ID],
    "annotation.quality.label": ["good"],
    "annotation.relevance.label": ["relevant"],
    "annotation.relevance.updated_by": ["human_annotator_1"],
    "annotation.sentiment_score.score": [4.5],
    "annotation.notes": ["User confirmed the summary was helpful."],
}
annotations_df = pd.DataFrame(annotation_data)

Log Annotation

try:
    response = arize_client.log_annotations(
        dataframe=annotations_df,
        project_name=PROJECT_NAME,
        validate=True,  # Keep validation enabled
        verbose=True    # Enable detailed SDK logs, especially when first trying
    )

    if response:
        print("\n✅ Successfully logged annotations!")
        print(f"   Annotation Records Updated: {response.records_updated}")
    else:
        print("\n⚠️ Annotation logging call completed, but no response received (check SDK logs/platform).")

except Exception as e:
    print(f"\n❌ An error occurred during annotation logging: {e}") 

Annotations Dataframe Schema

The annotations dataframe requires the following columns:

  1. context.span_id: The unique identifier of the span to which the annotations should be attached.

  2. Annotation Columns: Columns following the pattern annotation.<annotation_name>.<suffix> where:

  • <annotation_name>: A name for your annotation (e.g., quality, correctness, sentiment). It should contain only alphanumeric characters and underscores.

  • <suffix>: Defines the type and metadata of the annotation. Valid suffixes are:

    • label: For categorical annotations (e.g., "good", "bad", "spam"). The value should be a string.

    • score: For numerical annotations (e.g., a rating from 1-5). The value should be numeric (int or float).

    • updated_by (Optional): A string indicating who made the annotation (e.g., "user_id_123", "annotator_team_a"). If not provided, the SDK automatically sets this to "SDK Logger".

    • updated_at (Optional): A timestamp indicating when the annotation was made, represented as milliseconds since the Unix epoch (integer). If not provided, the SDK automatically sets this to the current UTC time. See the timestamp sketch after the example below.

    • You must provide at least one annotation.<annotation_name>.label or annotation.<annotation_name>.score column for each annotation you want to log.

  • annotation.notes (Optional): A column containing free-form text notes that apply to the entire span, not a specific annotation label/score. The value should be a string. The SDK will handle formatting this correctly for storage.

An example annotation data dictionary would look like:

# Assume TARGET_SPAN_ID holds the ID of the span you want to annotate
TARGET_SPAN_ID = "3461a49d-e0c3-469a-837b-d83f4a606543"

annotation_data = {
    "context.span_id": [TARGET_SPAN_ID],
    # Annotation 1: Categorical label, let SDK autogenerate updated_by/updated_at
    "annotation.quality.label": ["good"],
    # Annotation 2: Categorical label, manually set updated_by
    "annotation.relevance.label": ["relevant"],
    "annotation.relevance.updated_by": ["human_annotator_1"],
    # Annotation 3: Numerical score, let SDK autogenerate updated_by/updated_at
    "annotation.sentiment_score.score": [4.5],
    # Optional notes for the span
    "annotation.notes": ["User confirmed the summary was helpful."],
}
annotations_df = pd.DataFrame(annotation_data)
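
If you want to set updated_by or updated_at explicitly rather than relying on the SDK defaults, remember that updated_at must be an integer number of milliseconds since the Unix epoch. A minimal sketch (the span ID and annotator name are placeholders):

import time
import pandas as pd

# updated_at must be an integer number of milliseconds since the Unix epoch
now_ms = int(time.time() * 1000)

annotation_data = {
    "context.span_id": ["YOUR_SPAN_ID"],                 # placeholder: replace with a real span ID
    "annotation.quality.label": ["good"],
    "annotation.quality.updated_by": ["reviewer_jane"],  # placeholder annotator name
    "annotation.quality.updated_at": [now_ms],           # milliseconds since epoch
}
annotations_df = pd.DataFrame(annotation_data)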

Labeling Queues

Labeling queues are sets of data that you would like subject matter experts or third parties to label or score against any criteria you specify. You can use these annotations to build expert-curated golden datasets for fine-tuning and to find examples where LLM evals and humans disagree.

To use labeling queues, you need:

  1. A dataset you want to annotate

  2. Annotator users in your space

    1. Note: you can assign annotators OR members in your space to a labeling queue. Annotators will see a restricted view of the platform (see below)

  3. Annotation criteria

Inviting an Annotator

On the settings page, you can invite annotators by adding them as users with the Annotator account role. They will receive an email inviting them to your space, where they can set their password.

Creating a Labeling Queue

After you have created a dataset of traces you want to evaluate, you can create a labeling queue and distribute it to your annotation team. You can then view your records and the annotations provided.

The columns that annotators label will appear on datasets as namespaced annotation columns (e.g., annotation.hallucination). The latest annotation value for a specific row will be namespaced with latest.userannotation, which can be helpful to use for experiments if you have multiple annotators labeling a dataset.
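
For example, once you export an annotated dataset as a pandas dataframe, you can surface the rows where the human label disagrees with an LLM eval. This is a minimal sketch; the column names (annotation.hallucination and eval.hallucination.label) are assumptions, so substitute the columns actually present in your dataset:

import pandas as pd

# Hypothetical exported dataset: both column names below are assumptions.
dataset_df = pd.DataFrame({
    "annotation.hallucination": ["hallucinated", "factual", "factual"],
    "eval.hallucination.label": ["hallucinated", "hallucinated", "factual"],
})

# Rows where the human annotator and the LLM eval disagree
disagreements = dataset_df[
    dataset_df["annotation.hallucination"] != dataset_df["eval.hallucination.label"]
]
print(f"{len(disagreements)} rows where the human label and the LLM eval disagree")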

Labeling data as an annotator

Annotators see the labeling queues they have been assigned and the data they need to annotate, with the label or score to provide shown in the top right. Your datasets can contain text, images, and links. Annotators can leave notes and use keyboard shortcuts to provide annotations faster.
