Evaluate Voice Applications

Currently, voice evaluations are supported exclusively with OpenAI models. Support for additional models is planned and will be available soon.

This guide provides instructions on how to evaluate voice applications using OpenAI models within the Phoenix framework. The example notebook linked below demonstrates the process of configuring and running evaluations.

Prerequisites

  1. Phoenix Installation: Make sure the phoenix package is installed in your Python environment (see the setup sketch after this list).

  2. OpenAI API Key: Obtain an API key for the OpenAI model you plan to use.

  3. Audio Data: Prepare the audio data required for evaluation. This can be in the form of raw audio bytes, base64-encoded strings, or URLs pointing to audio files. If you have existing data in Arize, you can use our export client to retrieve it.

  4. Python Environment: Ensure you are using Python version 3.7 or higher.
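
If you are setting up from scratch, the sketch below shows one way to cover the first two prerequisites. The package names and the use of the OPENAI_API_KEY environment variable are assumptions based on common Phoenix setups, so adjust them to your installation.

# Install the evaluation dependencies first, for example:
#   pip install arize-phoenix openai   (package names assumed; check your Phoenix version)
import os

# Supplying the key via an environment variable avoids hard-coding it in notebooks;
# phoenix's OpenAIModel typically falls back to OPENAI_API_KEY if no api_key is passed.
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"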

Steps to Evaluate

1. Set Up the Model

Use the OpenAIModel class to define the OpenAI model for evaluation. Replace the placeholder API key with your own.

from phoenix.evals import OpenAIModel

model = OpenAIModel(model="gpt-4o-audio-preview", api_key="your_openai_api_key")

2. Define the Template

Templates are used to configure prompts sent to the OpenAI model, ensuring that the task is clearly defined and the model's responses are constrained to valid outputs. Templates consist of rails (the set of valid responses) and a sequence of prompt parts that define the type and content of the input or instructions.

In addition to custom templates, we offer an out-of-the-box template for emotion detection. This template streamlines setup, allowing you to start classifying audio with minimal configuration.
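
For example, assuming a recent Phoenix release, the built-in emotion template and its rails can be imported as shown below; the exact module path may vary across versions, so verify it against your installed package.

from phoenix.evals.default_audio_templates import (
    EMOTION_AUDIO_RAILS,      # valid output labels for the emotion task
    EMOTION_PROMPT_TEMPLATE,  # prompt template used in step 4 below
)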

Below is an example template for tone classification.

from phoenix.evals.templates import (
    ClassificationTemplate,
    PromptPartContentType,
    PromptPartTemplate,
)

# Define valid classification labels (rails)
TONE_EMOTION_RAILS = ["positive", "neutral", "negative"]

# Create the classification template
template = ClassificationTemplate(
    rails=TONE_EMOTION_RAILS,  # Specify the valid output labels
    template=[
        # Prompt part 1: Task description
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template="""
            You are a helpful AI bot that checks for the tone of the audio.
            Analyze the audio file and determine the tone (e.g., positive, neutral, negative).
            Your evaluation should provide a multiclass label from the following options: ['positive', 'neutral', 'negative'].
            
            Here is the audio:
            """,
        ),
        # Prompt part 2: Insert the audio data
        PromptPartTemplate(
            content_type=PromptPartContentType.AUDIO,
            template="{audio}",  # Placeholder for the audio content
        ),
        # Prompt part 3: Define the response format
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template="""
            Your response must be a string, either positive, neutral, or negative, and should not contain any text or characters aside from that.
            """,
        ),
    ],
)

How It Works

  1. Prompt Parts

    • Part 1: Provides task instructions and specifies valid response labels.

    • Part 2: Dynamically inserts the audio data for analysis using the placeholder. You'll want to ensure that the prompt variable you choose corresponds to the column that holds your base64-encoded audio data.

    • Part 3: Ensures the model outputs a response in the desired format (a single-word string: positive, neutral, or negative).

  2. Rails

    • Rails define the set of valid outputs for the classification task: ["positive", "neutral", "negative"].

    • Any response outside this set can be flagged as invalid.

This modular approach ensures flexibility, allowing you to reuse and adapt templates for different use cases or models.

If you are evaluating text (e.g., a transcript) instead of audio, you can use a plain string prompt instead of the multi-part template with an audio placeholder.
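
For instance, a transcript-level tone check could use a plain string template such as the sketch below; the {transcript} variable is an assumption and should match the name of the column holding your transcript text.

TONE_TEXT_TEMPLATE = """
You are a helpful AI bot that checks the tone of a conversation transcript.
Classify the transcript as positive, neutral, or negative.

Here is the transcript:
{transcript}

Your response must be a single word: positive, neutral, or negative.
"""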

3. Prepare the Data Processor (Optional)

Using a data processor with Phoenix enables parallel processing of your audio data, improving efficiency and scalability. A data processor is responsible for transforming raw audio data into base64-encoded strings, which can then be utilized by your models.

Processor Requirements

To ensure compatibility with Phoenix, your data processor must meet the following criteria:

  1. Consistent Input and Output Types

    • The input and output of the processor must maintain the same type.

    • For example, when processing a DataFrame, the input is a pandas Series (a single row) and the output is the same Series with the encoded audio added (see the minimal sketch after this list).

  2. Audio Link Processing

    • The processor must fetch audio from a provided link (either from cloud storage or local storage) and produce a base64-encoded string.

  3. Column Assignment Consistency

    • The encoded string must be assigned to the same column referenced in your prompt template.

    • For example, if you are using the built-in EMOTION_PROMPT_TEMPLATE, the base64-encoded audio string should be assigned to the "audio" column.

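Before the full Google Cloud Storage example below, here is a minimal sketch of a processor for locally stored files. The audio_path column name is an assumption, and the function is written as async to match the pattern used in the example that follows; adjust both to your data and Phoenix version.

import base64

import pandas as pd


async def encode_local_audio(row: pd.Series) -> pd.Series:
    """Reads a local audio file and stores it on the row as a base64-encoded string."""
    # The column holding the file path is assumed to be "audio_path"
    with open(row["audio_path"], "rb") as f:
        audio_bytes = f.read()

    # Assign the encoded audio to the same column referenced by the prompt template
    row["audio"] = base64.b64encode(audio_bytes).decode("utf-8")
    return row
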
Example: Fetching and Encoding Audio from Google Cloud Storage

Below is an example data processor that demonstrates how to fetch audio from Google Cloud Storage, encode it as a base64 string, and assign it to the appropriate column in the dataframe:

import asyncio
import base64

import aiohttp
import pandas as pd


async def async_fetch_gcloud_data(row: pd.Series) -> pd.Series:
    """
    Fetches audio from a Google Cloud Storage URL and stores the content on the row
    as a base64-encoded string.
    """
    try:
        # Execute the gcloud command to fetch an access token
        process = await asyncio.create_subprocess_exec(
            "gcloud",
            "auth",
            "print-access-token",
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await process.communicate()
        if process.returncode != 0:
            raise RuntimeError(f"Error executing gcloud command: {stderr.decode('UTF-8').strip()}")
        token = stdout.decode("UTF-8").strip()

        # Ensure the token is not empty or None
        if not token:
            raise ValueError("Failed to retrieve a valid access token. Token is empty.")

    except Exception as e:
        # Catch any other exceptions and re-raise them with additional context
        raise RuntimeError(f"An unexpected error occurred: {str(e)}")

    # Set the token in the request header
    gcloud_header = {"Authorization": f"Bearer {token}"}

    # The request must go to storage.googleapis.com, so rewrite
    # storage.cloud.google.com and gs:// URLs accordingly
    url = row["attributes.input.audio.url"]
    G_API_HOST = "https://storage.googleapis.com/"
    if url.startswith("https://storage.cloud.google.com/"):
        g_api_url = url.replace("https://storage.cloud.google.com/", G_API_HOST)
    elif url.startswith("gs://"):
        g_api_url = url.replace("gs://", G_API_HOST)
    else:
        g_api_url = url

    # Download the audio content, raising on any HTTP error status
    async with aiohttp.ClientSession() as session:
        async with session.get(g_api_url, headers=gcloud_header) as response:
            response.raise_for_status()
            content = await response.read()

    # Base64-encode the audio and assign it to the column used by the prompt template
    row["audio"] = base64.b64encode(content).decode("utf-8")

    return row

If your audio data is already in base64 format as an encoded string, you can skip this step.

4. Perform the Evaluation

To run an evaluation, use the llm_classify function. This function accepts a DataFrame, a list of audio URLs, or raw audio bytes as input. In the example below, data is exported directly from Arize to perform the evaluation.

from datetime import datetime

import pandas as pd
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

from phoenix.evals.classify import llm_classify

# Create the Arize export client (supply your developer API key as needed)
client = ArizeExportClient()

# Export traces from Arize into a DataFrame
df = client.export_model_to_df(
    space_id='SPACE_ID',
    model_id='PROJECT_NAME',
    environment=Environments.TRACING,
    start_time=datetime.fromisoformat('2024-12-23T07:00:00.000+00:00'),
    end_time=datetime.fromisoformat('2024-12-31T06:59:59.999+00:00'),
)

# Run the evaluation using the built-in emotion template and rails
# (swap in the custom `template` and TONE_EMOTION_RAILS defined above if preferred)
results = llm_classify(
    model=model,
    data=df,
    data_processor=async_fetch_gcloud_data,
    template=EMOTION_PROMPT_TEMPLATE,
    rails=EMOTION_AUDIO_RAILS,
    provide_explanation=True,
)
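
Assuming the call above succeeds, llm_classify returns a DataFrame aligned with the input rows; it contains a label column and, because provide_explanation=True, an explanation column. A quick way to inspect the output:

# Inspect the assigned labels and the model's reasoning
print(results[["label", "explanation"]].head())

# Count how many rows fell into each class
print(results["label"].value_counts())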

Considerations

  • Data Processors: Custom functions can transform audio paths, URLs, or raw data to the required format.

  • Templates: Modify the templates to fit your specific evaluation needs.

    • Remember: template variable names must match the corresponding column names in the DataFrame.

  • Explanations: Enable provide_explanation=True to include detailed reasoning in the evaluation output.

Examples

Emotion Detection

Custom Template
