Evaluate Voice Applications
Currently, voice evaluations are supported exclusively with OpenAI models. Support for additional models is planned and will be available soon.
This guide provides instructions on how to evaluate voice applications using OpenAI models within the Phoenix framework. The example notebook linked below demonstrates the process of configuring and running evaluations.
Prerequisites
Phoenix Installation: Make sure the phoenix package is installed in your Python environment.
OpenAI API Key: Obtain an API key for the OpenAI model you plan to use.
Audio Data: Prepare the audio data required for evaluation. This can be raw audio bytes, base64-encoded strings, or URLs pointing to audio files. If you have existing data in Arize, you can use our export client to retrieve it.
Python Environment: Ensure you are using Python version 3.7 or higher.
Steps to Evaluate
1. Set Up the Model
Use the OpenAIModel class to define the OpenAI model for evaluation. Replace the placeholder API key with your own.
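A minimal sketch of this step follows; the model name shown is an illustrative assumption, so substitute an audio-capable OpenAI model and your own key management:

```python
# pip install arize-phoenix openai
from phoenix.evals import OpenAIModel

# "gpt-4o-audio-preview" is an illustrative choice; use any OpenAI model
# that accepts audio input.
model = OpenAIModel(
    model="gpt-4o-audio-preview",
    api_key="YOUR_OPENAI_API_KEY",  # replace the placeholder with your own key
)
```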
2. Define the Template
Templates are used to configure prompts sent to the OpenAI model, ensuring that the task is clearly defined and the model's responses are constrained to valid outputs. Templates consist of rails (the set of valid responses) and a sequence of prompt parts that define the type and content of the input or instructions.
In addition to custom templates, we offer an out-of-the-box template for emotion detection. This template streamlines setup, allowing you to start classifying audio with minimal configuration.
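For example, the built-in emotion template can be imported directly (the import path below is an assumption; verify it against your installed Phoenix version):

```python
# Import path assumed; check your Phoenix version's documentation.
from phoenix.evals.default_audio_templates import EMOTION_AUDIO_RAILS, EMOTION_AUDIO_TEMPLATE
```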
Below is an example template for tone classification.
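The sketch below assumes Phoenix's ClassificationTemplate, PromptPartTemplate, and PromptPartContentType interfaces; adjust the imports and wording to match your installed version:

```python
from phoenix.evals.templates import (
    ClassificationTemplate,
    PromptPartContentType,
    PromptPartTemplate,
)

# Rails: the only responses accepted as valid.
TONE_RAILS = ["positive", "neutral", "negative"]

TONE_AUDIO_TEMPLATE = ClassificationTemplate(
    rails=TONE_RAILS,
    template=[
        # Part 1: task instructions and the valid response labels.
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template=(
                "You are an AI system that classifies the tone of an audio clip. "
                "Listen to the audio and choose exactly one label from: "
                "['positive', 'neutral', 'negative']."
            ),
        ),
        # Part 2: the audio itself; {audio} must match the dataframe column
        # holding your base64-encoded audio data.
        PromptPartTemplate(
            content_type=PromptPartContentType.AUDIO,
            template="{audio}",
        ),
        # Part 3: constrain the output to a single-word string.
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template="Your response must be a single word: positive, neutral, or negative.",
        ),
    ],
)
```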
How It Works
Prompt Parts
Part 1: Provides task instructions and specifies valid response labels.
Part 2: Dynamically inserts the audio data for analysis via a template placeholder (e.g., {audio} in the example above). Ensure that the prompt variable you choose matches the name of the column that holds your base64-encoded audio data.
Part 3: Ensures the model outputs a response in the desired format (a single-word string: positive, neutral, or negative).
Rails
Rails define the set of valid outputs for the classification task: ["positive", "neutral", "negative"]. Any response outside this set can be flagged as invalid.
This modular approach ensures flexibility, allowing you to reuse and adapt templates for different use cases or models.
If you are evaluating text (e.g., a transcript) instead of audio, you can directly use a string prompt without needing dynamic placeholders.
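For example, a transcript variant can be a plain string template (a sketch; the {transcript} variable name is illustrative and must match a column in your dataframe):

```python
# Plain string template for text inputs such as transcripts.
TONE_TRANSCRIPT_TEMPLATE = """
Classify the tone of the conversation transcript below.
Your response must be a single word: positive, neutral, or negative.

Transcript:
{transcript}
"""
```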
3. Prepare the Data Processor (Optional)
Using a data processor with Phoenix enables parallel processing of your audio data, improving efficiency and scalability. A data processor is responsible for transforming raw audio data into base64-encoded strings, which can then be utilized by your models.
Processor Requirements
To ensure compatibility with Phoenix, your data processor must meet the following criteria:
Consistent Input and Output Types
The input and output of the processor must maintain the same type.
For example: if you are processing a DataFrame, the input would be a Series (a single row), and the output would also be a Series (the updated row with the encoded audio).
Audio Link Processing
The processor must fetch audio from a provided link (either from cloud storage or local storage) and produce a base64-encoded string.
Column Assignment Consistency
The encoded string must be assigned to the same column referenced in your prompt template.
For example, if you are using the EMOTION_AUDIO_TEMPLATE, the base64-encoded audio string should be assigned to the "audio" column.
Example: Fetching and Encoding Audio from Google Cloud Storage
Below is an example data processor that demonstrates how to fetch audio from Google Cloud Storage, encode it as a base64 string, and assign it to the appropriate column in the dataframe:
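The following is a sketch under stated assumptions: it uses the google-cloud-storage client library, expects a "url" column containing gs:// links, and writes the encoded audio to an "audio" column to match the template above; adapt the names and authentication to your environment:

```python
import base64

import pandas as pd
from google.cloud import storage  # pip install google-cloud-storage


def gcs_audio_processor(row: pd.Series) -> pd.Series:
    """Fetch audio from the gs:// link in row["url"], base64-encode it,
    and assign it to the column referenced by the prompt template."""
    # Split "gs://bucket/path/to/file.wav" into bucket name and blob path.
    bucket_name, _, blob_path = row["url"][len("gs://"):].partition("/")

    client = storage.Client()
    audio_bytes = client.bucket(bucket_name).blob(blob_path).download_as_bytes()

    # The target column ("audio") must match the template placeholder.
    row["audio"] = base64.b64encode(audio_bytes).decode("utf-8")
    return row
```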
If your audio data is already in base64 format as an encoded string, you can skip this step.
4. Perform the Evaluation
To run an evaluation, use the llm_classify function. This function accepts a DataFrame, a list of audio URLs, or raw audio bytes as input. In the example below, data is exported directly from Arize to perform the evaluation.
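A sketch of the call is shown below; the dataframe df stands in for your Arize export, and the data_processor keyword reflects the processor mechanism described above (an assumption to verify against your Phoenix version):

```python
from phoenix.evals import llm_classify

results = llm_classify(
    dataframe=df,  # e.g., exported from Arize, with a "url" column of audio links
    model=model,
    template=TONE_AUDIO_TEMPLATE,
    rails=["positive", "neutral", "negative"],
    data_processor=gcs_audio_processor,  # assumed keyword; omit if audio is already base64
    provide_explanation=True,  # include the model's reasoning in the output
)
```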
Considerations
Data Processors: Custom functions can transform audio paths, URLs, or raw data to the required format.
Templates: Modify the templates to fit your specific evaluation needs.
Remember: template variables must be named to match the corresponding columns in the dataframe.
Explanations: Enable provide_explanation=True to include detailed reasoning in the evaluation output.
Examples
Emotion Detection
Custom Template