Currently, voice evaluations are supported exclusively with OpenAI models. Support for additional models is planned and will be available soon.
This guide provides instructions on how to evaluate voice applications using OpenAI models within the Phoenix framework. The example notebook linked below demonstrates the process of configuring and running evaluations.
Phoenix Installation: Make sure the phoenix package is installed in your Python environment.
OpenAI API Key: Obtain an API key for the OpenAI model you plan to use.
Audio Data: Prepare the audio data required for evaluation. This can be in the form of raw audio bytes, base64-encoded strings, or URLs pointing to audio files. If you have existing data in Arize, you can export it and use it here.
Python Environment: Ensure you are using Python version 3.7 or higher.
Use the OpenAIModel class to define the OpenAI model for evaluation. Replace the placeholder API key with your own.
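A minimal sketch of this step, assuming the audio-capable gpt-4o-audio-preview model and an API key supplied via the OPENAI_API_KEY environment variable (swap in whichever OpenAI audio model and key management you use):

```python
from phoenix.evals import OpenAIModel

# Assumes OPENAI_API_KEY is set in the environment; replace with your own key handling.
model = OpenAIModel(model="gpt-4o-audio-preview")
```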
Templates are used to configure the prompts sent to the OpenAI model. They consist of "rails" (the set of valid responses) and a sequence of prompt parts, each specifying the type and content of the input or instructions. Below is a complete example for a tone and emotion classification template.
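A sketch of such a template, assuming the ClassificationTemplate, PromptPartTemplate, and PromptPartContentType classes from phoenix.evals.templates; the instruction wording and the {audio_data} variable name are illustrative:

```python
from phoenix.evals.templates import (
    ClassificationTemplate,
    PromptPartContentType,
    PromptPartTemplate,
)

TONE_EMOTION_RAILS = ["positive", "neutral", "negative"]

EMOTION_AUDIO_TEMPLATE = ClassificationTemplate(
    rails=TONE_EMOTION_RAILS,
    template=[
        # Part 1: static instructions describing the task and the valid labels.
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template=(
                "You are an AI assistant that classifies the tone and emotion of audio. "
                "Listen to the audio and label it as one of: positive, neutral, negative.\n"
                "Here is the audio:"
            ),
        ),
        # Part 2: the audio itself, inserted at runtime from the audio_data column.
        PromptPartTemplate(
            content_type=PromptPartContentType.AUDIO,
            template="{audio_data}",
        ),
        # Part 3: output-format constraint, a single-word response and nothing else.
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template=(
                "Respond with exactly one word: positive, neutral, or negative. "
                "Do not include any other text or characters."
            ),
        ),
    ],
)
```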
Explanation of Components
Rails: These are the valid outputs for the classification task. In this case, the rails are ["positive", "neutral", "negative"].
Prompt Parts: The template breaks the prompt into multiple parts, each of a specific content type:
TEXT: Static instructions or context provided as text.
AUDIO: Dynamically inserted audio data for analysis.
Templates: These define the content that will be sent to the model. For example, one part provides a description of the task, while another includes the audio data.
How It Works
Prompt Parts:
The first part provides clear instructions to the model, explaining the task and the valid response labels.
The second part dynamically inserts the audio data for evaluation.
The third part ensures the model outputs the result in the required format (a single-word string with no extra characters).
Rails:
These define the set of valid responses that the model can produce (i.e., positive, neutral, or negative). Any result outside this set can be flagged or considered invalid.
Dynamic Inputs:
The {audio_data} placeholder is replaced with the actual base64-encoded audio string at runtime. This is handled by llm_classify when the content_type is set to PromptPartContentType.AUDIO. The variable name should match the column containing your audio URL or data.
If you only want to evaluate text (i.e., the transcript), you can use the prompt directly as a string, as in the sketch below.
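For instance, a transcript-only tone check might use a plain string template (the {transcript} variable name is illustrative and must match a column in your DataFrame):

```python
# Text-only variant: a plain string prompt; {transcript} maps to a DataFrame column of the same name.
TRANSCRIPT_TONE_TEMPLATE = (
    "You are evaluating the tone of a call transcript.\n"
    "Transcript: {transcript}\n"
    "Respond with exactly one word: positive, neutral, or negative."
)
```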
This modular approach ensures flexibility, allowing you to reuse and adapt templates for different use cases or models.
A data processor transforms raw audio data into base64-encoded strings. For example, fetching and encoding audio from Google Cloud Storage:
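A minimal sketch of such a processor, assuming the google-cloud-storage package is installed, credentials are configured, and your data holds gs:// URIs (the function name is hypothetical):

```python
import base64

from google.cloud import storage


def gcs_uri_to_base64(gcs_uri: str) -> str:
    """Download an audio file from Google Cloud Storage and return a base64-encoded string."""
    bucket_name, blob_path = gcs_uri.replace("gs://", "", 1).split("/", 1)
    client = storage.Client()
    audio_bytes = client.bucket(bucket_name).blob(blob_path).download_as_bytes()
    return base64.b64encode(audio_bytes).decode("utf-8")
```

You can then apply this to your DataFrame before evaluation, e.g. df["audio_data"] = df["audio_url"].apply(gcs_uri_to_base64).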
If your audio data is already in base64 format as an encoded string, you can skip this step.
Use the llm_classify function to run the evaluation. Pass your data as a Pandas DataFrame.
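A sketch of the call, reusing the template, rails, and model from above and assuming a DataFrame whose audio_data column holds base64-encoded audio strings:

```python
import pandas as pd

from phoenix.evals import OpenAIModel, llm_classify

# One row per clip; the column name must match the template variable {audio_data}.
# encoded_clip_1 / encoded_clip_2 stand in for base64-encoded audio strings produced earlier.
df = pd.DataFrame({"audio_data": [encoded_clip_1, encoded_clip_2]})

results = llm_classify(
    df,
    template=EMOTION_AUDIO_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-audio-preview"),
    rails=TONE_EMOTION_RAILS,
    provide_explanation=True,
)
```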
Data Processors: Custom functions can transform audio paths, URLs, or raw data to the required format.
Templates: Modify the templates to fit your specific evaluation needs.
Note: template variables should use the same names as the corresponding columns in the DataFrame.
Explanations: Enable provide_explanation=True to include detailed reasoning in the evaluation output.
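With explanations enabled, the output DataFrame carries both the predicted label and the model's reasoning (column names assumed from the standard llm_classify output):

```python
# "label" holds the predicted rail; "explanation" holds the reasoning when provide_explanation=True.
print(results[["label", "explanation"]].head())
```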