Instrumenting your audio application to send events and traces to Arize involves capturing key events from the OpenAI Realtime API's WebSocket and converting them into spans that provide meaningful insights into your system's behavior.
We have identified the following events from the OpenAI Realtime API's WebSocket as the most valuable for LLM observability. While the API emits many other events, the majority of useful information can be captured by listening for these:
Session Events
session.created: Indicates the creation of a new session.
session.updated: Denotes updates to the session's parameters or state.
Audio Input Events
input_audio_buffer.speech_started: Signals the start of speech input.
input_audio_buffer.speech_stopped: Indicates the end of speech input.
input_audio_buffer.committed: Confirms that the audio input buffer has been committed for processing.
Conversation Events
conversation.item.created: Represents the creation of a new conversation item, such as a user message.
Response Events
response.audio_transcript.delta: Provides incremental transcripts of the audio response.
response.audio_transcript.done: Indicates the completion of the audio transcript.
response.done: Marks the completion of the response generation.
response.audio.delta: Carries incremental chunks of the output audio bytes.
Error Events
error: Conveys any errors encountered during processing.
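To see how these events arrive in practice, below is a minimal sketch of a WebSocket listener that filters for the key events. The endpoint URL, model name, and the handle_event routing are illustrative assumptions; consult the OpenAI Realtime API reference for current connection details.

```python
import asyncio
import json
import os

import websockets  # pip install websockets

# Assumed endpoint and model name; check the Realtime API docs for current values.
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

KEY_EVENTS = {
    "session.created", "session.updated",
    "input_audio_buffer.speech_started", "input_audio_buffer.speech_stopped",
    "input_audio_buffer.committed",
    "conversation.item.created",
    "response.audio_transcript.delta", "response.audio_transcript.done",
    "response.audio.delta", "response.done",
    "error",
}

def handle_event(event: dict) -> None:
    """Route key events to your span-management logic (sketched in later sections)."""
    if event.get("type") in KEY_EVENTS:
        print(f"observability event: {event['type']}")

async def listen() -> None:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # `additional_headers` is named `extra_headers` in websockets < 13.
    async with websockets.connect(REALTIME_URL, additional_headers=headers) as ws:
        async for raw in ws:
            handle_event(json.loads(raw))

if __name__ == "__main__":
    asyncio.run(listen())
```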
For each of these key events, you can create corresponding spans to capture the event's context and metadata:
Session Management
Upon receiving session.created, start a new span to represent the session's lifecycle.
Update the span with any changes when session.updated is received.
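For instance, a minimal sketch of session span management using the OpenTelemetry Python SDK might look like the following. The span name and the session_spans registry are illustrative assumptions, and the handlers assume the event payload carries a session object with an id.

```python
from typing import Dict

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Hypothetical registry mapping session IDs to their open spans.
session_spans: Dict[str, trace.Span] = {}

def on_session_created(event: dict) -> None:
    session_id = event.get("session", {}).get("id", "unknown")
    span = tracer.start_span("openai.realtime.session")
    span.set_attribute("session.id", session_id)
    span.set_attribute("session.status", "active")
    session_spans[session_id] = span

def on_session_updated(event: dict) -> None:
    span = session_spans.get(event.get("session", {}).get("id", ""))
    if span is not None:
        # Record the update as a span event; attach changed fields as needed.
        span.add_event("session.updated")
```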
Audio Input Handling
Start a span when input_audio_buffer.speech_started is detected.
Attach attributes such as input audio URL, MIME type, and transcript as they become available.
End the span upon receiving input_audio_buffer.speech_stopped.
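A sketch of this pattern is shown below; the span name is an assumption, and the MIME type is a placeholder you would set from your own audio pipeline.

```python
from typing import Optional

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

_input_span: Optional[trace.Span] = None  # span for the current speech segment

def on_speech_started(event: dict) -> None:
    global _input_span
    _input_span = tracer.start_span("audio.input")

def on_speech_stopped(event: dict) -> None:
    global _input_span
    if _input_span is None:
        return
    # Attach whatever is known at this point; the URL and transcript often
    # arrive later, so other handlers may also set attributes before end().
    _input_span.set_attribute("input.audio.mime_type", "audio/wav")
    _input_span.end()
    _input_span = None
```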
Conversation Tracking
Create a span for each conversation.item.created event to monitor user inputs and system messages.
Include attributes like message role and content.
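For example, a handler along these lines records each conversation item as a short-lived span; the span name is an assumption, and the item payload shape follows the Realtime API reference.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def on_item_created(event: dict) -> None:
    item = event.get("item", {})
    with tracer.start_as_current_span("conversation.item") as span:
        span.set_attribute("message.role", item.get("role", "unknown"))
        # Realtime items carry a list of content parts; serialize for the attribute.
        span.set_attribute("message.content", str(item.get("content", [])))
```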
Response Generation
Initiate a span when response generation begins.
Update the span with incremental transcripts from response.audio_transcript.delta.
Finalize the span upon receiving response.done, adding attributes such as output audio URL, MIME type, and any function call details.
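One way to wire this up is sketched below, assuming you start the span from the response.created event the API emits when generation begins; the span name and transcript accumulator are illustrative.

```python
from typing import List, Optional

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

_response_span: Optional[trace.Span] = None
_transcript_parts: List[str] = []

def on_response_created(event: dict) -> None:
    global _response_span, _transcript_parts
    _response_span = tracer.start_span("openai.realtime.response")
    _transcript_parts = []

def on_transcript_delta(event: dict) -> None:
    # response.audio_transcript.delta carries an incremental "delta" string.
    _transcript_parts.append(event.get("delta", ""))

def on_response_done(event: dict) -> None:
    global _response_span
    if _response_span is None:
        return
    response = event.get("response", {})
    _response_span.set_attribute("response.id", response.get("id", "unknown"))
    _response_span.set_attribute("response.status", response.get("status", "unknown"))
    _response_span.set_attribute("output.audio.transcript", "".join(_transcript_parts))
    _response_span.end()
    _response_span = None
```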
Error Handling
For any error events, log the error details within the relevant active span to aid in debugging and observability.
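A minimal sketch of such a handler, attaching the error to whichever span is currently active:

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

def on_error(event: dict) -> None:
    error = event.get("error", {})
    span = trace.get_current_span()  # or whichever span you track as active
    span.set_attribute("error.type", error.get("type", "unknown"))
    span.set_attribute("error.message", error.get("message", ""))
    span.set_status(Status(StatusCode.ERROR, error.get("message", "")))
```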
Session Creation: When receiving a session.created event, start a parent span to represent the session lifecycle.
Conversation Item Creation: Process user inputs and generate spans for user messages.
Response Handling: Log output audio transcripts and set response attributes.
Handling Completion: Finalize spans when a response.done event is received.
Tool Calls and Nested Spans: For response.function_call_arguments.done, create nested spans to track tool invocations.
When processing tool calls, you may need to extract attributes and metadata about the tools and set them in spans for observability. Below is an example implementation for processing tools within a session update event. This is just one example and can be adapted for your specific use case.
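The sketch below extracts tool definitions from a session.updated event and flattens them into span attributes. The llm.tools.* attribute naming is an assumption you can adapt to your own conventions.

```python
import json

from opentelemetry import trace

def process_session_tools(event: dict) -> None:
    """Record the tools configured on a session as flattened span attributes."""
    tools = event.get("session", {}).get("tools", [])
    span = trace.get_current_span()
    for i, tool in enumerate(tools):
        prefix = f"llm.tools.{i}"
        span.set_attribute(f"{prefix}.name", tool.get("name", ""))
        span.set_attribute(f"{prefix}.description", tool.get("description", ""))
        # Tool parameters are a JSON schema; serialize so the attribute is a string.
        span.set_attribute(f"{prefix}.parameters", json.dumps(tool.get("parameters", {})))
```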
Adding URLs: Add input and output audio URLs to the span whenever they become available.
When working with URLs, you may need to save audio files or other data to a storage service like Google Cloud Storage (GCS). Below is an example implementation for GCS. Please note that this is just one example, and you may need to adjust the code for your specific storage solution.
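One possible sketch, using the google-cloud-storage client library. The function name and the temporary file path are illustrative, and the code assumes 24 kHz mono PCM16 audio, the Realtime API's default output format.

```python
import os
import wave

from google.cloud import storage  # pip install google-cloud-storage
from opentelemetry import trace

def save_audio_to_gcs(
    pcm16_audio: bytes,
    bucket_name: str,
    destination_blob_name: str,
    file_path: str = "temp_audio.wav",
    make_public: bool = False,
) -> str:
    # 1. Save the PCM16 audio to a local WAV file.
    with wave.open(file_path, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 16-bit samples
        wav_file.setframerate(24000)   # Realtime API default sample rate
        wav_file.writeframes(pcm16_audio)

    # 2. Upload the WAV file to the specified GCS bucket.
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(destination_blob_name)
    blob.upload_from_filename(file_path)
    if make_public:
        blob.make_public()
    url = blob.public_url if make_public else f"gs://{bucket_name}/{destination_blob_name}"

    # 3. Set the URL as a span attribute for observability.
    trace.get_current_span().set_attribute("output.audio.url", url)

    # 4. Clean up the local file after it has been uploaded.
    os.remove(file_path)
    return url
```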
This example performs the following steps:
Save the Audio to a Local File: Converts the PCM16 audio data into a WAV file.
Upload to GCS: Uploads the WAV file to the specified GCS bucket.
Set Span Attribute: Adds the GCS URL as an attribute to the span for observability.
Clean Up: Deletes the local file after it has been uploaded to GCS.
Replace bucket_name, destination_blob_name, and file_path with your own values.
This is an example specific to Google Cloud Storage. You can adapt a similar pattern for other storage providers like AWS S3 or Azure Blob Storage.
If you need the file to be public, set the make_public parameter to True.
This example illustrates one way to handle storage, but always tailor the implementation to fit your infrastructure and application needs.
The following semantic conventions define attributes for sessions, audio, conversations, responses, and errors.
Session Attributes
session.id: Unique identifier for the session.
session.status: Current status of the session (e.g., active, completed).
Audio Attributes
input.audio.url: URL of the input audio file.
input.audio.mime_type: MIME type of the input audio (e.g., audio/wav).
input.audio.transcript: Transcript of the input audio.
output.audio.url: URL of the output audio file.
output.audio.mime_type: MIME type of the output audio.
output.audio.transcript: Transcript of the output audio.
Conversation Attributes
message.role: Role of the message sender (e.g., user, system).
message.content: Content of the message.
Response Attributes
response.id: Unique identifier for the response.
response.status: Status of the response (e.g., in_progress, completed).
response.token_count: Number of tokens in the response.
Error Attributes
error.type: Type of error encountered.
error.message: Detailed error message.
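As a final illustration, the sketch below shows how these conventions translate into span attributes; the values are placeholders.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("openai.realtime.response") as span:
    # Placeholder values following the semantic conventions above.
    span.set_attribute("response.id", "resp_123")
    span.set_attribute("response.status", "completed")
    span.set_attribute("response.token_count", 42)
    span.set_attribute("output.audio.mime_type", "audio/wav")
    span.set_attribute("output.audio.transcript", "Hello! How can I help you today?")
```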
While this guide provides a framework for instrumentation, tailor the implementation to fit your application's architecture. Ensure that your instrumentation captures the specified key events to provide comprehensive observability into your application's interactions with the OpenAI Realtime API.