Realtime events are used to communicate between the client and server in real-time audio applications. The events are sent as JSON objects over supported connection methods, such as WebSockets or WebRTC. The events are used to manage the conversation, audio buffers, and responses in real time.

You can use audio client and server events with these APIs:

- [Azure AI Voice Live API](/azure/ai-services/speech-service/voice-live)

Unless otherwise specified, the events described in this document are applicable to both APIs.

The Realtime API (via `/realtime`) is built on [the WebSockets API](https://developer.mozilla.org/docs/Web/API/WebSockets_API) to facilitate fully asynchronous streaming communication between the end user and model. Device details like capturing and rendering audio data are outside the scope of the Realtime API. It should be used in the context of a trusted, intermediate service that manages both connections to end users and model endpoint connections. Don't use it directly from untrusted end user devices.

> [!TIP]
> To get started with the Realtime API, see the [quickstart](realtime-audio-quickstart.md) and [how-to guide](./how-to/realtime-audio.md).
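
As an illustration of this event flow, here's a minimal Python sketch that opens a WebSocket connection, sends a `session.update` client event as JSON, and prints the type of each server event it receives. The endpoint URL is a placeholder and authentication is omitted; both depend on which API and service you connect to.

```python
# Minimal sketch of the client/server event flow. The URL is a placeholder and
# authentication is intentionally omitted; see the quickstart for real values.
import asyncio
import json

import websockets


async def main() -> None:
    url = "wss://<your-endpoint>/realtime?api-version=<api-version>"  # placeholder

    async with websockets.connect(url) as ws:
        # Client events are JSON objects with a "type" field.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a helpful AI assistant responding in natural, engaging language."
            },
        }))

        # Server events arrive on the same connection as JSON messages.
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))


asyncio.run(main())
```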

Changes to `articles/ai-services/speech-service/includes/quickstarts/voice-live-api/realtime-python.md` (4 additions and 3 deletions):

In the front matter (`author: eric-urban`, `ms.author: eur`, `ms.service: azure-ai-openai`, `ms.topic: include`), the `ms.date` value is updated from `5/19/2025` to `6/27/2025`.

In the quickstart's `session_update` message, an `instructions` property is added to the `session` object, and the high definition voice name is changed from `en-US-Aria:DragonHDLatestNeural` to `en-US-Ava:DragonHDLatestNeural`:

```python
session_update = {
    "type": "session.update",
    "session": {
        "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
        "turn_detection": {
            "type": "azure_semantic_vad",
            "threshold": 0.3,
            # ...
        },
        # ... unchanged input audio settings, including "type": "server_echo_cancellation" ...
        "voice": {
            "name": "en-US-Ava:DragonHDLatestNeural",
            "type": "azure-standard",
            "temperature": 0.8,
        },
        # ...
    }
}
```
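
As a sketch of how such a message is typically sent over the connection (the function and connection object names here are assumptions, not taken from the quickstart):

```python
import json


async def send_session_update(connection, session_update: dict) -> None:
    # Serialize the session.update event to JSON and send it over the open
    # WebSocket connection. "connection" is an assumed name for that object.
    await connection.send(json.dumps(session_update))
```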

The output of the script is printed to the console. You see messages indicating the status of the connection, audio stream, and playback. The audio is played back through your speakers or headphones.

The remaining changes update the documentation for input audio, turn detection, and voice output properties.

You can use input audio properties to configure the input audio stream.

| Property | Type | Required or optional | Description |
|----------|----------|----------|------------|
|`input_audio_sampling_rate`| integer | Optional | The sampling rate of the input audio.<br/><br/>The supported values are `16000` and `24000`. The default value is `24000`. |
|`input_audio_echo_cancellation`| object | Optional | Enhances the input audio quality by removing the echo from the model's own voice without requiring any client-side echo cancellation.<br/><br/>Set the `type` property of `input_audio_echo_cancellation` to enable echo cancellation.<br/><br/>The supported value for `type` is `server_echo_cancellation`, which is used when the model's voice is played back to the end-user through a speaker, and the microphone picks up the model's own voice. |
|`input_audio_noise_reduction`| object | Optional | Enhances the input audio quality by suppressing or removing environmental background noise.<br/><br/>Set the `type` property of `input_audio_noise_reduction` to enable noise suppression.<br/><br/>The supported value for `type` is `azure_deep_noise_suppression`, which optimizes for speakers closest to the microphone. |

Here's an example of input audio properties in a session object:
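
As an illustrative sketch (in the style of the Python quickstart's `session_update` dictionary), a `session.update` payload that sets the three documented properties could look like this:

```python
# Illustrative sketch only: property names and values come from the table above,
# but the exact example in the article may differ.
session_update = {
    "type": "session.update",
    "session": {
        "input_audio_sampling_rate": 24000,
        "input_audio_echo_cancellation": {"type": "server_echo_cancellation"},
        "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"},
    },
}
```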

Noise suppression enhances the input audio quality by suppressing or removing environmental background noise. Noise suppression helps the model understand the end-user with higher accuracy and improves the accuracy of signals like interruption detection and end-of-turn detection.

Server echo cancellation enhances the input audio quality by removing the echo from the model's own voice. In this way, client-side echo cancellation isn't required. Server echo cancellation is useful when the model's voice is played back to the end-user through a speaker and the microphone picks up the model's own voice.

> [!NOTE]
> The service assumes the client plays response audio as soon as it receives it. If playback is delayed for more than 3 seconds, echo cancellation quality is impacted.

Turn detection is the process of detecting when the end-user started or stopped speaking.

| Property | Type | Required or optional | Description |
|----------|----------|----------|------------|
|`type`| string | Optional | The type of turn detection system to use. Type `server_vad` detects the start and end of speech based on audio volume.<br/><br/>Type `azure_semantic_vad` detects the start and end of speech based on semantic meaning. Azure semantic voice activity detection (VAD) improves turn detection by removing filler words to reduce the false alarm rate. The current list of filler words is `['ah', 'umm', 'mm', 'uh', 'huh', 'oh', 'yeah', 'hmm']`. The service ignores these words when there's an ongoing response. The remove filler words feature assumes that the client plays response audio as soon as it receives it.<br/><br/>The default value is `server_vad`. |
|`threshold`| number | Optional | A higher threshold requires a higher confidence signal of the user trying to speak. |
|`prefix_padding_ms`| integer | Optional | The amount of audio, measured in milliseconds, to include before the start of speech detection signal. |
|`silence_duration_ms`| integer | Optional | The duration of the user's silence, measured in milliseconds, to detect the end of speech. |

Here's an example of end of utterance detection in a session object:

```json
{
  "session": {
    "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
    "turn_detection": {
      "type": "azure_semantic_vad",
      "threshold": 0.3,
      // ...
    }
  }
}
```

### Phrase list

Use a phrase list for lightweight, just-in-time customization of audio input. To configure the phrase list, set `phrase_list` in the `session.update` message.
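
Here's a minimal sketch, assuming `phrase_list` is set directly on the `session` object (the exact placement and the phrases themselves are illustrative, not taken from the article):

```python
# Illustrative sketch: the placement of "phrase_list" inside the session object is an
# assumption based on the description above, and the phrases are example values.
session_update = {
    "type": "session.update",
    "session": {
        "phrase_list": ["Contoso", "Rehaan", "Jessie"],
    },
}
```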

> [!NOTE]
> Phrase list isn't currently supported with the `gpt-4o-realtime-preview`, `gpt-4o-mini-realtime-preview`, and `phi4-mm-realtime` models. To learn more about phrase lists, see [phrase list for speech to text](./improve-accuracy-phrase-list.md).

### Custom lexicon

Use the `custom_lexicon_url` string property to customize pronunciation for both standard Azure text to speech voices and custom voices. To learn more about how to format the custom lexicon (the format is the same as for Speech Synthesis Markup Language (SSML)), see [custom lexicon for text to speech](./speech-synthesis-markup-pronunciation.md#custom-lexicon).

```json
{
  "voice": {
    "name": "en-US-Ava:DragonHDLatestNeural",
    "type": "azure-standard",
    "temperature": 0.8, // optional
    "custom_lexicon_url": "<custom lexicon url>"
  }
}
```

### Speaking rate

Use the `rate` string property to adjust the speaking speed for any standard Azure text to speech voice or custom voice.

The rate value should range from 0.5 to 1.5, with higher values indicating faster speeds.

```json
{
  "voice": {
    "name": "en-US-Ava:DragonHDLatestNeural",
    "type": "azure-standard",
    "temperature": 0.8, // optional
    "rate": "1.2"
  }
}
```

### Audio output through Azure text to speech

You can use the `voice` parameter to specify a standard or custom voice. The voice is used for audio output.

The `voice` object has the following properties:

| Property | Type | Required or optional | Description |
|----------|----------|----------|------------|
|`name`| string | Required | Specifies the name of the voice. For example, `en-US-AvaNeural`. |
|`type`| string | Required | The type of Azure voice: `azure-standard` or `azure-custom`. |
|`temperature`| number | Optional | Specifies the temperature applicable to Azure HD voices. Higher values provide higher levels of variability in intonation, prosody, and so on. |

Here's a partial message example for a standard (`azure-standard`) voice:

```json
{
  "voice": {
    "name": "en-US-AvaNeural",
    "type": "azure-standard"
  }
}
```

Here's an example `session.update` message for a standard high definition voice:

```json
{
  "voice": {
    "name": "en-US-Ava:DragonHDLatestNeural",
    "type": "azure-standard",
    "temperature": 0.8 // optional
  }
}
```

To configure the viseme, you can set the `animation.outputs` in the `session.update` message. For example:

```json
{
  "event_id": "your-session-id",
  "session": {
    "voice": {
      "name": "en-US-AvaNeural",
      "type": "azure-standard"
    },
    "modalities": ["text", "audio"],
    "instructions": "You are a helpful AI assistant responding in natural, engaging language.",
    "turn_detection": {
      "type": "server_vad"
    },
    // ... animation output settings ...
  }
}
```

A `response.animation_viseme.done` message is sent when all viseme messages are sent.
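
As an illustration, a client reading server events could collect viseme-related events until this done event arrives. The following is a minimal sketch: only the `response.animation_viseme.done` event type comes from this article; the connection object and loop structure are assumptions.

```python
import json


async def collect_viseme_events(ws) -> list[dict]:
    # Minimal sketch: "ws" is an assumed, already-open WebSocket connection that
    # yields server events as JSON text messages.
    viseme_events: list[dict] = []
    async for message in ws:
        event = json.loads(message)
        event_type = event.get("type", "")
        if event_type.startswith("response.animation_viseme"):
            viseme_events.append(event)
        if event_type == "response.animation_viseme.done":
            break  # all viseme messages for this response have been sent
    return viseme_events
```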

## Related content

- Try out the [Voice Live API quickstart](./voice-live-quickstart.md)
- See the [audio events reference](/azure/ai-services/openai/realtime-audio-reference?context=/azure/ai-services/speech-service/context/context)