
Commit b03bf6d

Merge pull request #7454 from goergenj/voicelive-hotfix-10-02
Fixing voicelive model quickstart blocking issues and bugs and updati…
2 parents 7c4dd21 + 76c3266 commit b03bf6d

File tree

3 files changed (+130 -105 lines)


articles/ai-services/speech-service/includes/quickstarts/voice-live-api/python.md

Lines changed: 119 additions & 97 deletions
```diff
@@ -1,10 +1,12 @@
 ---
 manager: nitinme
-author: PatrickFarley
-ms.author: pafarley
+author: goergenj
+ms.author: jagoerge
+reviewer: patrickfarley
+ms.reviewer: pafarley
 ms.service: azure-ai-openai
 ms.topic: include
-ms.date: 7/31/2025
+ms.date: 10/02/2025
 ---

 In this article, you learn how to use Azure AI Speech voice live with [Azure AI Foundry models](/azure/ai-foundry/concepts/foundry-models-overview) using the VoiceLive SDK for Python.
```
```diff
@@ -123,6 +125,9 @@ The sample code in this quickstart uses either Microsoft Entra ID or an API key
     print("This sample requires pyaudio. Install with: pip install pyaudio")
     sys.exit(1)

+## Change to the directory where this script is located
+os.chdir(os.path.dirname(os.path.abspath(__file__)))
+
 # Environment variable loading
 try:
     from dotenv import load_dotenv
```
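The `os.chdir` lines moved above make relative paths such as `logs/` and `.env` resolve next to the script, not wherever the shell happened to be when it launched the program. A minimal, stdlib-only sketch of the same idea (the `globals().get` fallback is an addition here so the sketch also runs where `__file__` is undefined):

```python
import os

# The quickstart calls os.chdir(os.path.dirname(os.path.abspath(__file__)))
# early so that "logs/" and ".env" are found relative to the script itself.
# globals().get keeps this sketch runnable even where __file__ is undefined.
script_dir = os.path.dirname(os.path.abspath(globals().get("__file__", ".")))
print(os.path.isabs(script_dir))  # → True
```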
```diff
@@ -146,13 +151,11 @@ The sample code in this quickstart uses either Microsoft Entra ID or an API key
     ServerVad,
     AzureStandardVoice,
     Modality,
-    AudioFormat,
+    InputAudioFormat,
+    OutputAudioFormat,
 )

 # Set up logging
-## Change to the directory where this script is located
-os.chdir(os.path.dirname(os.path.abspath(__file__)))
-

 ## Add folder for logging
 if not os.path.exists('logs'):
     os.makedirs('logs')
```
```diff
@@ -486,8 +489,8 @@ The sample code in this quickstart uses either Microsoft Entra ID or an API key
     modalities=[Modality.TEXT, Modality.AUDIO],
     instructions=self.instructions,
     voice=voice_config,
-    input_audio_format=AudioFormat.PCM16,
-    output_audio_format=AudioFormat.PCM16,
+    input_audio_format=InputAudioFormat.PCM16,
+    output_audio_format=OutputAudioFormat.PCM16,
     turn_detection=turn_detection_config,
 )
```
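This change replaces the single `AudioFormat` enum with direction-specific types. As an illustration of the pattern (the classes below are hypothetical stand-ins; the real `InputAudioFormat`/`OutputAudioFormat` ship in `azure.ai.voicelive.models`), separate input and output enums let each direction grow its own formats without breaking the other:

```python
from enum import Enum

# Hypothetical stand-ins for the SDK's direction-specific enums.
class InputAudioFormat(Enum):
    PCM16 = "pcm16"

class OutputAudioFormat(Enum):
    PCM16 = "pcm16"
    # An output-only format could be added here without touching the input enum.

# Session configuration can now name each direction's format explicitly.
session_audio = {
    "input_audio_format": InputAudioFormat.PCM16.value,
    "output_audio_format": OutputAudioFormat.PCM16.value,
}
print(session_audio["input_audio_format"])  # → pcm16
```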

```diff
@@ -602,10 +605,9 @@ The sample code in this quickstart uses either Microsoft Entra ID or an API key

 parser.add_argument(
     "--voice",
-    help="Voice to use for the assistant",
+    help="Voice to use for the assistant. E.g. alloy, echo, fable, en-US-AvaNeural, en-US-GuyNeural",
     type=str,
     default=os.environ.get("AZURE_VOICELIVE_VOICE", "en-US-Ava:DragonHDLatestNeural"),
-    help="Voice to use for the assistant. E.g. alloy, echo, fable, en-US-AvaNeural, en-US-GuyNeural",
 )

 parser.add_argument(
```
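The `--voice` change above removes a duplicated `help=` keyword, which is a hard error in Python (`SyntaxError: keyword argument repeated`). A self-contained sketch of the corrected argument definition:

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Exactly one help= keyword; repeating it would be a SyntaxError.
parser.add_argument(
    "--voice",
    help="Voice to use for the assistant. E.g. alloy, echo, fable, en-US-AvaNeural, en-US-GuyNeural",
    type=str,
    default=os.environ.get("AZURE_VOICELIVE_VOICE", "en-US-Ava:DragonHDLatestNeural"),
)

# With no flag on the command line, the default (or env override) is used.
args = parser.parse_args([])
print(args.voice)
```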
```diff
@@ -620,7 +622,7 @@ The sample code in this quickstart uses either Microsoft Entra ID or an API key
 )

 parser.add_argument(
-    "--use-token-credential", help="Use Azure token credential instead of API key", action="store_true"
+    "--use-token-credential", help="Use Azure token credential instead of API key", action="store_true", default=True
 )

 parser.add_argument("--verbose", help="Enable verbose logging", action="store_true")
```
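Note that combining `action="store_true"` with `default=True` makes the option effectively always on: the flag can only set the value to `True`, and nothing on the command line can set it back to `False`. A small sketch demonstrating the behavior:

```python
import argparse

parser = argparse.ArgumentParser()
# store_true with default=True: the value is True whether or not the flag
# is passed, so this switch can never disable token credentials.
parser.add_argument(
    "--use-token-credential",
    help="Use Azure token credential instead of API key",
    action="store_true",
    default=True,
)

print(parser.parse_args([]).use_token_credential)                          # → True
print(parser.parse_args(["--use-token-credential"]).use_token_credential)  # → True
```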
````diff
@@ -754,106 +756,126 @@ The sample code in this quickstart uses either Microsoft Entra ID or an API key

 ## Output

-The output of the script is printed to the console. You see messages indicating the status of the connection, audio stream, and playback. The audio is played back through your speakers or headphones.
+The output of the script is printed to the console. You see messages indicating the status of system. The audio is played back through your speakers or headphones.

 ```console
-Session created: {"type": "session.update", "session": {"instructions": "You are a helpful AI assistant responding in natural, engaging language.","turn_detection": {"type": "azure_semantic_vad", "threshold": 0.3, "prefix_padding_ms": 200, "silence_duration_ms": 200, "remove_filler_words": false, "end_of_utterance_detection": {"model": "semantic_detection_v1", "threshold": 0.1, "timeout": 4}}, "input_audio_noise_reduction": {"type": "azure_deep_noise_suppression"}, "input_audio_echo_cancellation": {"type": "server_echo_cancellation"}, "voice": {"name": "en-US-Ava:DragonHDLatestNeural", "type": "azure-standard", "temperature": 0.8}}, "event_id": ""}
-Starting the chat ...
-Received event: {'session.created'}
-Press 'q' and Enter to quit the chat.
-Received event: {'session.updated'}
-Received event: {'input_audio_buffer.speech_started'}
-Received event: {'input_audio_buffer.speech_stopped'}
-Received event: {'input_audio_buffer.committed'}
-Received event: {'conversation.item.input_audio_transcription.completed'}
-Received event: {'conversation.item.created'}
-Received event: {'response.created'}
-Received event: {'response.output_item.added'}
-Received event: {'conversation.item.created'}
-Received event: {'response.content_part.added'}
-Received event: {'response.audio_transcript.delta'}
-Received event: {'response.audio_transcript.delta'}
-Received event: {'response.audio_transcript.delta'}
-REDACTED FOR BREVITY
-Received event: {'response.audio.delta'}
-Received event: {'response.audio.delta'}
-Received event: {'response.audio.delta'}
-q
-Received event: {'response.audio.delta'}
-Received event: {'response.audio.delta'}
-Received event: {'response.audio.delta'}
-Received event: {'response.audio.delta'}
-Received event: {'response.audio.delta'}
-Quitting the chat...
-Received event: {'response.audio.delta'}
-Received event: {'response.audio.delta'}
-REDACTED FOR BREVITY
-Received event: {'response.audio.delta'}
-Received event: {'response.audio.delta'}
-Chat done.
+============================================================
+🎤 VOICE ASSISTANT READY
+Start speaking to begin conversation
+Press Ctrl+C to exit
+============================================================
+
+🎤 Listening...
+🤔 Processing...
+🎤 Ready for next input...
+🎤 Listening...
+🤔 Processing...
+🎤 Ready for next input...
+🎤 Listening...
+🤔 Processing...
+🎤 Ready for next input...
+🎤 Listening...
+🤔 Processing...
+🎤 Listening...
+🎤 Ready for next input...
+🤔 Processing...
+🎤 Ready for next input...
 ```

 The script that you ran creates a log file named `<timestamp>_voicelive.log` in the `logs` folder.

+The default loglevel is set to **INFO** but you can change it by running the quickstart with the command line parameter `--verbose` or by changing the logging config within the code as follows:
+
 ```python
 logging.basicConfig(
     filename=f'logs/{timestamp}_voicelive.log',
     filemode="w",
     level=logging.DEBUG,
     format='%(asctime)s:%(name)s:%(levelname)s:%(message)s'
-)
 ```

 The log file contains information about the connection to the Voice Live API, including the request and response data. You can view the log file to see the details of the conversation.

 ```text
-2025-05-09 06:56:06,821:websockets.client:DEBUG:= connection is CONNECTING
-2025-05-09 06:56:07,101:websockets.client:DEBUG:> GET /voice-live/realtime?api-version=2025-05-01-preview&model=gpt-4o HTTP/1.1
-<REDACTED FOR BREVITY>
-2025-05-09 06:56:07,551:websockets.client:DEBUG:= connection is OPEN
-2025-05-09 06:56:07,551:websockets.client:DEBUG:< TEXT '{"event_id":"event_5a7NVdtNBVX9JZVuPc9nYK","typ...es":null,"agent":null}}' [1475 bytes]
-2025-05-09 06:56:07,552:websockets.client:DEBUG:> TEXT '{"type": "session.update", "session": {"turn_de....8}}, "event_id": null}' [551 bytes]
-2025-05-09 06:56:07,557:__main__:INFO:Starting audio stream ...
-2025-05-09 06:56:07,810:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAEA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:07,824:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:07,844:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:07,874:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:07,874:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAEA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:07,905:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAAA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:07,926:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:07,954:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:07,954:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...///7/", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:07,974:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAAA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:08,004:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:08,035:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:08,035:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
-<REDACTED FOR BREVITY>
-2025-05-09 06:56:42,957:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAP//", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:42,984:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...+/wAA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,005:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": .../////", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,034:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...+////", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,034:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...CAAMA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,055:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...CAAIA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,084:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAEA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,114:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...9//3/", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,114:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...DAAMA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,134:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAIA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,165:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAAAA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,184:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...+//7/", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,214:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": .../////", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,214:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...+/wAA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,245:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAIA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,264:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...AAP//", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,295:websockets.client:DEBUG:> TEXT '{"type": "input_audio_buffer.append", "audio": ...BAAEA", "event_id": ""}' [1346 bytes]
-2025-05-09 06:56:43,295:websockets.client:DEBUG:> CLOSE 1000 (OK) [2 bytes]
-2025-05-09 06:56:43,297:websockets.client:DEBUG:= connection is CLOSING
-2025-05-09 06:56:43,346:__main__:INFO:Audio stream closed.
-2025-05-09 06:56:43,388:__main__:INFO:Playback done.
-2025-05-09 06:56:44,512:websockets.client:DEBUG:< CLOSE 1000 (OK) [2 bytes]
-2025-05-09 06:56:44,514:websockets.client:DEBUG:< EOF
-2025-05-09 06:56:44,514:websockets.client:DEBUG:> EOF
-2025-05-09 06:56:44,514:websockets.client:DEBUG:= connection is CLOSED
-2025-05-09 06:56:44,514:websockets.client:DEBUG:x closing TCP connection
-2025-05-09 06:56:44,514:asyncio:ERROR:Unclosed client session
-client_session: <aiohttp.client.ClientSession object at 0x00000266DD8E5400>
+2025-10-02 14:47:37,901:__main__:INFO:Using Azure token credential
+2025-10-02 14:47:37,901:__main__:INFO:Connecting to VoiceLive API with model gpt-realtime
+2025-10-02 14:47:37,901:azure.core.pipeline.policies.http_logging_policy:INFO:Request URL: 'https://login.microsoftonline.com/organizations/v2.0/.well-known/openid-configuration'
+Request method: 'GET'
+Request headers:
+    'User-Agent': 'azsdk-python-identity/1.22.0 Python/3.11.9 (Windows-10-10.0.26200-SP0)'
+No body was attached to the request
+2025-10-02 14:47:38,057:azure.core.pipeline.policies.http_logging_policy:INFO:Response status: 200
+Response headers:
+    'Date': 'Thu, 02 Oct 2025 21:47:37 GMT'
+    'Content-Type': 'application/json; charset=utf-8'
+    'Content-Length': '1641'
+    'Connection': 'keep-alive'
+    'Cache-Control': 'max-age=86400, private'
+    'Strict-Transport-Security': 'REDACTED'
+    'X-Content-Type-Options': 'REDACTED'
+    'Access-Control-Allow-Origin': 'REDACTED'
+    'Access-Control-Allow-Methods': 'REDACTED'
+    'P3P': 'REDACTED'
+    'x-ms-request-id': 'f81adfa1-8aa3-4ab6-a7b8-908f411e0d00'
+    'x-ms-ests-server': 'REDACTED'
+    'x-ms-srs': 'REDACTED'
+    'Content-Security-Policy-Report-Only': 'REDACTED'
+    'Cross-Origin-Opener-Policy-Report-Only': 'REDACTED'
+    'Reporting-Endpoints': 'REDACTED'
+    'X-XSS-Protection': 'REDACTED'
+    'Set-Cookie': 'REDACTED'
+    'X-Cache': 'REDACTED'
+2025-10-02 14:47:42,105:azure.core.pipeline.policies.http_logging_policy:INFO:Request URL: 'https://login.microsoftonline.com/organizations/oauth2/v2.0/token'
+Request method: 'POST'
+Request headers:
+    'Accept': 'application/json'
+    'x-client-sku': 'REDACTED'
+    'x-client-ver': 'REDACTED'
+    'x-client-os': 'REDACTED'
+    'x-ms-lib-capability': 'REDACTED'
+    'client-request-id': 'REDACTED'
+    'x-client-current-telemetry': 'REDACTED'
+    'x-client-last-telemetry': 'REDACTED'
+    'X-AnchorMailbox': 'REDACTED'
+    'User-Agent': 'azsdk-python-identity/1.22.0 Python/3.11.9 (Windows-10-10.0.26200-SP0)'
+A body is sent with the request
+2025-10-02 14:47:42,466:azure.core.pipeline.policies.http_logging_policy:INFO:Response status: 200
+Response headers:
+    'Date': 'Thu, 02 Oct 2025 21:47:42 GMT'
+    'Content-Type': 'application/json; charset=utf-8'
+    'Content-Length': '6587'
+    'Connection': 'keep-alive'
+    'Cache-Control': 'no-store, no-cache'
+    'Pragma': 'no-cache'
+    'Expires': '-1'
+    'Strict-Transport-Security': 'REDACTED'
+    'X-Content-Type-Options': 'REDACTED'
+    'P3P': 'REDACTED'
+    'client-request-id': 'REDACTED'
+    'x-ms-request-id': '2e82e728-22c0-4568-b3ed-f00ec79a2500'
+    'x-ms-ests-server': 'REDACTED'
+    'x-ms-clitelem': 'REDACTED'
+    'x-ms-srs': 'REDACTED'
+    'Content-Security-Policy-Report-Only': 'REDACTED'
+    'Cross-Origin-Opener-Policy-Report-Only': 'REDACTED'
+    'Reporting-Endpoints': 'REDACTED'
+    'X-XSS-Protection': 'REDACTED'
+    'Set-Cookie': 'REDACTED'
+    'X-Cache': 'REDACTED'
+2025-10-02 14:47:42,467:azure.identity._internal.interactive:INFO:InteractiveBrowserCredential.get_token succeeded
+2025-10-02 14:47:42,884:__main__:INFO:AudioProcessor initialized with 24kHz PCM16 mono audio
+2025-10-02 14:47:42,884:__main__:INFO:Setting up voice conversation session...
+2025-10-02 14:47:42,887:__main__:INFO:Session configuration sent
+2025-10-02 14:47:42,943:__main__:INFO:Audio playback system ready
+2025-10-02 14:47:42,943:__main__:INFO:Voice assistant ready! Start speaking...
+2025-10-02 14:47:42,975:__main__:INFO:Session ready: sess_CMLRGjWnakODcHn583fXf
+2025-10-02 14:47:42,994:__main__:INFO:Started audio capture
+2025-10-02 14:47:47,513:__main__:INFO:\U0001f3a4 User started speaking - stopping playback
+2025-10-02 14:47:47,593:__main__:INFO:Stopped audio playback
+2025-10-02 14:47:51,757:__main__:INFO:\U0001f3a4 User stopped speaking
+2025-10-02 14:47:51,813:__main__:INFO:Audio playback system ready
+2025-10-02 14:47:51,816:__main__:INFO:\U0001f916 Assistant response created
+2025-10-02 14:47:58,009:__main__:INFO:\U0001f916 Assistant finished speaking
+2025-10-02 14:47:58,009:__main__:INFO:\u2705 Response complete
+2025-10-02 14:48:07,309:__main__:INFO:Received shutdown signal
 ```
````
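The logging setup shown in the diff's `python` block can be sketched as a self-contained snippet. Note that the committed block drops the closing parenthesis of `basicConfig`; it is restored here, and `datetime` is an assumed source for `timestamp`, which the excerpt does not define:

```python
import logging
import os
from datetime import datetime

# Create the logs folder and a timestamped log file, mirroring the
# quickstart's setup; level=logging.DEBUG corresponds to --verbose.
os.makedirs("logs", exist_ok=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
logging.basicConfig(
    filename=f"logs/{timestamp}_voicelive.log",
    filemode="w",
    level=logging.DEBUG,
    format="%(asctime)s:%(name)s:%(levelname)s:%(message)s",
)
logging.getLogger(__name__).info("Voice live logging configured")
```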

articles/ai-services/speech-service/includes/quickstarts/voice-live-api/resource-authentication.md

Lines changed: 8 additions & 5 deletions
````diff
@@ -1,9 +1,12 @@
 ---
-author: goergenj
-ms.author: jagoerge
-ms.service: azure-ai-speech
+manager: nitinme
+author: goergenj
+ms.author: jagoerge
+reviewer: patrickfarley
+ms.reviewer: pafarley
+ms.service: azure-ai-openai
 ms.topic: include
-ms.date: 9/26/2025
+ms.date: 10/02/2025
 ---

 Create a new file named `.env` in the folder where you want to run the code.
@@ -12,7 +15,7 @@ In the `.env` file, add the following environment variables for authentication:

 ```plaintext
 AZURE_VOICELIVE_ENDPOINT=<your_endpoint>
-VOICELIVE_MODEL=<your_model>
+AZURE_VOICELIVE_MODEL=<your_model>
 AZURE_VOICELIVE_API_VERSION=2025-10-01
 AZURE_VOICELIVE_API_KEY=<your_api_key> # Only required if using API key authentication
 ```
````
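With the variable renamed to `AZURE_VOICELIVE_MODEL`, code reading the old name no longer finds a value. A hedged sketch of a reader with a transition fallback (the helper, the fallback behavior, and the `gpt-realtime` default are illustrative assumptions, not part of the quickstart):

```python
import os

# Hypothetical helper: prefer the new variable name, fall back to the old
# one so an existing .env keeps working during the rename transition.
def resolve_model() -> str:
    return (
        os.environ.get("AZURE_VOICELIVE_MODEL")
        or os.environ.get("VOICELIVE_MODEL")
        or "gpt-realtime"  # assumed default, for illustration only
    )

os.environ["AZURE_VOICELIVE_MODEL"] = "gpt-realtime"
print(resolve_model())  # → gpt-realtime
```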

articles/ai-services/speech-service/voice-live.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -63,9 +63,9 @@ The API is supported through WebSocket events, allowing for an easy server-to-se

 ## Supported models and regions

-To power the intelligence of your voice agent, you have flexibility and choice in the generative AI model between GPT-4o, GPT-4o-mini, and Phi. Different generative AI models provide different types of capabilities, levels of intelligence, speed/latency of inferencing, and cost. Depending on what matters most for your business and use case, you can choose the model that best suits your needs.
+To power the intelligence of your voice agent, you have flexibility and choice in the generative AI model between GPT-Realtime, GPT-5, GPT-4.1, Phi, and more options. Different generative AI models provide different types of capabilities, levels of intelligence, speed/latency of inferencing, and cost. Depending on what matters most for your business and use case, you can choose the model that best suits your needs.

-All natively supported models – GPT-4o, GPT-4o-mini, and Phi – are fully managed, meaning you don’t have to deploy models, worry about capacity planning, or provisioning throughput. You can use the model you need, and the Voice live API takes care of the rest.
+All natively supported models are fully managed, meaning you don’t have to deploy models, worry about capacity planning, or provisioning throughput. You can use the model you need, and the Voice live API takes care of the rest.

 The Voice live API supports the following models. For supported regions, see the [Azure AI Speech service regions](./regions.md?tabs=voice-live#regions).

@@ -118,7 +118,7 @@ If you choose to use custom voice for your speech output, you're charged separat

 Avatars are charged separately with [the interactive avatar pricing published here.](https://azure.microsoft.com/pricing/details/cognitive-services/speech-services)

-For more details regarding custom voice and avatar training charges, [refer to this pricing note.](/azure/ai-services/speech-service/text-to-speech#model-training-and-hosting-time-for-custom-voice)
+For more information regarding custom voice and avatar training charges, [refer to this pricing note.](/azure/ai-services/speech-service/text-to-speech#model-training-and-hosting-time-for-custom-voice)

 ### Example pricing scenarios
```
