
Commit acf59b6

[VoiceLive] Release 1.2.0b2 with MCP fix (#44101)

Authored by xitzhang (Xiting Zhang)

* [VoiceLive] Release 1.2.0b2 with MCP fix
* update codeowner

Co-authored-by: Xiting Zhang <[email protected]>

1 parent: 046b973

File tree

8 files changed: +345 additions, -28 deletions

.github/CODEOWNERS

Lines changed: 1 addition & 1 deletion

@@ -261,7 +261,7 @@

 # PRLabel: %Voice Live
 # ServiceLabel: %Voice Live %Service Attention
-/sdk/ai/azure-ai-voicelive/ @rhurey @xitzhang
+/sdk/ai/azure-ai-voicelive/ @rhurey @xitzhang @amber-yujueWang


 # PRLabel: %HDInsight

.vscode/cspell.json

Lines changed: 1 addition & 1 deletion

@@ -2187,7 +2187,7 @@
     },
     {
       "filename": "sdk/ai/azure-ai-voicelive/**",
-      "words": ["viseme","VISEME","ulaw","ULAW","logprobs","pyaudio","PyAudio","libasound"]
+      "words": ["viseme","VISEME","ulaw","ULAW","logprobs","pyaudio","PyAudio","libasound","webrtc","WEBRTC"]
     }
   ],
   "allowCompoundWords": true

sdk/ai/azure-ai-voicelive/CHANGELOG.md

Lines changed: 22 additions & 6 deletions

@@ -1,14 +1,30 @@
 # Release History

-## 1.2.0b2 (Unreleased)
+## 1.2.0b2 (2025-11-20)

 ### Features Added

-### Breaking Changes
-
-### Bugs Fixed
-
-### Other Changes
+- **Enhanced Avatar Configuration**: Expanded avatar functionality with new configuration options:
+  - Added `AvatarConfigTypes` enum with support for `video-avatar` and `photo-avatar` types
+  - Added `PhotoAvatarBaseModes` enum for photo avatar base models (e.g., `vasa-1`)
+  - Added `AvatarOutputProtocol` enum for avatar streaming protocols (`webrtc`, `websocket`)
+  - Enhanced `AvatarConfig` model with new properties: `type`, `model`, and `output_protocol`
+- **Image Content Support**: Added support for image inputs in conversations:
+  - New `RequestImageContentPart` model for including images in requests
+  - New `RequestImageContentPartDetail` enum for controlling image detail levels (`auto`, `low`, `high`)
+  - Added `INPUT_IMAGE` to `ContentPartType` enum
+  - Enhanced token details models (`InputTokenDetails`, `CachedTokenDetails`) with `image_tokens` tracking
+- **Enhanced OpenAI Voices**: Added new OpenAI voice options:
+  - Added `marin` and `cedar` voices to `OpenAIVoiceName` enum
+- **Extended Azure Personal Voice Configuration**: Enhanced `AzurePersonalVoice` with additional customization options:
+  - Added support for custom lexicon via `custom_lexicon_url`
+  - Added `prefer_locales` for locale preferences
+  - Added `locale`, `style`, `pitch`, `rate`, and `volume` properties for fine-tuned voice control
+- **Enhanced MCP Server Events**: Added completion status events for MCP tool calls:
+  - `ServerEventResponseMcpCallInProgress` for tracking in-progress MCP calls
+  - `ServerEventResponseMcpCallCompleted` for successful MCP call completion
+  - `ServerEventResponseMcpCallFailed` for failed MCP calls
+- **Pre-generated Assistant Messages**: Added support for pre-generated assistant messages in `ResponseCreateParams` via the `pre_generated_assistant_message` property

 ## 1.2.0b1 (2025-11-14)
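The new avatar enums in this release map to plain string values on the wire. Below is a minimal, self-contained sketch using stand-in enum definitions that mirror the values listed in the changelog; the real classes live in `azure.ai.voicelive.models`, and the exact payload shape built from `AvatarConfig`'s new `type`, `model`, and `output_protocol` properties is an assumption here, not the SDK's documented serialization.

```python
import json
from enum import Enum

# Stand-in definitions mirroring the values named in the changelog; the real
# enums live in azure.ai.voicelive.models and use CaseInsensitiveEnumMeta.
class AvatarConfigTypes(str, Enum):
    VIDEO_AVATAR = "video-avatar"
    PHOTO_AVATAR = "photo-avatar"

class AvatarOutputProtocol(str, Enum):
    WEBRTC = "webrtc"
    WEBSOCKET = "websocket"

class PhotoAvatarBaseModes(str, Enum):
    VASA1 = "vasa-1"

# Hypothetical avatar configuration payload; the three property names come
# from the changelog, the dict shape itself is an illustrative assumption.
avatar_config = {
    "type": AvatarConfigTypes.PHOTO_AVATAR.value,
    "model": PhotoAvatarBaseModes.VASA1.value,
    "output_protocol": AvatarOutputProtocol.WEBRTC.value,
}

print(json.dumps(avatar_config, indent=2))
```

Because the enums subclass `str`, their `.value` strings serialize directly with `json.dumps`.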

sdk/ai/azure-ai-voicelive/apiview-properties.json

Lines changed: 8 additions & 0 deletions

@@ -61,6 +61,7 @@
     "azure.ai.voicelive.models.OutputTextContentPart": "VoiceLive.OutputTextContentPart",
     "azure.ai.voicelive.models.OutputTokenDetails": "VoiceLive.OutputTokenDetails",
     "azure.ai.voicelive.models.RequestAudioContentPart": "VoiceLive.RequestAudioContentPart",
+    "azure.ai.voicelive.models.RequestImageContentPart": "VoiceLive.RequestImageContentPart",
     "azure.ai.voicelive.models.RequestSession": "VoiceLive.RequestSession",
     "azure.ai.voicelive.models.RequestTextContentPart": "VoiceLive.RequestTextContentPart",
     "azure.ai.voicelive.models.Response": "VoiceLive.Response",
@@ -115,6 +116,9 @@
     "azure.ai.voicelive.models.ServerEventResponseFunctionCallArgumentsDone": "VoiceLive.ServerEventResponseFunctionCallArgumentsDone",
     "azure.ai.voicelive.models.ServerEventResponseMcpCallArgumentsDelta": "VoiceLive.ServerEventResponseMcpCallArgumentsDelta",
     "azure.ai.voicelive.models.ServerEventResponseMcpCallArgumentsDone": "VoiceLive.ServerEventResponseMcpCallArgumentsDone",
+    "azure.ai.voicelive.models.ServerEventResponseMcpCallCompleted": "VoiceLive.ServerEventResponseMcpCallCompleted",
+    "azure.ai.voicelive.models.ServerEventResponseMcpCallFailed": "VoiceLive.ServerEventResponseMcpCallFailed",
+    "azure.ai.voicelive.models.ServerEventResponseMcpCallInProgress": "VoiceLive.ServerEventResponseMcpCallInProgress",
     "azure.ai.voicelive.models.ServerEventResponseOutputItemAdded": "VoiceLive.ServerEventResponseOutputItemAdded",
     "azure.ai.voicelive.models.ServerEventResponseOutputItemDone": "VoiceLive.ServerEventResponseOutputItemDone",
     "azure.ai.voicelive.models.ServerEventResponseTextDelta": "VoiceLive.ServerEventResponseTextDelta",
@@ -149,10 +153,14 @@
     "azure.ai.voicelive.models.InputAudioFormat": "VoiceLive.InputAudioFormat",
     "azure.ai.voicelive.models.TurnDetectionType": "VoiceLive.TurnDetectionType",
     "azure.ai.voicelive.models.EouThresholdLevel": "VoiceLive.EouThresholdLevel",
+    "azure.ai.voicelive.models.AvatarConfigTypes": "VoiceLive.AvatarConfigTypes",
+    "azure.ai.voicelive.models.PhotoAvatarBaseModes": "VoiceLive.PhotoAvatarBaseModes",
+    "azure.ai.voicelive.models.AvatarOutputProtocol": "VoiceLive.AvatarOutputProtocol",
     "azure.ai.voicelive.models.AudioTimestampType": "VoiceLive.AudioTimestampType",
     "azure.ai.voicelive.models.ToolChoiceLiteral": "VoiceLive.ToolChoiceLiteral",
     "azure.ai.voicelive.models.ResponseStatus": "VoiceLive.ResponseStatus",
     "azure.ai.voicelive.models.ResponseItemStatus": "VoiceLive.ResponseItemStatus",
+    "azure.ai.voicelive.models.RequestImageContentPartDetail": "VoiceLive.RequestImageContentPartDetail",
     "azure.ai.voicelive.models.ServerEventType": "VoiceLive.ServerEventType"
   }
 }

sdk/ai/azure-ai-voicelive/azure/ai/voicelive/models/__init__.py

Lines changed: 16 additions & 0 deletions

@@ -72,6 +72,7 @@
     OutputTextContentPart,
     OutputTokenDetails,
     RequestAudioContentPart,
+    RequestImageContentPart,
     RequestSession,
     RequestTextContentPart,
     Response,
@@ -126,6 +127,9 @@
     ServerEventResponseFunctionCallArgumentsDone,
     ServerEventResponseMcpCallArgumentsDelta,
     ServerEventResponseMcpCallArgumentsDone,
+    ServerEventResponseMcpCallCompleted,
+    ServerEventResponseMcpCallFailed,
+    ServerEventResponseMcpCallInProgress,
     ServerEventResponseOutputItemAdded,
     ServerEventResponseOutputItemDone,
     ServerEventResponseTextDelta,
@@ -151,6 +155,8 @@
 from ._enums import (  # type: ignore
     AnimationOutputType,
     AudioTimestampType,
+    AvatarConfigTypes,
+    AvatarOutputProtocol,
     AzureVoiceType,
     ClientEventType,
     ContentPartType,
@@ -164,6 +170,8 @@
     OpenAIVoiceName,
     OutputAudioFormat,
     PersonalVoiceModels,
+    PhotoAvatarBaseModes,
+    RequestImageContentPartDetail,
     ResponseItemStatus,
     ResponseStatus,
     ServerEventType,
@@ -234,6 +242,7 @@
     "OutputTextContentPart",
     "OutputTokenDetails",
     "RequestAudioContentPart",
+    "RequestImageContentPart",
     "RequestSession",
     "RequestTextContentPart",
     "Response",
@@ -288,6 +297,9 @@
     "ServerEventResponseFunctionCallArgumentsDone",
     "ServerEventResponseMcpCallArgumentsDelta",
     "ServerEventResponseMcpCallArgumentsDone",
+    "ServerEventResponseMcpCallCompleted",
+    "ServerEventResponseMcpCallFailed",
+    "ServerEventResponseMcpCallInProgress",
     "ServerEventResponseOutputItemAdded",
     "ServerEventResponseOutputItemDone",
     "ServerEventResponseTextDelta",
@@ -310,6 +322,8 @@
     "VoiceLiveErrorDetails",
     "AnimationOutputType",
     "AudioTimestampType",
+    "AvatarConfigTypes",
+    "AvatarOutputProtocol",
     "AzureVoiceType",
     "ClientEventType",
     "ContentPartType",
@@ -323,6 +337,8 @@
     "OpenAIVoiceName",
     "OutputAudioFormat",
     "PersonalVoiceModels",
+    "PhotoAvatarBaseModes",
+    "RequestImageContentPartDetail",
     "ResponseItemStatus",
     "ResponseStatus",
     "ServerEventType",
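The three newly exported MCP call events are what a client message loop would branch on to track a tool call's lifecycle. A hedged sketch with stand-in dataclasses: only the class names come from this diff, while the `item_id` and `error` fields are hypothetical illustrations, not the SDK's actual attributes.

```python
from dataclasses import dataclass

# Stand-ins for the newly exported server events. Only the class names are
# taken from the diff; the fields below are hypothetical.
@dataclass
class ServerEventResponseMcpCallInProgress:
    item_id: str

@dataclass
class ServerEventResponseMcpCallCompleted:
    item_id: str

@dataclass
class ServerEventResponseMcpCallFailed:
    item_id: str
    error: str

def describe_mcp_event(event) -> str:
    """Branch on the event class, as a client message loop might."""
    if isinstance(event, ServerEventResponseMcpCallInProgress):
        return f"MCP call {event.item_id}: in progress"
    if isinstance(event, ServerEventResponseMcpCallCompleted):
        return f"MCP call {event.item_id}: completed"
    if isinstance(event, ServerEventResponseMcpCallFailed):
        return f"MCP call {event.item_id}: failed ({event.error})"
    return "unhandled event"

print(describe_mcp_event(ServerEventResponseMcpCallCompleted(item_id="call_1")))
```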

sdk/ai/azure-ai-voicelive/azure/ai/voicelive/models/_enums.py

Lines changed: 41 additions & 0 deletions

@@ -26,6 +26,24 @@ class AudioTimestampType(str, Enum, metaclass=CaseInsensitiveEnumMeta):
     """Timestamps per word in the output audio."""


+class AvatarConfigTypes(str, Enum, metaclass=CaseInsensitiveEnumMeta):
+    """Avatar config types."""
+
+    VIDEO_AVATAR = "video-avatar"
+    """Video avatar"""
+    PHOTO_AVATAR = "photo-avatar"
+    """Photo avatar"""
+
+
+class AvatarOutputProtocol(str, Enum, metaclass=CaseInsensitiveEnumMeta):
+    """Avatar config output protocols."""
+
+    WEBRTC = "webrtc"
+    """WebRTC protocol, output the audio/video streams via WebRTC"""
+    WEBSOCKET = "websocket"
+    """WebSocket protocol, output the video frames over WebSocket"""
+
+
 class AzureVoiceType(str, Enum, metaclass=CaseInsensitiveEnumMeta):
     """Union of all supported Azure voice types."""

@@ -64,6 +82,7 @@ class ContentPartType(str, Enum, metaclass=CaseInsensitiveEnumMeta):

     INPUT_TEXT = "input_text"
     INPUT_AUDIO = "input_audio"
+    INPUT_IMAGE = "input_image"
     TEXT = "text"
     AUDIO = "audio"

@@ -162,6 +181,10 @@ class OpenAIVoiceName(str, Enum, metaclass=CaseInsensitiveEnumMeta):
     """Shimmer voice."""
     VERSE = "verse"
     """Verse voice."""
+    MARIN = "marin"
+    """Marin voice."""
+    CEDAR = "cedar"
+    """Cedar voice."""


 class OutputAudioFormat(str, Enum, metaclass=CaseInsensitiveEnumMeta):
@@ -190,6 +213,24 @@ class PersonalVoiceModels(str, Enum, metaclass=CaseInsensitiveEnumMeta):
     """Use the Phoenix V2 model."""


+class PhotoAvatarBaseModes(str, Enum, metaclass=CaseInsensitiveEnumMeta):
+    """Photo avatar base modes."""
+
+    VASA1 = "vasa-1"
+    """VASA-1 model"""
+
+
+class RequestImageContentPartDetail(str, Enum, metaclass=CaseInsensitiveEnumMeta):
+    """Specifies an image's detail level. Can be 'auto', 'low', 'high', or an unknown future value."""
+
+    AUTO = "auto"
+    """Automatically select an appropriate detail level."""
+    LOW = "low"
+    """Use a lower detail level to reduce bandwidth or cost."""
+    HIGH = "high"
+    """Use a higher detail level—potentially more resource-intensive."""
+
+
 class ResponseItemStatus(str, Enum, metaclass=CaseInsensitiveEnumMeta):
     """Indicates the processing status of a response item."""
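All of the new enums use the `CaseInsensitiveEnumMeta` metaclass (from `azure.core`), which lets strings like "WebRTC" resolve to the `webrtc` member regardless of casing. The sketch below is a rough, self-contained illustration of how such a metaclass can work; it is the pattern, not azure-core's actual implementation.

```python
from enum import Enum, EnumMeta

class CaseInsensitiveEnumMeta(EnumMeta):
    """Sketch: fall back to a case-insensitive value match on lookup.

    Illustrative only; azure.core's real metaclass differs in detail.
    """

    def __call__(cls, value, *args, **kwargs):
        try:
            # Normal exact-value lookup first.
            return super().__call__(value, *args, **kwargs)
        except ValueError:
            # Fall back: compare member values case-insensitively.
            for member in cls:
                if isinstance(value, str) and member.value.lower() == value.lower():
                    return member
            raise

class AvatarOutputProtocol(str, Enum, metaclass=CaseInsensitiveEnumMeta):
    WEBRTC = "webrtc"
    WEBSOCKET = "websocket"

# "WebRTC" and "webrtc" resolve to the same member.
print(AvatarOutputProtocol("WebRTC") is AvatarOutputProtocol.WEBRTC)  # True
```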
