Commit 521b9c6

Merge pull request #205814 from sally-baolian/FacialMotion
Update blendshapes Cog Svcs (Release on 8/5)
2 parents f37344e + 82ed6f1 commit 521b9c6

5 files changed: +392 −118 lines changed


articles/cognitive-services/Speech-Service/how-to-speech-synthesis-viseme.md

Lines changed: 151 additions & 23 deletions
@@ -15,14 +15,14 @@ ms.custom: references_regions
zone_pivot_groups: programming-languages-speech-services-nomore-variant
---

-# Get facial pose events for lip-sync
+# Get facial position with viseme

> [!NOTE]
-> At this time, viseme events are available only for [neural voices](language-support.md#text-to-speech).
+> Viseme ID supports neural voices in [all viseme-supported locales](language-support.md#viseme). Scalable Vector Graphics (SVG) only supports neural voices in the `en-US` locale, and blend shapes supports neural voices in the `en-US` and `zh-CN` locales.

-A _viseme_ is the visual description of a phoneme in spoken language. It defines the position of the face and mouth when a person speaks a word. Each viseme depicts the key facial poses for a specific set of phonemes.
+A *viseme* is the visual description of a phoneme in spoken language. It defines the position of the face and mouth while a person is speaking. Each viseme depicts the key facial poses for a specific set of phonemes.

-You can use visemes to control the movement of 2D and 3D avatar models, so that the mouth movements are perfectly matched to synthetic speech. For example, you can:
+You can use visemes to control the movement of 2D and 3D avatar models, so that the facial positions are best aligned with synthetic speech. For example, you can:

* Create an animated virtual voice assistant for intelligent kiosks, building multi-mode integrated services for your customers.
* Build immersive news broadcasts and improve audience experiences with natural face and mouth movements.
@@ -33,21 +33,39 @@ You can use visemes to control the movement of 2D and 3D avatar models, so that
For more information about visemes, view this [introductory video](https://youtu.be/ui9XT47uwxs).
> [!VIDEO https://www.youtube.com/embed/ui9XT47uwxs]

-## Azure Neural TTS can produce visemes with speech
+## Overall workflow of producing viseme with speech

-Neural Text-to-Speech (Neural TTS) turns input text or SSML (Speech Synthesis Markup Language) into lifelike synthesized speech. Speech audio output can be accompanied by viseme IDs and their offset timestamps. Each viseme ID specifies a specific pose in observed speech, such as the position of the lips, jaw, and tongue when producing a particular phoneme. Using a 2D or 3D rendering engine, you can use these viseme events to animate your avatar.
+Neural Text-to-Speech (Neural TTS) turns input text or SSML (Speech Synthesis Markup Language) into lifelike synthesized speech. Speech audio output can be accompanied by viseme ID, Scalable Vector Graphics (SVG), or blend shapes. Using a 2D or 3D rendering engine, you can use these viseme events to animate your avatar.

The overall workflow of viseme is depicted in the following flowchart:

![Diagram of the overall workflow of viseme.](media/text-to-speech/viseme-structure.png)

-*Viseme ID* and *audio offset output* are described in the following table:
+You can request viseme output in SSML. For details, see [how to use viseme element in SSML](speech-synthesis-markup.md#viseme-element).
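
As a rough sketch of such a request (the `mstts:viseme` element and its `type` value here are assumptions; the linked SSML article is authoritative), you could build the SSML string in Python and pass it to the synthesizer shown later in this diff:

```python
# Hypothetical SSML requesting blend shapes output; verify the exact viseme
# element syntax and `type` value against the linked SSML documentation.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:viseme type="FacialExpression"/>
    Rainbow has seven colors.
  </voice>
</speak>
"""

result = speech_synthesizer.speak_ssml_async(ssml).get()
```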

-| Visme element | Description |
-|-----------|-------------|
-| Viseme ID | An integer number that specifies a viseme.<br>For English (US), we offer 22 different visemes, each depicting the mouth shape for a specific set of phonemes. There is no one-to-one correspondence between visemes and phonemes. Often, several phonemes correspond to a single viseme, because they look the same on the speaker's face when they're produced, such as `s` and `z`. For more specific information, see the table for [mapping phonemes to viseme IDs](#map-phonemes-to-visemes). |
-| Audio offset | The start time of each viseme, in ticks (100 nanoseconds). |
+## Viseme ID

+Viseme ID refers to an integer number that specifies a viseme. We offer 22 different visemes, each depicting the mouth shape for a specific set of phonemes. There's no one-to-one correspondence between visemes and phonemes. Often, several phonemes correspond to a single viseme, because they look the same on the speaker's face when they're produced, such as `s` and `z`. For more specific information, see the table for [mapping phonemes to viseme IDs](#map-phonemes-to-visemes).
+
+Speech audio output can be accompanied by viseme IDs and `Audio offset`. The `Audio offset` indicates the offset timestamp that represents the start time of each viseme, in ticks (100 nanoseconds).
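
Because one tick is 100 nanoseconds, dividing the offset by 10,000 converts it to milliseconds, which is exactly what the SDK snippets below do; a minimal sketch with a hypothetical offset value:

```python
TICKS_PER_MILLISECOND = 10_000  # 1 tick = 100 ns, so 10,000 ticks = 1 ms

def ticks_to_milliseconds(audio_offset_ticks):
    """Convert a viseme `Audio offset` from ticks to milliseconds."""
    return audio_offset_ticks / TICKS_PER_MILLISECOND

print(ticks_to_milliseconds(2_000_000))  # 200.0 ms (hypothetical offset)
```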
+
+### Map phonemes to visemes
+
+Visemes vary by language and locale. Each locale has a set of visemes that correspond to its specific phonemes. The [SSML phonetic alphabets](speech-ssml-phonetic-sets.md) documentation maps viseme IDs to the corresponding International Phonetic Alphabet (IPA) phonemes.
+
+## 2D SVG animation
+
+For 2D characters, you can design a character that suits your scenario and use Scalable Vector Graphics (SVG) for each viseme ID to get a time-based face position.
+
+With temporal tags that are provided in a viseme event, these well-designed SVGs will be processed with smoothing modifications, and provide robust animation to the users. For example, the following illustration shows a red-lipped character that's designed for language learning.
+
+![Screenshot showing a 2D rendering example of four red-lipped mouths, each representing a different viseme ID that corresponds to a phoneme.](media/text-to-speech/viseme-demo-2D.png)
+
+## 3D blend shapes animation
+
+You can use blend shapes to drive the facial movements of a 3D character that you designed.
+
+The blend shapes JSON string is represented as a 2-dimensional matrix. Each row represents a frame. Each frame (in 60 FPS) contains an array of 55 facial positions.
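
To picture that layout, here's a minimal sketch (with made-up values) of the matrix shape and the per-frame timing that a 60 FPS frame rate implies:

```python
# Hypothetical blend shapes matrix: one row per frame, 55 values per row.
blend_shapes = [
    [0.021, 0.321, 0.258] + [0.0] * 52,  # frame 0
    [0.045, 0.234, 0.288] + [0.0] * 52,  # frame 1
]

FRAME_RATE = 60                        # frames per second
frame_duration_ms = 1000 / FRAME_RATE  # each row covers about 16.7 ms of audio
assert all(len(frame) == 55 for frame in blend_shapes)
```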

## Get viseme events with the Speech SDK

@@ -65,9 +83,13 @@ using (var synthesizer = new SpeechSynthesizer(speechConfig, audioConfig))
{
Console.WriteLine($"Viseme event received. Audio offset: " +
$"{e.AudioOffset / 10000}ms, viseme id: {e.VisemeId}.");
+
+// `Animation` is an xml string for SVG or a json string for blend shapes
+var animation = e.Animation;
};

-var result = await synthesizer.SpeakSsmlAsync(ssml));
+// If VisemeID is the only thing you want, you can also use `SpeakTextAsync()`
+var result = await synthesizer.SpeakSsmlAsync(ssml);
}

```
@@ -86,8 +108,12 @@ synthesizer->VisemeReceived += [](const SpeechSynthesisVisemeEventArgs& e)
// The unit of e.AudioOffset is tick (1 tick = 100 nanoseconds), divide by 10,000 to convert to milliseconds.
<< "Audio offset: " << e.AudioOffset / 10000 << "ms, "
<< "viseme id: " << e.VisemeId << "." << endl;
+
+// `Animation` is an xml string for SVG or a json string for blend shapes
+auto animation = e.Animation;
};

+// If VisemeID is the only thing you want, you can also use `SpeakTextAsync()`
auto result = synthesizer->SpeakSsmlAsync(ssml).get();
```
@@ -103,8 +129,12 @@ synthesizer.VisemeReceived.addEventListener((o, e) -> {
// The unit of e.AudioOffset is tick (1 tick = 100 nanoseconds), divide by 10,000 to convert to milliseconds.
System.out.print("Viseme event received. Audio offset: " + e.getAudioOffset() / 10000 + "ms, ");
System.out.println("viseme id: " + e.getVisemeId() + ".");
+
+// `Animation` is an xml string for SVG or a json string for blend shapes
+String animation = e.getAnimation();
});

+// If VisemeID is the only thing you want, you can also use `SpeakTextAsync()`
SpeechSynthesisResult result = synthesizer.SpeakSsmlAsync(ssml).get();
```

@@ -115,10 +145,17 @@ SpeechSynthesisResult result = synthesizer.SpeakSsmlAsync(ssml).get();
```Python
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

+def viseme_cb(evt):
+    print("Viseme event received: audio offset: {}ms, viseme id: {}.".format(
+        evt.audio_offset / 10000, evt.viseme_id))
+
+    # `Animation` is an xml string for SVG or a json string for blend shapes
+    animation = evt.animation
+
# Subscribes to viseme received event
-speech_synthesizer.viseme_received.connect(lambda evt: print(
-"Viseme event received: audio offset: {}ms, viseme id: {}.".format(evt.audio_offset / 10000, evt.viseme_id)))
+speech_synthesizer.viseme_received.connect(viseme_cb)

+# If VisemeID is the only thing you want, you can also use `speak_text_async()`
result = speech_synthesizer.speak_ssml_async(ssml).get()
```
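
Building on the Python callback above, a minimal sketch that collects each event into a timeline for a later animation pass (the `viseme_timeline` name and `collect_viseme` helper are hypothetical):

```python
viseme_timeline = []  # list of (offset_ms, viseme_id, animation) tuples

def collect_viseme(evt):
    offset_ms = evt.audio_offset / 10000  # ticks (100 ns) -> milliseconds
    # evt.animation carries the SVG or blend shapes payload when that output is requested in SSML
    viseme_timeline.append((offset_ms, evt.viseme_id, evt.animation))

speech_synthesizer.viseme_received.connect(collect_viseme)
result = speech_synthesizer.speak_ssml_async(ssml).get()
```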

@@ -132,8 +169,12 @@ var synthesizer = new SpeechSDK.SpeechSynthesizer(speechConfig, audioConfig);
// Subscribes to viseme received event
synthesizer.visemeReceived = function (s, e) {
window.console.log("(Viseme), Audio offset: " + e.audioOffset / 10000 + "ms. Viseme ID: " + e.visemeId);
+
+// `Animation` is an xml string for SVG or a json string for blend shapes
+var animation = e.Animation;
}

+// If VisemeID is the only thing you want, you can also use `speakTextAsync()`
synthesizer.speakSsmlAsync(ssml);
```

@@ -149,14 +190,20 @@ SPXSpeechSynthesizer *synthesizer =
// Subscribes to viseme received event
[synthesizer addVisemeReceivedEventHandler: ^ (SPXSpeechSynthesizer *synthesizer, SPXSpeechSynthesisVisemeEventArgs *eventArgs) {
NSLog(@"Viseme event received. Audio offset: %fms, viseme id: %lu.", eventArgs.audioOffset/10000., eventArgs.visemeId);
+
+// `Animation` is an xml string for SVG or a json string for blend shapes
+NSString *animation = eventArgs.Animation;
}];

+// If VisemeID is the only thing you want, you can also use `SpeakText`
[synthesizer speakSsml:ssml];
```

::: zone-end

-Here is an example of the viseme output.
+Here's an example of the viseme output.
+
+# [Viseme ID](#tab/visemeid)

```text
(Viseme), Viseme ID: 1, Audio offset: 200ms.
@@ -168,20 +215,101 @@ Here is an example of the viseme output.
(Viseme), Viseme ID: 13, Audio offset: 2350ms.
```

-After you obtain the viseme output, you can use these events to drive character animation. You can build your own characters and automatically animate them.
+# [2D SVG](#tab/2dsvg)

-For 2D characters, you can design a character that suits your scenario and use Scalable Vector Graphics (SVG) for each viseme ID to get a time-based face position. With temporal tags that are provided in a viseme event, these well-designed SVGs will be processed with smoothing modifications, and provide robust animation to the users. For example, the following illustration shows a red-lipped character that's designed for language learning.
+The SVG output is an XML string that contains the animation.
+Render the SVG animation along with the synthesized speech to see the mouth movement.

-![Screenshot showing a 2D rendering example of four red-lipped mouths, each representing a different viseme ID that corresponds to a phoneme.](media/text-to-speech/viseme-demo-2D.png)
+```xml
+<svg width= "1200px" height= "1200px" ..>
+<g id= "front_start" stroke= "none" stroke-width= "1" fill= "none" fill-rule= "evenodd">
+<animate attributeName= "d" begin= "d_dh_front_background_1_0.end" dur= "0.27500
+...
+```
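
As one way to consume that output, here's a minimal sketch (assuming the Python callback pattern shown earlier and a hypothetical `viseme_svg` output folder) that writes each SVG animation string to disk, keyed by its audio offset, so a renderer can show it in sync with the audio:

```python
import os

svg_dir = "viseme_svg"  # hypothetical output folder
os.makedirs(svg_dir, exist_ok=True)

def save_svg_viseme(evt):
    # evt.animation is the SVG (XML) string when SVG output is requested in SSML
    if evt.animation:
        offset_ms = evt.audio_offset / 10000  # ticks (100 ns) -> milliseconds
        path = os.path.join(svg_dir, "viseme_{:09.0f}ms.svg".format(offset_ms))
        with open(path, "w", encoding="utf-8") as f:
            f.write(evt.animation)

speech_synthesizer.viseme_received.connect(save_svg_viseme)
```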

-For 3D characters, think of the characters as string puppets. The puppet master pulls the strings from one state to another and the laws of physics do the rest and drive the puppet to move fluidly. The viseme output acts as a puppet master to provide an action timeline. The animation engine defines the physical laws of action. By interpolating frames with easing algorithms, the engine can further generate high-quality animations.
+# [3D blend shapes](#tab/3dblendshapes)

-## Map phonemes to visemes

-Visemes vary by language and locale. Each locale has a set of visemes that correspond to its specific phonemes. The [SSML phonetic alphabets](speech-ssml-phonetic-sets.md) documentation maps viseme IDs to the corresponding International Phonetic Alphabet (IPA) phonemes.
+Each viseme event includes a series of frames in the `Animation` SDK property. These are grouped to best align the facial positions with the audio. Your 3D engine should render each group of `BlendShapes` frames immediately before the corresponding audio chunk. The `FrameIndex` value indicates how many frames preceded the current list of frames.
+
+The output JSON looks like the following sample. Each frame within `BlendShapes` contains an array of 55 facial positions represented as decimal values between 0 and 1. The decimal values are in the same order as described in the facial positions table below.

+```json
+{
+"FrameIndex":0,
+"BlendShapes":[
+[0.021,0.321,...,0.258],
+[0.045,0.234,...,0.288],
+...
+]
+}
+```
249+
250+
| Order | Facial position in `BlendShapes`|
251+
| --------- | ----------- |
252+
| 1 | eyeBlinkLeft|
253+
| 2 | eyeLookDownLeft|
254+
| 3 | eyeLookInLeft|
255+
| 4 | eyeLookOutLeft|
256+
| 5 | eyeLookUpLeft|
257+
| 6 | eyeSquintLeft|
258+
| 7 | eyeWideLeft|
259+
| 8 | eyeBlinkRight|
260+
| 9 | eyeLookDownRight|
261+
| 10 | eyeLookInRight|
262+
| 11 | eyeLookOutRight|
263+
| 12 | eyeLookUpRight|
264+
| 13 | eyeSquintRight|
265+
| 14 | eyeWideRight|
266+
| 15 | jawForward|
267+
| 16 | jawLeft|
268+
| 17 | jawRight|
269+
| 18 | jawOpen|
270+
| 19 | mouthClose|
271+
| 20 | mouthFunnel|
272+
| 21 | mouthPucker|
273+
| 22 | mouthLeft|
274+
| 23 | mouthRight|
275+
| 24 | mouthSmileLeft|
276+
| 25 | mouthSmileRight|
277+
| 26 | mouthFrownLeft|
278+
| 27 | mouthFrownRight|
279+
| 28 | mouthDimpleLeft|
280+
| 29 | mouthDimpleRight|
281+
| 30 | mouthStretchLeft|
282+
| 31 | mouthStretchRight|
283+
| 32 | mouthRollLower|
284+
| 33 | mouthRollUpper|
285+
| 34 | mouthShrugLower|
286+
| 35 | mouthShrugUpper|
287+
| 36 | mouthPressLeft|
288+
| 37 | mouthPressRight|
289+
| 38 | mouthLowerDownLeft|
290+
| 39 | mouthLowerDownRight|
291+
| 40 | mouthUpperUpLeft|
292+
| 41 | mouthUpperUpRight|
293+
| 42 | browDownLeft|
294+
| 43 | browDownRight|
295+
| 44 | browInnerUp|
296+
| 45 | browOuterUpLeft|
297+
| 46 | browOuterUpRight|
298+
| 47 | cheekPuff|
299+
| 48 | cheekSquintLeft|
300+
| 49 | cheekSquintRight|
301+
| 50 | noseSneerLeft|
302+
| 51 | noseSneerRight|
303+
| 52 | tongueOut|
304+
| 53 | headRoll|
305+
| 54 | leftEyeRoll|
306+
| 55 | rightEyeRoll|
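
To use this ordering in code, a minimal sketch that labels each value in a frame (the `BLEND_SHAPE_NAMES` list is abbreviated here; fill in all 55 names in the order of the table above):

```python
BLEND_SHAPE_NAMES = [
    "eyeBlinkLeft", "eyeLookDownLeft", "eyeLookInLeft", "eyeLookOutLeft",
    "eyeLookUpLeft", "eyeSquintLeft", "eyeWideLeft", "eyeBlinkRight",
    # ... continue with the remaining names, ending with "rightEyeRoll"
]

def label_frame(frame_values):
    """Map one 55-value frame to {facial position name: value}."""
    return dict(zip(BLEND_SHAPE_NAMES, frame_values))
```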
+
+---
+
+After you obtain the viseme output, you can use these events to drive character animation. You can build your own characters and automatically animate them.

## Next steps

-> [!div class="nextstepaction"]
-> [SSML phonetic alphabets](speech-ssml-phonetic-sets.md)
+- [SSML phonetic alphabets](speech-ssml-phonetic-sets.md)
+- [How to improve synthesis with SSML](speech-synthesis-markup.md)

articles/cognitive-services/Speech-Service/language-support.md

Lines changed: 114 additions & 0 deletions
@@ -961,6 +961,120 @@ There are two Custom Neural Voice (CNV) project types: CNV Pro and CNV Lite (pre
| Turkish (Turkey) | `tr-TR` | No |No|
| Vietnamese (Vietnam) | `vi-VN` | No |No|

+### Viseme
+
+A _viseme_ is the visual description of a phoneme in spoken language. It defines the position of the face and mouth while a person is speaking. Each viseme depicts the key facial poses for a specific set of phonemes. Speech audio output can be accompanied by a viseme ID, Scalable Vector Graphics (SVG), or blend shapes. For more information, see [Get facial position with viseme](how-to-speech-synthesis-viseme.md).
+
+> [!NOTE]
+> Viseme ID supports [neural voices](#text-to-speech) in the locales listed below. SVG only supports neural voices in the `en-US` locale, and blend shapes supports neural voices in the `en-US` and `zh-CN` locales.
+
+The following table lists the languages supported by viseme ID.
+
+| Language | Locale |
+|---|---|
+| Arabic (Algeria) | `ar-DZ` |
+| Arabic (Bahrain) | `ar-BH` |
+| Arabic (Egypt) | `ar-EG` |
+| Arabic (Iraq) | `ar-IQ` |
+| Arabic (Jordan) | `ar-JO` |
+| Arabic (Kuwait) | `ar-KW` |
+| Arabic (Lebanon) | `ar-LB` |
+| Arabic (Libya) | `ar-LY` |
+| Arabic (Morocco) | `ar-MA` |
+| Arabic (Oman) | `ar-OM` |
+| Arabic (Qatar) | `ar-QA` |
+| Arabic (Saudi Arabia) | `ar-SA` |
+| Arabic (Syria) | `ar-SY` |
+| Arabic (Tunisia) | `ar-TN` |
+| Arabic (United Arab Emirates) | `ar-AE` |
+| Arabic (Yemen) | `ar-YE` |
+| Bulgarian (Bulgaria) | `bg-BG` |
+| Catalan (Spain) | `ca-ES` |
+| Chinese (Cantonese, Traditional) | `zh-HK` |
+| Chinese (Mandarin, Simplified) | `zh-CN` |
+| Chinese (Taiwanese Mandarin) | `zh-TW` |
+| Croatian (Croatia) | `hr-HR` |
+| Czech (Czech) | `cs-CZ` |
+| Danish (Denmark) | `da-DK` |
+| Dutch (Belgium) | `nl-BE` |
+| Dutch (Netherlands) | `nl-NL` |
+| English (Australia) | `en-AU` |
+| English (Canada) | `en-CA` |
+| English (Hongkong) | `en-HK` |
+| English (India) | `en-IN` |
+| English (Ireland) | `en-IE` |
+| English (Kenya) | `en-KE` |
+| English (New Zealand) | `en-NZ` |
+| English (Nigeria) | `en-NG` |
+| English (Philippines) | `en-PH` |
+| English (Singapore) | `en-SG` |
+| English (South Africa) | `en-ZA` |
+| English (Tanzania) | `en-TZ` |
+| English (United Kingdom) | `en-GB` |
+| English (United States) | `en-US` |
+| Finnish (Finland) | `fi-FI` |
+| French (Belgium) | `fr-BE` |
+| French (Canada) | `fr-CA` |
+| French (France) | `fr-FR` |
+| French (Switzerland) | `fr-CH` |
+| German (Austria) | `de-AT` |
+| German (Germany) | `de-DE` |
+| German (Switzerland) | `de-CH` |
+| Greek (Greece) | `el-GR` |
+| Gujarati (India) | `gu-IN` |
+| Hebrew (Israel) | `he-IL` |
+| Hindi (India) | `hi-IN` |
+| Hungarian (Hungary) | `hu-HU` |
+| Indonesian (Indonesia) | `id-ID` |
+| Italian (Italy) | `it-IT` |
+| Japanese (Japan) | `ja-JP` |
+| Korean (Korea) | `ko-KR` |
+| Malay (Malaysia) | `ms-MY` |
+| Marathi (India) | `mr-IN` |
+| Norwegian (Bokmål, Norway) | `nb-NO` |
+| Polish (Poland) | `pl-PL` |
+| Portuguese (Brazil) | `pt-BR` |
+| Portuguese (Portugal) | `pt-PT` |
+| Romanian (Romania) | `ro-RO` |
+| Russian (Russia) | `ru-RU` |
+| Slovak (Slovakia) | `sk-SK` |
+| Slovenian (Slovenia) | `sl-SI` |
+| Spanish (Argentina) | `es-AR` |
+| Spanish (Bolivia) | `es-BO` |
+| Spanish (Chile) | `es-CL` |
+| Spanish (Colombia) | `es-CO` |
+| Spanish (Costa Rica) | `es-CR` |
+| Spanish (Cuba) | `es-CU` |
+| Spanish (Dominican Republic) | `es-DO` |
+| Spanish (Ecuador) | `es-EC` |
+| Spanish (El Salvador) | `es-SV` |
+| Spanish (Equatorial Guinea) | `es-GQ` |
+| Spanish (Guatemala) | `es-GT` |
+| Spanish (Honduras) | `es-HN` |
+| Spanish (Mexico) | `es-MX` |
+| Spanish (Nicaragua) | `es-NI` |
+| Spanish (Panama) | `es-PA` |
+| Spanish (Paraguay) | `es-PY` |
+| Spanish (Peru) | `es-PE` |
+| Spanish (Puerto Rico) | `es-PR` |
+| Spanish (Spain) | `es-ES` |
+| Spanish (Uruguay) | `es-UY` |
+| Spanish (US) | `es-US` |
+| Spanish (Venezuela) | `es-VE` |
+| Swahili (Tanzania) | `sw-TZ` |
+| Swedish (Sweden) | `sv-SE` |
+| Tamil (India) | `ta-IN` |
+| Tamil (Malaysia) | `ta-MY` |
+| Tamil (Singapore) | `ta-SG` |
+| Tamil (Sri Lanka) | `ta-LK` |
+| Telugu (India) | `te-IN` |
+| Thai (Thailand) | `th-TH` |
+| Turkish (Turkey) | `tr-TR` |
+| Ukrainian (Ukraine) | `uk-UA` |
+| Urdu (India) | `ur-IN` |
+| Urdu (Pakistan) | `ur-PK` |
+| Vietnamese (Vietnam) | `vi-VN` |
+
## Language identification

With language identification, you set and get one of the supported locales in the following table. We only compare at the language level, such as English and German. If you include multiple locales of the same language, for example, `en-IN` and `en-US`, we'll only compare English (`en`) with the other candidate languages.