
Commit e6dc3dd

Merge pull request #267367 from sally-baolian/patch-203
Update custom-avatar-record-video-samples.md
2 parents 7615781 + 7178c61 commit e6dc3dd

14 files changed

+94
-27
lines changed

articles/ai-services/speech-service/text-to-speech-avatar/custom-avatar-record-video-samples.md

Lines changed: 94 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
---
22
title: How to record video samples for custom text to speech avatar - Speech service
33
titleSuffix: Azure AI services
4-
description: Learn how to prepare high-quality video samples for creating a custom text to speech avatar
4+
description: Learn how to prepare high-quality video samples for creating a custom text to speech avatar.
55
author: sally-baolian
66
manager: nitinme
77
ms.service: azure-ai-speech
@@ -17,61 +17,128 @@ keywords: how to record video samples for custom text to speech avatar
1717

1818
This article provides instructions on preparing high-quality video samples for creating a custom text to speech avatar.
1919

20-
Custom text to speech avatar model building requires training on a video recording of a real human speaking. This person is the avatar talent. You must get sufficient consent under all relevant laws and regulations from the avatar talent to create a custom avatar from their talent's image or likeness.
20+
Custom text to speech avatar model building requires training on a video recording of a real human speaking. This person is the avatar talent. You must get sufficient consent under all relevant laws and regulations from the avatar talent to create a custom avatar from the talent's image or likeness. Refer to [Get consent file from the avatar talent](custom-avatar-create.md#get-consent-file-from-the-avatar-talent) to learn the requirements for the consent statement video.
2121

2222
## Recording environment
2323

24-
- We recommend recording in a professional video shooting studio or a well-lit place with a clean background.
25-
- The background of the video should be clean, smooth, pure-colored, and a green screen is the best choice.
26-
- Ensure even and bright lighting on the actor's face, avoiding shadows on face or reflections on actor's glasses and clothes.
27-
- Camera requirement: A minimum of 1080-P resolution and 36 FPS.
28-
- Other devices: You can use a teleprompter to remind the script during recording but ensure it doesn't affect the actor's gaze towards the camera. Provide a seat if the avatar needs to be in a sitting position.
24+
We recommend recording in a professional video shooting studio or a well-lit place.
2925

30-
## Appearance of the actor
26+
### Background requirement
3127

32-
The custom text to speech avatar doesn't support customization of clothes or looks. Therefore, it's essential to carefully design and prepare the avatar's appearance when recording the training data. Consider the following tips:
28+
- If you need a commercial, multi-scene avatar, the background of the video should be clean, smooth, and pure-colored; a green screen is the best choice.
29+
- If your avatar only needs to be used in a single scene, you can record in that specific scene (such as your office), but the background then can't be removed or replaced later.
30+
- Tips about using a pure-colored background (such as green screen) in shooting:
31+
32+
| Dos | Don'ts |
33+
|-----------------------------------------|--------------------------------------------------------|
34+
| - Set a green screen behind the actor. If the avatar video shows the actor's full body, including the feet, also place a green screen under the feet, and make sure the back and floor green screens are completely connected.<br/>- The green screen should be flat and uniform in color.<br/>- The actor should stay 0.5 m – 1 m away from the back background.<br/>- Light the green screen properly to prevent shadows.<br/>- Keep the full outline of the actor within the edges of the green screen.| - The actor shouldn't stand too close to the green screen.<br/>- Avoid the actor's head and hands extending beyond the green screen when speaking.|
35+
36+
### Lighting requirement
37+
38+
- Ensure even and bright lighting on the actor's face, avoiding shadows on the face and reflections on the actor's glasses and clothes.
39+
- Minimize the impact of ambient light changes on the actor. For example, turn off any projector, close the curtains to avoid daylight changes, and use a stable artificial light source.
3340

34-
- The actor's hair should have a smooth and glossy surface, avoiding messy hair or backgrounds showing through the hair.
41+
### Devices
42+
43+
- Camera requirement: A minimum of 1080p resolution and 25 FPS (frames per second).
44+
- Once the lights and camera are set up, don't change their positions for the rest of the video shoot.
45+
- You can use a teleprompter to prompt the actor with the script during recording, but ensure it doesn't affect the actor's gaze toward the camera. Provide a seat if the avatar needs to be in a sitting position.
46+
- For half-length or seated digital avatars, provide a seat for the actor. If you don't want the chair to appear in the image, choose a simple chair.
47+
48+
## Appearance of the actor
3549

36-
- Avoid wearing clothing that is too similar to the background color or reflective materials like white shirts. Avoid clothing with obvious lines or items with logos and brand names you don't want to highlight.
50+
The custom text to speech avatar doesn't support customization of clothes or looks. Therefore, it's essential to carefully design and prepare the avatar's appearance when recording the training data. Consider the following tips:
3751

38-
- Ensure the actor's face is clearly visible, not obscured by hair, sunglasses, or accessories.
52+
| Categories | Dos | Don'ts |
53+
|------------|----------------------------|-------------------|
54+
| **Hair** | - The actor's hair should have a smooth and glossy surface.<br/>- Even the actor's bangs or stray hairs should have a clear and smooth outline.<br/>- Choose a hairstyle that is easy to keep consistent during the whole video recording. | - Avoid messy hair or backgrounds showing through the hair.<br/>- Don't let hair block the eyes or eyebrows.<br/>- Avoid shadows on the face caused by the hairstyle.<br/>- Avoid hairstyles that change too much during speech and body gestures. For example, a high ponytail may appear, disappear, and swing while the actor speaks. |
55+
| **Clothing** | - Pay attention to how the clothing sits, and make sure it doesn't change significantly while speaking. | - Avoid wearing clothing and accessories that are too loose, heavy, or complex, as they may affect the consistency of the clothing during speech and body gestures.<br/>- Avoid wearing clothing that is too similar to the background color, and avoid reflective materials like white shirts and translucent materials.<br/>- Avoid clothing with obvious lines or items with logos and brand names you don't want to highlight.<br/>- Avoid reflective elements such as metal belts, shiny leather shoes, and leather pants. |
56+
| **Face** | - Ensure the actor's face is clearly visible. | - Don't let hair, sunglasses, or accessories obscure the face. |
3957

4058
## What video clips to record
4159

4260
You need three types of basic video clips:
4361

4462
**Status 0 speaking:**
4563
- Status 0 represents the posture you can naturally maintain most of the time while speaking. For example, arms crossed in front of the body or hanging down naturally at the sides.
46-
- Maintain a front-facing pose with minimal body movement. The actor can nod slightly, but don't move the body too much.
64+
- Maintain a front-facing pose. The actor can move slightly to look relaxed, like moving the head or shoulders a little, but don't move the body too much.
4765
- Length: keep speaking in status 0 for 3-5 minutes.
66+
67+
**Samples of status 0 speaking:**
68+
69+
![Animated graphic depicting Lisa speaking in status 0, representing the posture naturally maintained while speaking.](media/status-0-lisa.gif)
70+
71+
![Animated graphic depicting Harry speaking in status 0, representing the posture naturally maintained while speaking.](media/status-0-harry.gif)
72+
73+
![Animated graphic depicting Lori speaking in status 0, representing the posture naturally maintained while speaking.](media/status-0-lori.gif)
4874

4975
**Naturally speaking:**
5076
- Actor speaks in status 0 but with natural hand gestures from time to time.
5177
- Hands should start from status 0 and return after making gestures.
5278
- Use natural and common gestures when speaking. Avoid meaningful gestures like pointing, applause, or thumbs up.
5379
- Length: Minimum 5 minutes, maximum 30 minutes in total. At least one piece of 5-minute continuous video recording is required. If recording multiple video clips, keep each clip under 10 minutes.
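The length rules for the naturally speaking clips can be checked before upload. The following is a minimal sketch, not part of the service: the function name `check_natural_speaking_clips` is hypothetical, durations are assumed to be given in minutes, and the "under 10 minutes" rule is interpreted here as applying only when more than one clip is recorded.

```python
def check_natural_speaking_clips(durations_min):
    """Return a list of problems with 'naturally speaking' clip lengths.

    durations_min: list of clip durations in minutes (hypothetical input;
    measure them with any tool you like).
    """
    problems = []
    total = sum(durations_min)

    # Total length: minimum 5 minutes, maximum 30 minutes.
    if total < 5:
        problems.append("total length is under 5 minutes")
    if total > 30:
        problems.append("total length is over 30 minutes")

    # At least one clip must contain 5 minutes of continuous recording.
    if not any(d >= 5 for d in durations_min):
        problems.append("no clip has at least 5 minutes of continuous recording")

    # Assumption: the per-clip limit applies only when recording multiple clips.
    if len(durations_min) > 1 and any(d >= 10 for d in durations_min):
        problems.append("with multiple clips, keep each clip under 10 minutes")

    return problems


if __name__ == "__main__":
    for clips in ([6, 4], [3, 3], [12, 6]):
        print(clips, check_natural_speaking_clips(clips) or "OK")
```

An empty list means the set of clips satisfies the length guidance as interpreted above.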
80+
81+
**Samples of natural speaking:**
82+
83+
![Animated graphic depicting sample of Lisa speaking in status 0 with natural hand gestures, representing the posture naturally maintained while speaking.](media/natural-lisa.gif)
84+
85+
![Animated graphic depicting sample of Harry speaking in status 0 with natural hand gestures, representing the posture naturally maintained while speaking.](media/natural-harry.gif)
86+
87+
![Animated graphic depicting sample of Lori speaking in status 0 with natural hand gestures, representing the posture naturally maintained while speaking.](media/natural-lori.gif)
5488

5589
**Silent status:**
56-
- Maintain status 0 but don't speak.
57-
- Maintain a smile or nodding head as if listening or waiting.
58-
- Length: 1 minute.
5990

60-
Here are more tips for recording video clips:
91+
This video clip is important if you build a real-time conversation with the custom avatar. The video clip is used as the main template for both speaking and listening status for a chatbot.
6192

62-
- Ensure all video clips are taken in the same conditions.
63-
- Mind facial expressions, which should be suitable for the avatar's use case. For example, look positive and be smile if the custom text to speech avatar will be used as customer service, and look professionally if the avatar will be used for news reporting.
64-
- Maintain eye gaze towards the camera, even when using a teleprompter
65-
- Return your body to status 0 when pausing speaking.
66-
- Speak on a self-chosen topic, and minor speech mistakes like miss a word or mispronounced are acceptable. If the actor misses a word or mispronounces something, just go back to status 0, pause for 3 seconds, and then continue speaking.
67-
- Consciously pause between sentences and paragraphs. When pausing, go back to the status 0 and close your lips.
68-
- Maintain high-quality audio, avoiding background noise, like other people's voice.
93+
- Maintain status 0 without speaking, but stay relaxed.
94+
- Even while remaining in status 0, don't stay completely still; you can move a little, but not too much. Act as if you're waiting.
95+
- Maintain a smile as if listening or waiting patiently.
96+
- Length: 1 minute.
97+
98+
**Samples of silent status:**
99+
100+
![Animated graphic depicting sample of Lisa maintaining silent status without speaking but still feeling relaxed.](media/silent-lisa.gif)
101+
102+
![Animated graphic depicting sample of Harry maintaining silent status without speaking but still feeling relaxed.](media/silent-harry.gif)
103+
104+
![Animated graphic depicting sample of Lori maintaining silent status without speaking but still feeling relaxed.](media/silent-lori.gif)
105+
106+
**Gestures (optional):**
107+
108+
Gesture video clips are optional. If you need to insert certain gestures while the avatar is speaking, follow these guidelines to record gesture videos. Gesture insertion is only enabled for batch mode avatars; real-time avatars don't support gesture insertion at this point. Each custom avatar model can support no more than 10 gestures.
109+
110+
**Gesture tips:**
111+
- Each gesture clip should be no longer than 10 seconds.
112+
- Gestures should start from status 0 and end with status 0; otherwise, the gesture clip can't be smoothly inserted into the avatar video.
113+
- The gesture clip only captures the body gestures; the actor doesn't have to speak while making gestures.
114+
- We recommend designing a list of gestures before recording; here are some examples of gesture video clips:
115+
116+
**Samples of gesture:**
117+
118+
| Gestures | Samples |
119+
|--------------------------------|------------------------|
120+
| Delivering sell link/promotion code | ![An animated graphic depicting sample of delivering sell link.](media/delivering-sell-link.gif) |
121+
| Introducing the product | ![An animated graphic depicting sample of introducing the product.](media/introducing-the-product.gif) |
122+
| Displaying the price (number from 1 to 10-fist-number with each hand) | Right hand ![An animated graphic depicting sample of displaying the price with right hand.](media/displaying-the-price-with-right-hand.gif) Left hand ![An animated graphic depicting sample of displaying the price with left hand.](media/displaying-the-price-with-left-hand.gif) |
123+
124+
High-quality avatar models are built from high-quality video recordings, including audio quality. Here are more tips for the actor's performance and for recording video clips:
125+
126+
| **Dos** | **Don'ts** |
127+
|---------|--------------|
128+
| - Ensure all video clips are taken in the same conditions.<br/>- When recording, plan the size and display area of the character you need so that the character is displayed on the screen appropriately.<br/>- The actor should be steady during the recording.<br/>- Mind facial expressions, which should be suitable for the avatar's use case. For example, look positive and smile if the custom text to speech avatar is used for customer service, and look professional if the avatar is used for news reporting.<br/>- Maintain eye gaze toward the camera, even when using a teleprompter.<br/>- Return your body to status 0 when pausing speaking.<br/>- Speak on a self-chosen topic. Minor speech mistakes, like missing or mispronouncing a word, are acceptable. If the actor misses a word or mispronounces something, just go back to status 0, pause for 3 seconds, and then continue speaking.<br/>- Consciously pause between sentences and paragraphs. When pausing, go back to status 0 and close your lips.<br/>- The audio should be clear and loud enough; bad audio quality impacts the training result.<br/>- Keep the shooting environment quiet. | - Don't adjust the camera parameters, focal length, position, or angle of view. Don't move the camera; keep the person's position, size, and angle consistent in the frame.<br/>- Characters that are too small may lead to a loss of image quality during post-processing. Characters that are too large may cause the screen to overflow during gestures and movements.<br/>- Don't make gestures that last too long or involve too much movement; for example, the actor's hands keep making gestures and never go back to status 0.<br/>- The actor's movements and gestures must not block the face.<br/>- Avoid small movements like licking lips, touching hair, talking sideways, constantly shaking the head during speech, and not closing the lips after speaking.<br/>- Avoid background noise; staff should avoid walking and talking during video recording.<br/>- Avoid recording other people's voices while the actor is speaking. |
69129

70130
## Data requirements
71131

72-
- Avatar training video recording file format: .mp4 or .mov.
73-
- Resolution: At least 1920x1080.
74-
- Frame rate per second: At least 25 FPS.
132+
Some basic processing of your video data helps model training efficiency, such as:
133+
134+
- Make sure that the character is in the middle of the screen and that the size and position stay consistent during video processing. Keep each video processing parameter, such as brightness and contrast, the same throughout.
135+
- The start and end of each clip should be in status 0; the actor should close their mouth, smile, and look ahead. The video should be continuous, not abrupt.
136+
137+
**Avatar training video recording file format:** .mp4 or .mov.
138+
139+
**Resolution:** At least 1920x1080.
140+
141+
**Frame rate per second:** At least 25 FPS.
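
The three data requirements above are easy to check automatically before submitting training video. The following is a minimal sketch under stated assumptions: `ClipInfo` and `check_clip` are hypothetical names, the width, height, and frame rate are assumed to have been read beforehand with a media inspection tool of your choice, and landscape orientation (width 1920 or more, height 1080 or more) is assumed.

```python
from dataclasses import dataclass
from pathlib import Path

# Thresholds taken from the data requirements in this article.
ALLOWED_EXTENSIONS = {".mp4", ".mov"}
MIN_WIDTH, MIN_HEIGHT = 1920, 1080
MIN_FPS = 25.0


@dataclass
class ClipInfo:
    """Hypothetical container for metadata you already extracted from a clip."""
    path: str
    width: int
    height: int
    fps: float


def check_clip(info):
    """Return a list of data-requirement problems for one clip."""
    problems = []
    if Path(info.path).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("file format must be .mp4 or .mov")
    # Assumption: landscape orientation.
    if info.width < MIN_WIDTH or info.height < MIN_HEIGHT:
        problems.append("resolution must be at least 1920x1080")
    if info.fps < MIN_FPS:
        problems.append("frame rate must be at least 25 FPS")
    return problems


if __name__ == "__main__":
    print(check_clip(ClipInfo("talent.mp4", 1920, 1080, 25.0)))  # no problems
    print(check_clip(ClipInfo("talent.avi", 1280, 720, 24.0)))   # three problems
```

A clip that returns an empty list meets the format, resolution, and frame rate minimums; the sketch doesn't assess recording quality, lighting, or audio.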
75142

76143
## Next steps
77144
