
Commit 36d15c2

Merge pull request #4990 from laujan/joe-4973-video-overview
Joe 4973 video overview
2 parents 5719dd7 + b55fade


articles/ai-services/content-understanding/video/overview.md

Lines changed: 67 additions & 57 deletions
@@ -31,11 +31,13 @@ The **pre-built video analyzer** outputs RAG-ready Markdown that includes:
This format can drop straight into a vector store to enable agent or RAG workflows with no post-processing required.

From there you can **customize the analyzer** for more fine-grained control of the output. You can define custom fields, generate custom segments, or enable face identification. Customization allows you to use the full power of generative models to extract deep insights from the visual and audio details of the video. For example, customization allows you to do the following (combined in the sketch after this list):

- **Define custom fields:** identify what products and brands are seen or mentioned in the video.
- **Generate custom segments:** divide a news broadcast into chapters based on the topics or news stories discussed.
- **Identify people using a person directory:** label conference speakers in footage using face identification, for example, `CEO John Doe`, `CFO Jane Smith`.
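As a minimal sketch, a customized analyzer combining custom fields with custom segmentation might look like the following. The `fieldSchema`, `segmentationMode`, and `segmentationDefinition` properties appear later in this article; the exact nesting (whether segmentation settings live under `config`) and the per-field shape (`type`, `description`) are assumptions for illustration, not a definitive schema:

```jsonc
{
  "baseAnalyzerId": "prebuilt-videoAnalyzer",
  "config": {
    "segmentationMode": "custom",   // see "Segmentation mode" later in this article
    "segmentationDefinition": "news broadcast divided by individual stories"
  },
  "fieldSchema": {
    "description": "Extract brand presence per segment",
    "fields": {
      "brandsMentioned": {          // hypothetical field name
        "type": "string",           // assumed field shape
        "description": "Products and brands seen or mentioned in the segment"
      }
    }
  }
}
```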
## Why use Content Understanding for video?

@@ -49,39 +51,53 @@ Content understanding for video has broad potential uses. For example, you can c
## Prebuilt video analyzer example

With the prebuilt video analyzer (`prebuilt-videoAnalyzer`), you can upload a video and get an immediately usable knowledge asset. The service packages every clip into both richly formatted Markdown and JSON, so your search index or chat agent can ingest the output without custom glue code.

* For example, create the base `prebuilt-videoAnalyzer` as follows:

  ```jsonc
  {
    "config": {},
    "baseAnalyzerId": "prebuilt-videoAnalyzer"
  }
  ```

* Next, analyzing a 30-second advertising video would result in the following output:

  ```markdown
  # Video: 00:00.000 => 00:30.000
  Width: 1280
  Height: 720

  ## Segment 1: 00:00.000 => 00:06.000
  A lively room filled with people is shown, where a group of friends is gathered around a television. They are watching a sports event, possibly a football match, as indicated by the decorations and the atmosphere.

  Transcript

  WEBVTT

  00:03.600 --> 00:06.000
  <Speaker 1>Get new years ready.

  Key Frames
  - 00:00.600 ![](keyFrame.600.jpg)
  - 00:01.200 ![](keyFrame.1200.jpg)

  ## Segment 2: 00:06.000 => 00:10.080
  The scene transitions to a more vibrant and energetic setting, where the group of friends is now celebrating. The room is decorated with football-themed items, and everyone is cheering and enjoying the moment.

  Transcript

  WEBVTT

  00:03.600 --> 00:06.000
  <Speaker 1>Go team!

  Key Frames
  - 00:06.200 ![](keyFrame.6200.jpg)
  - 00:07.080 ![](keyFrame.7080.jpg)

  *…additional data omitted for brevity…*
  ```

## Walk-through

@@ -104,17 +120,17 @@ The service operates in two stages. The first stage, content extraction, involve

The first pass is all about extracting a first set of details—who's speaking, where are the cuts, and which faces recur. It creates a solid metadata backbone that later steps can reason over.

* **Transcription:** Converts conversational audio into searchable and analyzable text-based transcripts in WebVTT format. Sentence-level timestamps are available if `"returnDetails": true` is set. Content Understanding supports the full set of Azure AI Speech speech-to-text languages. Language support for video is the same as for audio; *see* [Audio Language Handling](../audio/overview.md#language-handling) for details. The following transcription details are important to consider:

  * **Diarization:** Distinguishes between speakers in a conversation, attributing parts of the transcript to specific speakers.
  * **Multilingual transcription:** Generates multilingual transcripts. Language/locale is applied per phrase in the transcript, and phrases are output when `"returnDetails": true` is set. Unlike language detection, this feature is enabled when no language/locale is specified or the language is set to `auto`.

  > [!NOTE]
  > When multilingual transcription is used, any files with unsupported locales produce a result based on the closest supported locale, which is likely incorrect. This result is a known behavior. Avoid transcription quality issues by configuring locales when you aren't using a locale supported by multilingual transcription.

* **Key frame extraction:** Extracts key frames from videos to represent each shot completely, ensuring each shot has enough key frames to enable field extraction to work effectively.
* **Shot detection:** Identifies segments of the video aligned with shot boundaries where possible, allowing for precise editing and repackaging of content with breaks exactly on existing edits. The output is a list of timestamps in milliseconds in `cameraShotTimesMs`, returned only when `"returnDetails": true` is set (see the configuration sketch after this list).
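As a minimal sketch, an analyzer opting into this detailed content-extraction output might be configured as follows; `"returnDetails": true` comes from this article, while the `locales` property name is an assumption for illustration:

```jsonc
{
  "baseAnalyzerId": "prebuilt-videoAnalyzer",
  "config": {
    "returnDetails": true,         // sentence-level timestamps, cameraShotTimesMs, grouped face metadata
    "locales": ["en-US", "es-ES"]  // assumed property: pin locales when not relying on multilingual auto-detection
  }
}
```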

## Field extraction and segmentation

@@ -139,6 +155,7 @@ Shape the output to match your business vocabulary. Use a `fieldSchema` object w
**Example:**

```jsonc
"fieldSchema": {
  "description": "Extract brand presence and sentiment per scene",
  "fields": {
    // … field definitions unchanged in this diff are omitted …
  }
}
```

### Segmentation mode

> [!NOTE]
>
> Setting segmentation triggers field extraction even if no fields are defined.

Content Understanding offers three ways to slice a video, letting you get the output you need for whole videos or short clips. You can use these options by setting the `segmentationMode` property on a custom analyzer.

@@ -186,43 +206,39 @@ Content Understanding offers three ways to slice a video, letting you get the ou
**Example:**

* Break a news broadcast up into stories.

  ```jsonc
  {
    "segmentationMode": "custom",
    "segmentationDefinition": "news broadcasts divided by individual stories"
  }
  ```

## Face identification description add-on
> [!NOTE]
>
> This feature is limited access and involves face identification and grouping; customers need to register for access at [Face Recognition](https://aka.ms/facerecognition). Face features incur added costs.

Face identification description is an add-on that provides context to content extraction and field extraction using face information.

### Content extraction – Grouping and identification

The face add-on enables grouping and identification as output from the content extraction section. To enable face capabilities, set `"enableFace": true` in the analyzer configuration.

* **Grouping:** Groups faces appearing in a video to extract one representative face image for each person and provides segments where each one is present. The grouped face data is available as metadata and can be used to generate customized metadata fields when `"returnDetails": true` is set for the analyzer.
* **Identification:** Labels individuals in the video with names based on a Face API person directory. Enable this feature by supplying the name of a Face API directory in the current resource in the `personDirectoryId` property of the analyzer. To use this capability, you must first create a person directory and then reference it in the analyzer (a configuration sketch follows this list). For details, *see* [How to build a person directory](../../content-understanding/tutorial/build-person-directory.md).
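A rough sketch of a face-enabled analyzer follows. Whether `enableFace` and `personDirectoryId` nest under `config` isn't confirmed by this article, and `myPersonDirectory` is a hypothetical directory name:

```jsonc
{
  "baseAnalyzerId": "prebuilt-videoAnalyzer",
  "config": {
    "returnDetails": true,                    // surface grouped face data as metadata
    "enableFace": true,                       // limited access; register at https://aka.ms/facerecognition
    "personDirectoryId": "myPersonDirectory"  // hypothetical: created via the person directory tutorial
  }
}
```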

### Field extraction – Face description

The field extraction capability is enhanced by providing detailed descriptions of identified faces in the video. This capability includes attributes such as facial hair, emotions, and the presence of celebrities, which can be crucial for various analytical and indexing purposes. To enable face description capabilities, set `"disableFaceBlurring": true` in the analyzer configuration.

**Examples:**

* **Example field: emotionDescription:** Provides a description of the emotional state of the primary person in the clip (for example, `happy`, `sad`, `angry`).
* **Example field: facialHairDescription:** Describes the type of facial hair (for example, `beard`, `mustache`, `clean-shaven`). A sketch of how such fields might be declared follows.
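For illustration, these example fields might be declared in a `fieldSchema` like the following sketch; the per-field shape (`type`, `description`) mirrors the earlier brand example but is an assumption, not the definitive schema:

```jsonc
"fieldSchema": {
  "description": "Describe faces surfaced by the face add-on",
  "fields": {
    "emotionDescription": {
      "type": "string",  // assumed field shape
      "description": "Emotional state of the primary person in the clip (for example, happy, sad, angry)"
    },
    "facialHairDescription": {
      "type": "string",
      "description": "Type of facial hair (for example, beard, mustache, clean-shaven)"
    }
  }
}
```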

## Key benefits

Content Understanding provides several key benefits when compared to other video analysis solutions:
@@ -245,14 +261,10 @@ Specific limitations of video processing to keep in mind:

For supported formats, see [Service quotas and limits](../service-limits.md).

## Supported languages and regions

See [Language and region support](../language-region-support.md).

## Data privacy and security

As with all Azure AI services, review Microsoft's [Data protection and privacy](https://www.microsoft.com/trust-center/privacy) documentation.
@@ -261,8 +273,6 @@ As with all Azure AI services, review Microsoft's [Data, protection, and privacy
>
> If you process **Biometric Data** (for example, enable **Face Grouping** or **Face Identification**), you must meet all notice, consent, and deletion requirements under GDPR or other applicable laws. See [Data and Privacy for Face](/legal/cognitive-services/face/data-privacy-security).
## Next steps
* Process videos in the [Azure AI Foundry portal](https://aka.ms/cu-landing).
