
Commit 251b955

Updating multilingual transcription note aligned with audio.
1 parent fd96efc commit 251b955

File tree

1 file changed: +17 additions, −16 deletions
  • articles/ai-services/content-understanding/video


articles/ai-services/content-understanding/video/overview.md

Lines changed: 17 additions & 16 deletions
@@ -20,6 +20,8 @@ ms.date: 05/19/2025
 
 Azure AI Content Understanding allows you to generate a standard set of video metadata and create custom fields for your specific use case using the power of generative models. Content Understanding helps efficiently manage, categorize, retrieve, and build workflows for video assets. It enhances your media asset library, supports workflows such as highlight generation, categorizes content, and facilitates applications like retrieval-augmented generation (RAG).
 
+:::image type="content" source="../media/video/video-processing-flow.png" alt-text="Illustration of the Content Understanding video processing flow.":::
+
 The **pre-built video analyzer** outputs RAG-ready Markdown that includes:
 
 - **Transcript:** Inline transcripts in standard WEBVTT format
@@ -32,7 +34,7 @@ This format can drop straight into a vector store to enable an agent or RAG work
 From there you can **customize the analyzer** for more fine-grained control of the output. You can define custom fields, segments, or enable face identification. Customization allows you to use the full power of generative models to extract deep insights from the visual and audio details of the video. For example, customization allows you to:
 
 - Identify what products and brands are seen or mentioned in the video.
-- Segment a basketball video by different plays such as `offensive play`, `defensive play`, `free throw`.
+- Segment a news broadcast into chapters based on the topics or news stories discussed.
 - Use face identification to label speakers as executives, for example, `CEO John Doe`, `CFO Jane Smith`.
 
 ## Why use Content Understanding for video?
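The segmentation customization added in this hunk could be sketched with a minimal analyzer definition. This is an illustrative sketch only: the `fieldSchema` object is named in the article, but the exact payload shape, the `baseAnalyzerId` value, and the field names here are assumptions, not the documented API contract.

```python
# Hypothetical analyzer definition; property names other than
# `fieldSchema` (named in the article) are illustrative assumptions.
news_analyzer = {
    "description": "Segment a news broadcast into topic-based chapters",
    "baseAnalyzerId": "prebuilt-videoAnalyzer",  # assumed base analyzer id
    "fieldSchema": {
        "fields": {
            "segmentTopic": {
                "type": "string",
                "description": "Topic or news story discussed in this segment",
            },
            "speakerRole": {
                "type": "string",
                "description": "Role of the main speaker, for example 'CEO John Doe'",
            },
        }
    },
}
```

A definition along these lines would pair each generated segment with the custom fields, which is the pattern the bullet list above describes.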
@@ -46,7 +48,7 @@ Content understanding for video has broad potential uses. For example, you can c
 
 ## Prebuilt video analyzer example
 
-With the prebuilt video analyzer, you can upload a video and get an immediately usable knowledge asset. The service packages every clip into both richly formatted Markdown and JSON. This process allows your search index or chat agent to ingest without custom glue code.
+With the prebuilt video analyzer (`prebuilt-videoAnalyzer`), you can upload a video and get an immediately usable knowledge asset. The service packages every clip into both richly formatted Markdown and JSON, so your search index or chat agent can ingest the output without custom glue code.
 Calling prebuilt-video with no custom schema returns a document like the following (abridged) example:
 
 ```markdown
@@ -102,19 +104,18 @@ The service operates in two stages. The first stage, content extraction, involve
 
 The first pass is all about extracting a first set of details—who's speaking, where are the cuts, and which faces recur. It creates a solid metadata backbone that later steps can reason over.
 
-* **Transcription:** Converts conversational audio into searchable and analyzable text-based transcripts in WebVTT format. Sentence-level and word-level timestamps are available upon request. Content Understanding supports the full set of Azure AI Speech speech-to-text languages. For languages with Fast transcriptions support and for files ≤ 300 MB and/or ≤ 2 hours, transcription time is reduced substantially. Additionally, the following transcription details are important to consider:
+* **Transcription:** Converts conversational audio into searchable and analyzable text-based transcripts in WebVTT format. Sentence-level timestamps are available if `returnDetails=true` is set. Content Understanding supports the full set of Azure AI Speech speech-to-text languages. For more information on supported languages, see [Language and region support](../language-region-support.md#language-support). The following transcription details are important to consider:
+
   * **Diarization:** Distinguishes between speakers in a conversation in the output, attributing parts of the transcript to specific speakers.
-  * **Multilingual transcription:** Generates multilingual transcripts, applying language/locale per phrase. Deviating from language detection this feature is enabled when no language/locale is specified or language is set to `auto`.
+  * **Multilingual transcription:** Generates multilingual transcripts, applying a language/locale per phrase. Phrases are output when `returnDetails=true` is set. Unlike language detection, this feature is enabled when no language/locale is specified or the language is set to `auto`.
+
+    > [!NOTE]
+    > When multilingual transcription is used, a file with an unsupported locale produces a result based on the closest supported locale, which is likely incorrect. This is a known
+    > behavior. To avoid transcription quality issues, configure locales whenever you aren't using a locale that multilingual transcription supports.
 
-  > [!WARNING]
-  > When multilingual transcription is used, a file with an unsupported locale still produces a result. This result is based on the closest locale but most likely not correct.
-  > This transcription behavior is known. Make sure to configure locales when not using multilingual transcription!
-
 * **Shot detection:** Identifies segments of the video aligned with shot boundaries where possible, allowing for precise editing and repackaging of content with breaks exactly on shot boundaries.
 * **Key frame extraction:** Extracts key frames from videos to represent each shot completely, ensuring each shot has enough key frames to enable field extraction to work effectively.
 
-
-
 ## Field extraction and segmentation
 
 Next, the generative model layers meaning—tagging scenes, summarizing actions, and slicing footage into segments per your request. This action is where prompts turn into structured data.
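The locale rule stated in this hunk can be made concrete with a small sketch. Assumed here: a plain dict stands in for the analyzer configuration, carrying the `returnDetails` and locale options the text names; the payload shape itself is hypothetical, not the documented API contract.

```python
# Sketch of the multilingual-transcription rule described above; the config
# dict shape is a hypothetical stand-in, not the real API payload.
def transcription_mode(config: dict) -> str:
    """Multilingual transcription applies when no language/locale is
    specified or the language is set to 'auto'; otherwise the configured
    locale is used as-is."""
    locales = config.get("locales") or []
    if not locales or locales == ["auto"]:
        return "multilingual"
    return "fixed-locale"

auto_config = {"returnDetails": True, "locales": []}           # multilingual
pinned_config = {"returnDetails": True, "locales": ["en-US"]}  # fixed locale
```

Pinning a locale, as in `pinned_config`, is the workaround the note recommends for audio in a locale that multilingual transcription doesn't support.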
@@ -127,7 +128,7 @@ Shape the output to match your business vocabulary. Use a `fieldSchema` object w
 
 * **Media asset management:**
 
-  * **Shot type:** Helps editors and producers organize content, simplifying editing, and understanding the visual language of the video. Useful for metadata tagging and quicker scene retrieval.
+  * **Video category:** Helps editors and producers organize content by classifying it as News, Sports, Interview, Documentary, Advertisement, and so on. Useful for metadata tagging and quicker content filtering and retrieval.
   * **Color scheme:** Conveys mood and atmosphere, essential for narrative consistency and viewer engagement. Identifying color themes helps in finding matching clips for accelerated video editing.
 
 * **Advertising:**
@@ -201,12 +202,12 @@ Content Understanding offers three ways to slice a video, letting you get the ou
 Face identification description is an add-on that provides context to content extraction and field extraction using face information.
 
 > [!NOTE]
->
-> Face features incur additional cost. This feature is limited access and involves face identification and grouping; customers need to register for access at Face Recognition.
+>
+> This feature is limited access and involves face identification and grouping; customers need to register for access at [Face Recognition](https://aka.ms/facerecognition). Face features incur added costs.
 
 ### Content extraction: grouping and identification
 
-The face add-on enables grouping and identification as output from the content extraction section.
+The face add-on enables grouping and identification as output from the content extraction section. To enable face capabilities, set `enableFace=true` in the analyzer configuration.
 
 * **Grouping:** Groups faces appearing in a video to extract one representative face image for each person, and provides segments where each one is present. The grouped face data is available as metadata and can be used to generate customized metadata fields when `returnDetails: true` is set for the analyzer.
 * **Identification:** Labels individuals in the video with names based on a Face API person directory. Customers can enable this feature by supplying the name of a Face API directory in the current resource in the `personDirectoryId` property of the analyzer.
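A minimal sketch of enabling the face add-on, using only the `enableFace`, `returnDetails`, and `personDirectoryId` properties named in this hunk. The surrounding payload shape and the directory name are assumptions for illustration, not the documented contract.

```python
# Hypothetical face add-on configuration; only the property names
# (`enableFace`, `returnDetails`, `personDirectoryId`) come from the article.
face_analyzer_config = {
    "enableFace": True,        # turn on face grouping and identification
    "returnDetails": True,     # expose grouped face data as metadata
    "personDirectoryId": "contoso-person-directory",  # assumed directory name
}
```

Omitting `personDirectoryId` would, per the text, leave you with grouping only; identification needs a Face API person directory to match against.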
@@ -238,7 +239,7 @@ Specific limitations of video processing to keep in mind:
 
 * **Frame sampling (~1 FPS)**: The analyzer inspects about one frame per second. Rapid motions or single-frame events may be missed.
 * **Frame resolution (512 × 512 px)**: Sampled frames are resized to 512 pixels square. Small text or distant objects can be lost.
-* **Speech**: Only spoken words are transcribed. Music, sound effects, and ambient noise are ignored. Specific of supported locals are document.
+* **Speech**: Only spoken words are transcribed. Music, sound effects, and ambient noise are ignored.
 
 ## Input requirements
 
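The ~1 FPS sampling limitation above implies that an event shorter than the sampling interval can fall entirely between sampled frames. A small illustrative check, assuming evenly spaced samples at the approximate rate the article states:

```python
import math

SAMPLE_INTERVAL_S = 1.0  # approximate analyzer sampling rate (~1 FPS)

def may_be_missed(event_start: float, event_end: float,
                  interval: float = SAMPLE_INTERVAL_S) -> bool:
    """True if no sample timestamp (0, interval, 2*interval, ...) lands
    inside the [event_start, event_end] window."""
    first_sample_at_or_after_start = math.ceil(event_start / interval) * interval
    return first_sample_at_or_after_start > event_end
```

For example, a 0.3-second flash from t=1.2s to t=1.5s falls between the 1s and 2s samples, so it may never be inspected.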
@@ -257,7 +258,7 @@ See [Language and region support](../language-region-support.md).
 As with all Azure AI services, review Microsoft's [Data, protection, and privacy](https://www.microsoft.com/trust-center/privacy) documentation.
 
 > [!IMPORTANT]
->
+>
 > If you process **Biometric Data** (for example, enable **Face Grouping** or **Face Identification**), you must meet all notice, consent, and deletion requirements under GDPR or other applicable laws. See [Data and Privacy for Face](/legal/cognitive-services/face/data-privacy-security).
 
