articles/ai-services/content-understanding/video/overview.md (6 additions, 8 deletions)
@@ -32,7 +32,7 @@ This format can drop straight into a vector store to enable an agent or RAG work
From there you can **customize the analyzer** for more fine-grained control of the output. You can define custom fields, segments, or enable face identification. Customization allows you to use the full power of generative models to extract deep insights from the visual and audio details of the video. For example, customization allows you to:
- Identify what products and brands are seen or mentioned in the video.
- - Segment a basketball video by different plays such as `offensive play`, `defensive play`, `free throw`.
+ - Segment a news broadcast into chapters based on the topics or news stories discussed.
- Use face identification to label speakers as executives, for example, `CEO John Doe`, `CFO Jane Smith`.
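Such a customization is expressed as an analyzer definition carrying a `fieldSchema`. A minimal sketch in Python follows; the property names (`baseAnalyzerId`, `segmentationMode`, `segmentationDefinition`) and the field entry are illustrative assumptions, not the service's exact schema:

```python
# Hypothetical analyzer definition for topic-based news segmentation.
# All property names below are illustrative assumptions; consult the
# Content Understanding reference for the exact schema.
news_analyzer = {
    "description": "Segment a news broadcast into chapters by story",
    "baseAnalyzerId": "prebuilt-videoAnalyzer",  # assumed property name
    "config": {
        "segmentationMode": "custom",            # assumed property name
        "segmentationDefinition": "one segment per news story",
    },
    "fieldSchema": {
        "fields": {
            "segmentTopic": {
                "type": "string",
                "description": "Topic of this news segment",
            },
        },
    },
}
```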
## Why use Content Understanding for video?
@@ -46,7 +46,7 @@ Content understanding for video has broad potential uses. For example, you can c
## Prebuilt video analyzer example
- With the prebuilt video analyzer, you can upload a video and get an immediately usable knowledge asset. The service packages every clip into both richly formatted Markdown and JSON. This process allows your search index or chat agent to ingest without custom glue code.
+ With the prebuilt video analyzer (prebuilt-videoAnalyzer), you can upload a video and get an immediately usable knowledge asset. The service packages every clip into both richly formatted Markdown and JSON. This process allows your search index or chat agent to ingest the output without custom glue code.
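As a rough sketch of how the call could look over REST (the URL shape, `api-version`, and auth header below follow common Azure AI services conventions but are assumptions; verify against the service reference):

```python
import json
import urllib.request

API_VERSION = "2024-12-01-preview"  # assumed preview version; check the docs

def build_analyze_url(endpoint: str, analyzer_id: str) -> str:
    """Assumed URL shape for the analyze operation."""
    return (f"{endpoint.rstrip('/')}/contentunderstanding/analyzers/"
            f"{analyzer_id}:analyze?api-version={API_VERSION}")

def start_video_analysis(endpoint: str, key: str, video_url: str) -> str:
    """Kick off analysis of a video at a reachable URL; returns the
    Operation-Location URL to poll for the Markdown + JSON result."""
    req = urllib.request.Request(
        build_analyze_url(endpoint, "prebuilt-videoAnalyzer"),
        data=json.dumps({"url": video_url}).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,  # assumed auth header
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # expect 202 Accepted
        return resp.headers["Operation-Location"]
```

The returned operation URL is then polled until the analysis completes.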
Calling prebuilt-video with no custom schema returns a document like the following (abridged) example:
```markdown
...
```
@@ -102,9 +102,9 @@ The service operates in two stages. The first stage, content extraction, involve
The first pass is all about extracting a first set of details—who's speaking, where are the cuts, and which faces recur. It creates a solid metadata backbone that later steps can reason over.
- * **Transcription:** Converts conversational audio into searchable and analyzable text-based transcripts in WebVTT format. Sentence-level and word-level timestamps are available upon request. Content Understanding supports the full set of Azure AI Speech speech-to-text languages. For languages with Fast transcriptions support and for files ≤ 300 MB and/or ≤ 2 hours, transcription time is reduced substantially. Additionally, the following transcription details are important to consider:
+ * **Transcription:** Converts conversational audio into searchable and analyzable text-based transcripts in WebVTT format. Sentence-level timestamps are available upon request. Content Understanding supports the full set of Azure AI Speech speech-to-text languages. For languages with Fast transcription support and for files ≤ 300 MB and/or ≤ 2 hours, transcription time is reduced substantially. Additionally, the following transcription details are important to consider:
* **Diarization:** Distinguishes between speakers in a conversation in the output, attributing parts of the transcript to specific speakers.
- * **Multilingual transcription:** Generates multilingual transcripts, applying language/locale per phrase. Deviating from language detection this feature is enabled when no language/locale is specified or language is set to `auto`.
107
+
* **Multilingual transcription:** Generates multilingual transcripts. Language/locale is applied per phrase in the transcriptPhrases output when `returnDetails=true` is set. Deviating from language detection this feature is enabled when no language/locale is specified or language is set to `auto`.
> [!WARNING]
> When multilingual transcription is used, a file with an unsupported locale still produces a result. This result is based on the closest matching locale and is most likely incorrect.
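Since transcripts come back as WebVTT, downstream code often needs the cue timings as numbers. A small sketch that parses the `hh:mm:ss.mmm --> hh:mm:ss.mmm` cue form (WebVTT also permits a shorter `mm:ss.mmm` form, which this sketch does not handle):

```python
import re

# WebVTT cue timing line, hours-form only: "00:01:02.500 --> 00:01:04.000"
CUE_TIME = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)

def cue_seconds(line: str):
    """Parse a WebVTT cue timing line into (start, end) seconds, else None."""
    m = CUE_TIME.match(line.strip())
    if not m:
        return None
    h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
    return (h1 * 3600 + m1 * 60 + s1 + ms1 / 1000,
            h2 * 3600 + m2 * 60 + s2 + ms2 / 1000)
```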
@@ -113,8 +113,6 @@ The first pass is all about extracting a first set of details—who's speaking,
* **Shot detection:** Identifies segments of the video aligned with shot boundaries where possible, allowing for precise editing and repackaging of content with breaks exactly on shot boundaries.
* **Key frame extraction:** Extracts key frames from videos to represent each shot completely, ensuring each shot has enough key frames to enable field extraction to work effectively.
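The service's key frame selection is internal, but the idea of representing each shot with enough frames can be sketched as evenly spaced timestamps within the shot boundaries (illustrative only, not the actual algorithm):

```python
def key_frame_times(shot_start_s: float, shot_end_s: float, n: int = 3):
    """Evenly spaced representative timestamps inside one shot.
    Centers each of n samples in its own slice of the shot."""
    span = shot_end_s - shot_start_s
    return [shot_start_s + span * (i + 0.5) / n for i in range(n)]
```

For a shot spanning 10 s to 16 s, three key frames would land at 11 s, 13 s, and 15 s.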
-
-
## Field extraction and segmentation
Next, the generative model layers meaning—tagging scenes, summarizing actions, and slicing footage into segments per your request. This action is where prompts turn into structured data.
@@ -127,7 +125,7 @@ Shape the output to match your business vocabulary. Use a `fieldSchema` object w
* **Media asset management:**
- * **Shot type:** Helps editors and producers organize content, simplifying editing, and understanding the visual language of the video. Useful for metadata tagging and quicker scene retrieval.
+ * **Video Category:** Helps editors and producers organize content by classifying it as News, Sports, Interview, Documentary, or Advertisement. Useful for metadata tagging and quicker content filtering and retrieval.
* **Color scheme:** Conveys mood and atmosphere, essential for narrative consistency and viewer engagement. Identifying color themes helps in finding matching clips for accelerated video editing.
* **Advertising:**
@@ -238,7 +236,7 @@ Specific limitations of video processing to keep in mind:
* **Frame sampling (~1 FPS):** The analyzer inspects about one frame per second. Rapid motions or single-frame events may be missed.
* **Frame resolution (512 × 512 px):** Sampled frames are resized to 512 pixels square. Small text or distant objects can be lost.
- * **Speech:** Only spoken words are transcribed. Music, sound effects, and ambient noise are ignored. Specific of supported locals are document.
+ * **Speech:** Only spoken words are transcribed. Music, sound effects, and ambient noise are ignored.
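A back-of-the-envelope way to reason about the ~1 FPS sampling limit (a simple model assuming a uniformly random sampling phase; this is not from the service documentation):

```python
def capture_probability(event_duration_s: float,
                        sample_interval_s: float = 1.0) -> float:
    """Chance that at least one sampled frame falls inside an on-screen
    event of the given duration, under a uniformly random sampling phase."""
    if event_duration_s <= 0:
        return 0.0
    return min(1.0, event_duration_s / sample_interval_s)

# Under this model, a half-second flash is caught only about
# half the time at ~1 FPS, while a 2-second event is always sampled.
```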