articles/azure-video-indexer/transcription-translation-lid.md
+11 -2 Lines changed: 11 additions & 2 deletions
Original file line number
Diff line number
Diff line change
@@ -17,7 +17,8 @@ Azure AI Video Indexer transcription, translation and language identification au
17
17
- Azure AI Video Indexer processes the speech in the audio file to extract the transcription, which is then translated into many languages. When you choose to translate into a specific language, both the transcription and insights such as keywords, topics, labels, or OCR are translated into that language. Transcription can be used as is or combined with speaker insights that map and assign the transcript lines to speakers. Multiple speakers can be detected in an audio file. An ID is assigned to each speaker and is displayed under their transcribed speech.
18
18
- Azure AI Video Indexer language identification (LID) automatically recognizes the supported dominant spoken language in the video file. For more information, see [Applying LID](/azure/azure-video-indexer/language-identification-model).
19
19
- Azure AI Video Indexer multi-language identification (MLID) automatically recognizes the spoken languages in different segments of the audio file and sends each segment to be transcribed in its identified language. At the end of this process, all transcriptions are combined into the same file. For more information, see [Applying MLID](/azure/azure-video-indexer/multi-language-identification-transcription).
20
-
The resulting insights are generated in a categorized list in a JSON file that includes the ID, language, transcribed text, duration and confidence score.
20
+
The resulting insights are generated in a categorized list in a JSON file that includes the ID, language, transcribed text, duration and confidence score.
21
+
- When indexing media files with multiple speakers, Azure AI Video Indexer performs speaker diarization, which identifies each speaker in a video and attributes each transcribed line to a speaker. Each speaker is given a unique identity such as Speaker #1 and Speaker #2. This allows speakers to be told apart during conversations and can be useful in a variety of scenarios such as doctor-patient conversations, agent-customer interactions, and court proceedings.
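
As a rough illustration of how diarized transcript lines might be consumed, the following Python sketch groups transcribed text by speaker. The field names (`transcript`, `speakerId`, `text`, `confidence`, `language`) and the sample data are assumptions chosen to mirror the fields listed above; check them against the JSON your account actually returns.

```python
import json
from collections import defaultdict

# Hypothetical sample; the key names are assumptions, not a documented schema.
SAMPLE = """
{"transcript": [
  {"id": 1, "text": "How are you feeling today?", "confidence": 0.93, "speakerId": 1, "language": "en-US"},
  {"id": 2, "text": "Much better, thank you.", "confidence": 0.88, "speakerId": 2, "language": "en-US"},
  {"id": 3, "text": "Good to hear.", "confidence": 0.90, "speakerId": 1, "language": "en-US"}
]}
"""

def lines_by_speaker(insights: dict) -> dict[str, list[str]]:
    """Group transcript text under labels such as 'Speaker #1'."""
    grouped = defaultdict(list)
    for line in insights.get("transcript", []):
        label = f"Speaker #{line['speakerId']}"
        grouped[label].append(line["text"])
    return dict(grouped)

grouped = lines_by_speaker(json.loads(SAMPLE))
for speaker, texts in grouped.items():
    print(speaker, texts)
```

Grouping by label keeps the speaker's lines in transcript order, which is usually what a conversation view needs.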
21
22
22
23
## Prerequisites
23
24
@@ -120,7 +121,15 @@ When used responsibly and carefully, Azure AI Video Indexer is a valuable tool f
120
121
- Provide a feedback channel that allows users and individuals to report issues with the service.
121
122
- Be aware of any applicable laws or regulations that exist in your area regarding processing, analyzing, and sharing media containing people.
122
123
- Keep a human in the loop. Don't use any solution as a replacement for human oversight and decision-making.
123
-
- Fully examine and review the potential of any AI model you're using to understand its capabilities and limitations.
124
+
- Fully examine and review the potential of any AI model you're using to understand its capabilities and limitations.
125
+
- Azure AI Video Indexer doesn't perform speaker recognition, so speakers aren't assigned a consistent identifier across multiple files. You can't search for an individual speaker across multiple files or transcripts.
126
+
- Speaker identifiers are assigned randomly and can only be used to distinguish different speakers in a single file.
127
+
- Cross-talk and overlapping speech: When multiple speakers talk simultaneously or interrupt each other, it becomes challenging for the model to accurately distinguish and assign the correct text to the corresponding speakers.
128
+
- Speaker overlaps: Speakers with similar speech patterns, accents, or vocabulary can be difficult for the model to differentiate.
129
+
- Noisy audio: Poor audio quality, background noise, or low-quality recordings can hinder the model's ability to correctly identify speakers and transcribe their speech.
130
+
- Emotional speech: Emotional variations in speech, such as shouting, crying, or extreme excitement, can affect the model's ability to accurately diarize speakers.
131
+
- Speaker disguise or impersonation: If a speaker intentionally tries to imitate or disguise their voice, the model might misidentify the speaker.
132
+
- Ambiguous speaker identification: Some segments of speech may not have enough unique characteristics for the model to confidently attribute to a specific speaker.
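
Two of the points above can be sketched in code: speaker IDs are only meaningful within one file, and uncertain output benefits from human review. The Python sketch below scopes each speaker ID to its file and flags low-confidence lines for a reviewer. The `TranscriptLine` type, the file IDs, and the 0.8 threshold are all hypothetical. The confidence score is used here only as a rough triage signal, not as a measure of diarization accuracy.

```python
from typing import NamedTuple

class TranscriptLine(NamedTuple):
    # Hypothetical record type for illustration; not a Video Indexer type.
    file_id: str
    speaker_id: int
    text: str
    confidence: float

def scoped_speaker(line: TranscriptLine) -> tuple[str, int]:
    """Speaker IDs are per file, so pair them with the file ID before comparing."""
    return (line.file_id, line.speaker_id)

def needs_review(lines: list[TranscriptLine], threshold: float = 0.8) -> list[TranscriptLine]:
    """Return lines below an assumed confidence threshold for human review."""
    return [line for line in lines if line.confidence < threshold]

lines = [
    TranscriptLine("video-a", 1, "Please state your name.", 0.95),
    TranscriptLine("video-a", 2, "[overlapping speech]", 0.41),
    TranscriptLine("video-b", 1, "Thanks for calling.", 0.90),
]

flagged = needs_review(lines)
distinct_speakers = {scoped_speaker(line) for line in lines}
print(len(flagged), len(distinct_speakers))
```

Here Speaker 1 in `video-a` and Speaker 1 in `video-b` are deliberately treated as different people, because nothing links identifiers across files.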
124
133
125
134
For more information, see the guidelines and limitations in [language detection and transcription](/azure/azure-video-indexer/multi-language-identification-transcription).