
Commit 839769c

Merge pull request #79663 from PanosPeriorellis/master
Update batch-transcription.md
2 parents 8623020 + 164b1fb commit 839769c

File tree

2 files changed: +40 −6 lines changed


articles/cognitive-services/Speech-Service/batch-transcription.md

Lines changed: 36 additions & 2 deletions
@@ -97,6 +97,40 @@ Polling for transcription status may not be the most performant, or provide the
For more details, see [Webhooks](webhooks.md).

## Speaker Separation (Diarization)

Diarization is the process of separating speakers in a piece of audio. The Batch pipeline supports diarization and can recognize two speakers on mono-channel recordings.

To request diarization for your audio transcription, add the relevant parameter to the HTTP request as shown below.

```json
{
  "recordingsUrl": "<URL to the Azure blob to transcribe>",
  "models": [{"Id":"<optional acoustic model ID>"},{"Id":"<optional language model ID>"}],
  "locale": "<locale to use, for example en-US>",
  "name": "<user defined name of the transcription batch>",
  "description": "<optional description of the transcription>",
  "properties": {
    "AddWordLevelTimestamps": "True",
    "AddDiarization": "True"
  }
}
```

Word-level timestamps must also be turned on, as the parameters in the request above indicate.

The transcription output identifies each speaker by a number. Currently only two voices are supported, so the speakers are labeled 'Speaker 1' and 'Speaker 2', followed by their transcription output.

Note that diarization is not available for stereo recordings. All JSON output contains the Speaker tag; if diarization is not used, the output shows 'Speaker: Null'.

Supported locales are listed below.

| Language | Locale |
|----------|--------|
| English  | en-US  |
| Chinese  | zh-CN  |
| German   | de-DE  |
## Sentiment

Sentiment is a new feature in the Batch Transcription API and is an important feature in the call center domain. Customers can use the `AddSentiment` parameters to their requests to
@@ -107,7 +141,7 @@ Sentiment is a new feature in Batch Transcription API and is an important featur
4. Pinpoint what went well when turning negative calls to positive
5. Identify what customers like and what they dislike about a product or a service

-Sentiment is scored per audio segment, where an audio segment is defined as the time lapse between the start of the utterance (offset) and the detection of silence at the end of the byte stream. The entire text within that segment is used to calculate sentiment. We DO NOT calculate any aggregate sentiment values for the entire call or the entire speech of each channel. These are left to the domain owner to further apply.
+Sentiment is scored per audio segment, where an audio segment is defined as the time lapse between the start of the utterance (offset) and the detection of silence at the end of the byte stream. The entire text within that segment is used to calculate sentiment. We DO NOT calculate any aggregate sentiment values for the entire call or the entire speech of each channel. These aggregations are left to the domain owner to further apply.

Sentiment is applied on the lexical form.
@@ -146,7 +180,7 @@ A JSON output sample looks like below:
    ]
}
```
-The features uses a Sentiment model which is currently in Beta.
+The feature uses a Sentiment model, which is currently in Beta.

## Sample code

articles/cognitive-services/Speech-Service/how-to-custom-speech-test-data.md

Lines changed: 4 additions & 4 deletions
@@ -133,11 +133,11 @@ If there are uncommon terms without standard pronunciations that your users will
This includes examples of a spoken utterance, and a custom pronunciation for each:

-| Spoken form | Recognized/displayed form |
+| Recognized/displayed form | Spoken form |
 |--------------|--------------------------|
-| three c p o | 3CPO |
-| c n t k | CNTK |
-| i triple e | IEEE |
+| 3CPO | three c p o |
+| CNTK | c n t k |
+| IEEE | i triple e |

The spoken form is the phonetic sequence spelled out. It can be composed of letters, words, syllables, or a combination of all three.
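The reordered table maps a displayed form to its spoken form. As a rough sketch of how such pairs could be serialized for upload, the snippet below emits one tab-separated "display form / spoken form" pair per line; the exact Custom Speech upload format is defined in its own docs, so treat this layout as an assumption.

```python
# Display-form -> spoken-form pairs, taken from the table above.
pronunciations = {
    "3CPO": "three c p o",
    "CNTK": "c n t k",
    "IEEE": "i triple e",
}

def to_pronunciation_file(entries):
    """Serialize pairs as 'display<TAB>spoken', one pair per line (assumed layout)."""
    return "\n".join(f"{display}\t{spoken}" for display, spoken in entries.items())

print(to_pronunciation_file(pronunciations))
```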

0 commit comments
