
Commit 839769c

Merge pull request #79663 from PanosPeriorellis/master
Update batch-transcription.md
2 parents 8623020 + 164b1fb commit 839769c

File tree

2 files changed: +40 −6 lines changed


articles/cognitive-services/Speech-Service/batch-transcription.md

Lines changed: 36 additions & 2 deletions
@@ -97,6 +97,40 @@ Polling for transcription status may not be the most performant, or provide the
For more details, see [Webhooks](webhooks.md).

## Speaker Separation (Diarization)

Diarization is the process of separating speakers in a piece of audio. The Batch pipeline supports diarization and can recognize two speakers on mono-channel recordings.

To request diarization for your audio transcription, add the relevant parameter to the HTTP request as shown below.

```json
{
  "recordingsUrl": "<URL to the Azure blob to transcribe>",
  "models": [{"Id":"<optional acoustic model ID>"},{"Id":"<optional language model ID>"}],
  "locale": "<locale to use, for example en-US>",
  "name": "<user defined name of the transcription batch>",
  "description": "<optional description of the transcription>",
  "properties": {
    "AddWordLevelTimestamps": "True",
    "AddDiarization": "True"
  }
}
```

Word-level timestamps must also be turned on, as the parameters in the request above indicate.

The transcription output identifies each speaker by a number. Currently only two voices are supported, so the speakers are labeled 'Speaker 1' and 'Speaker 2', followed by their transcription output.

Note that diarization is not available for stereo recordings. All JSON output contains the Speaker tag; if diarization is not used, the output shows 'Speaker: Null'.

Supported locales are listed below.

| Language | Locale |
|----------|--------|
| English  | en-US  |
| Chinese  | zh-CN  |
| German   | de-DE  |
## Sentiment

Sentiment is a new feature in the Batch Transcription API and is an important feature in the call center domain. Customers can use the `AddSentiment` parameters to their requests to
@@ -107,7 +141,7 @@ Sentiment is a new feature in Batch Transcription API and is an important featur
4. Pinpoint what went well when turning negative calls to positive
5. Identify what customers like and what they dislike about a product or a service

-Sentiment is scored per audio segment, where an audio segment is defined as the time lapse between the start of the utterance (offset) and the detection of silence at the end of the byte stream. The entire text within that segment is used to calculate sentiment. We DO NOT calculate any aggregate sentiment values for the entire call or the entire speech of each channel. These are left to the domain owner to further apply.
+Sentiment is scored per audio segment, where an audio segment is defined as the time lapse between the start of the utterance (offset) and the detection of silence at the end of the byte stream. The entire text within that segment is used to calculate sentiment. We DO NOT calculate any aggregate sentiment values for the entire call or the entire speech of each channel. These aggregations are left to the domain owner to further apply.

Sentiment is applied on the lexical form.
@@ -146,7 +180,7 @@ A JSON output sample looks like below:
    ]
}
```
-The features uses a Sentiment model which is currently in Beta.
+The feature uses a Sentiment model, which is currently in Beta.

## Sample code

articles/cognitive-services/Speech-Service/how-to-custom-speech-test-data.md

Lines changed: 4 additions & 4 deletions
@@ -133,11 +133,11 @@ If there are uncommon terms without standard pronunciations that your users will
This includes examples of a spoken utterance, and a custom pronunciation for each:

-| Spoken form | Recognized/displayed form |
+| Recognized/displayed form | Spoken form |
 |--------------|--------------------------|
-| three c p o | 3CPO |
-| c n t k | CNTK |
-| i triple e | IEEE |
+| 3CPO | three c p o |
+| CNTK | c n t k |
+| IEEE | i triple e |

The spoken form is the phonetic sequence spelled out. It can be composed of letters, words, syllables, or a combination of all three.
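The reordered table maps a displayed form to its spoken form. As a rough sketch of how such pairs could be serialized for upload, the snippet below emits one tab-separated "display form / spoken form" pair per line; the exact Custom Speech upload format is defined in its own docs, so treat this layout as an assumption.

```python
# Display-form -> spoken-form pairs, taken from the table above.
pronunciations = {
    "3CPO": "three c p o",
    "CNTK": "c n t k",
    "IEEE": "i triple e",
}

def to_pronunciation_file(entries):
    """Serialize pairs as 'display<TAB>spoken', one pair per line (assumed layout)."""
    return "\n".join(f"{display}\t{spoken}" for display, spoken in entries.items())

print(to_pronunciation_file(pronunciations))
```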

0 commit comments
