In this guide, you learn how to create captions with speech to text. This guide covers captioning for speech, but doesn't include speaker ID or sound effects such as bells ringing. Concepts include how to synchronize captions with your input audio, apply profanity filters, get partial results, apply customizations, and identify spoken languages for multilingual scenarios.
Here are some common captioning scenarios:
- Online courses and instructional videos
You'll want to synchronize captions with the audio track, whether it's done in real time or with a prerecording.
The Speech service returns the offset and duration of the recognized speech.
[!INCLUDE [Define offset and duration](includes/how-to/recognize-speech-results/define-offset-duration.md)]
For more information, see [Get speech recognition results](get-speech-recognition-results.md).
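Offset and duration values are expressed in ticks, where one tick is one hundred nanoseconds. As a quick illustration (the helper names below are hypothetical, not part of the Speech SDK), converting a tick value into seconds or a caption timestamp might look like this:

```python
# Sketch: converting Speech service tick values (1 tick = 100 ns) into
# caption timestamps. These helpers are illustrative, not SDK APIs.

TICKS_PER_SECOND = 10_000_000  # one tick is one ten-millionth of a second


def ticks_to_seconds(ticks: int) -> float:
    """Convert an offset or duration in ticks to seconds."""
    return ticks / TICKS_PER_SECOND


def ticks_to_timestamp(ticks: int) -> str:
    """Format a tick offset as an HH:MM:SS.mmm caption timestamp."""
    total_ms = ticks // 10_000  # 10,000 ticks per millisecond
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"
```

For example, an offset of `10_000_000` ticks corresponds to one second into the audio stream.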
## Get partial results
Consider when to start displaying captions, and how many words to show at a time. Speech recognition results are subject to change while an utterance is still being recognized. Partial results are returned with each `Recognizing` event. As each word is processed, the Speech service re-evaluates an utterance in the new context and again returns the best result. The new result isn't guaranteed to be the same as the previous result. The complete and final transcription of an utterance is returned with the `Recognized` event.
> [!NOTE]
> Punctuation of partial results is not available.
For captioning of prerecorded speech or wherever latency isn't a concern, you could wait for the complete transcription of each utterance before displaying any words. Given the final offset and duration of each word in an utterance, you know when to show subsequent words at pace with the soundtrack.
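For example, given per-word timings you could emit standard SRT cues whose display times match the soundtrack. This sketch assumes hypothetical `(word, offset_ticks, duration_ticks)` tuples rather than a specific SDK result type:

```python
# Sketch: building an SRT caption cue from per-word timings taken from the
# final Recognized result. The (word, offset, duration) tuples here are
# illustrative; tick values come from the Speech service (1 tick = 100 ns).

def srt_time(ticks: int) -> str:
    """Format ticks as the HH:MM:SS,mmm timestamp SRT expects."""
    ms = ticks // 10_000
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def srt_cue(index: int, words: list[tuple[str, int, int]]) -> str:
    """words: (text, offset_ticks, duration_ticks) for each word in the utterance."""
    start = words[0][1]                      # offset of the first word
    end = words[-1][1] + words[-1][2]        # offset + duration of the last word
    text = " ".join(w for w, _, _ in words)
    return f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n"
```

The cue's end time is derived from the last word's offset plus its duration, so trailing silence isn't captioned.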
Real-time captioning presents a tradeoff between latency and accuracy. You could show the text from each `Recognizing` event as soon as possible. However, if you can accept some latency, you can improve the accuracy of the caption by displaying the text from the `Recognized` event. There's also some middle ground, referred to as "stable partial results".
You can request that the Speech service return fewer `Recognizing` events that are more accurate. This is done by setting the `SpeechServiceResponse_StablePartialResultThreshold` property to a value between `0` and `2147483647`. The value that you set is the number of times a word has to be recognized before the Speech service returns a `Recognizing` event. For example, if you set the `SpeechServiceResponse_StablePartialResultThreshold` value to `5`, the Speech service will affirm recognition of a word at least five times before returning the partial results to you with a `Recognizing` event.
RECOGNIZED: Text=Welcome to applied Mathematics course 201.
```
In the previous example, the transcriptions were additive and no text was retracted. But at other times you might find that the partial results were inaccurate. In either case, the unstable partial results can be perceived as "flickering" when displayed.
For this example, if the stable partial result threshold is set to `5`, no words are altered or backtracked.
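To illustrate the idea client-side (this is not the service's algorithm, just a hypothetical sketch of the same principle), you could display only the leading words that agree across the most recent partial results:

```python
# Sketch: client-side illustration of stabilizing partial results.
# Only the leading words that agree across the most recent `threshold`
# partial transcriptions are displayed, which suppresses flicker from
# words the recognizer later revises. Not the Speech service's algorithm.

def stable_prefix(partials: list[str], threshold: int) -> str:
    """Return the word prefix shared by the last `threshold` partial results."""
    recent = [p.split() for p in partials[-threshold:]]
    if len(recent) < threshold:
        return ""  # not enough evidence yet to show anything
    prefix = []
    for words in zip(*recent):  # walk word positions common to all partials
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break  # stop at the first position where the partials disagree
    return " ".join(prefix)
```

The larger the threshold, the less flicker you see, at the cost of words appearing later.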
## Apply profanity filters

You can specify whether to mask, remove, or show profanity in recognition results.
> [!NOTE]
> Microsoft also reserves the right to mask or remove any word that is deemed inappropriate. Such words will not be returned by the Speech service, whether or not you enabled profanity filtering.
The profanity filter options are:
- `Masked`: Replaces letters in profane words with asterisk (*) characters. This is the default option.
- `Removed`: Removes profane words from recognition results.
- `Raw`: Includes profane words in recognition results.
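As an illustration of what `Masked` output looks like (the masking is performed by the Speech service; this hypothetical helper and word list only demonstrate the format):

```python
# Sketch: what Masked profanity output looks like. The Speech service does
# this masking for you; this hypothetical helper only illustrates the format.

def mask_profanity(text: str, profane_words: set[str]) -> str:
    """Replace each letter of a profane word with an asterisk."""
    return " ".join(
        "*" * len(word) if word.lower() in profane_words else word
        for word in text.split()
    )
```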
`articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/cpp.md`
You might want to synchronize transcriptions with an audio track, whether it's done in real time or with a prerecording.
The Speech service returns the offset and duration of the recognized speech.
As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
[!INCLUDE [Define offset and duration](define-offset-duration.md)]
The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.
`articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/define-offset-duration.md`

**Offset**: The offset into the audio stream being recognized, expressed as duration. Offset is measured in ticks, starting from `0` (zero) tick, associated with the first audio byte processed by the SDK. For example, the offset begins when you start recognition, since that's when the SDK starts processing the audio stream. One tick represents one hundred nanoseconds or one ten-millionth of a second.

**Duration**: Duration of the utterance that is being recognized. The duration in ticks doesn't include trailing or leading silence.