Commit a1686c4

committed
per Darren
1 parent cb50ae1 commit a1686c4

File tree

10 files changed: +33 -34 lines changed


articles/cognitive-services/Speech-Service/captioning-concepts.md

Lines changed: 8 additions & 8 deletions
@@ -15,7 +15,7 @@ zone_pivot_groups: programming-languages-speech-sdk

# Captioning with speech to text

-In this guide, you learn how to create captions with speech to text. This guide covers captioning for speech, but doesn't include speaker ID or sound effects. Concepts include how to synchronize captions with your input audio, apply profanity filters, get partial results, apply customizations, and identify spoken languages for multilingual scenarios.
+In this guide, you learn how to create captions with speech to text. This guide covers captioning for speech, but doesn't include speaker ID or sound effects such as bells ringing. Concepts include how to synchronize captions with your input audio, apply profanity filters, get partial results, apply customizations, and identify spoken languages for multilingual scenarios.

Here are some common captioning scenarios:
- Online courses and instructional videos
@@ -46,21 +46,21 @@ You'll want to synchronize captions with the audio track, whether it's done in r
The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition doesn't necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span doesn't include trailing or leading silence.
+- **Offset**: Used to measure the relative position of the speech that is currently being recognized. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
+- **Duration**: Duration of the utterance that is being recognized. The duration in ticks doesn't include trailing or leading silence.

For more information, see [Get speech recognition results](get-speech-recognition-results.md).

## Get partial results

-Consider when to start displaying captions, and how many words to show at a time. Speech recognition results are subject to change while an utterance is still being recognized. Partial intermediate results are returned with each `Recognizing` event. As each word is processed, the Speech service re-evaluates an utterance in the new context and again returns the best result. The new result isn't guaranteed to be the same as the previous result. The complete and final transcription of an utterance is returned with the `Recognized` event.
+Consider when to start displaying captions, and how many words to show at a time. Speech recognition results are subject to change while an utterance is still being recognized. Partial results are returned with each `Recognizing` event. As each word is processed, the Speech service re-evaluates an utterance in the new context and again returns the best result. The new result isn't guaranteed to be the same as the previous result. The complete and final transcription of an utterance is returned with the `Recognized` event.

> [!NOTE]
-> Punctuation of intermediate results is not available.
+> Punctuation of partial results is not available.

For captioning of prerecorded speech or wherever latency isn't a concern, you could wait for the complete transcription of each utterance before displaying any words. Given the final offset and duration of each word in an utterance, you know when to show subsequent words at pace with the soundtrack.

-Real time captioning presents tradeoffs with respect to latency versus accuracy. You could show the text from each `Recognizing` event as soon as possible. However, if you can accept some latency, you can improve the accuracy of the caption by displaying the text from the `Recognized` event. There's also some middle ground, which is referred to as "stable partial intermediate results".
+Real time captioning presents tradeoffs with respect to latency versus accuracy. You could show the text from each `Recognizing` event as soon as possible. However, if you can accept some latency, you can improve the accuracy of the caption by displaying the text from the `Recognized` event. There's also some middle ground, which is referred to as "stable partial results".

You can request that the Speech service return fewer `Recognizing` events that are more accurate. This is done by setting the `SpeechServiceResponse_StablePartialResultThreshold` property to a value between `0` and `2147483647`. The value that you set is the number of times a word has to be recognized before the Speech service returns a `Recognizing` event. For example, if you set the `SpeechServiceResponse_StablePartialResultThreshold` value to `5`, the Speech service will affirm recognition of a word at least five times before returning the partial results to you with a `Recognizing` event.
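The tick arithmetic above is easy to get wrong by a factor of ten. Here's a minimal Python sketch that converts offset and duration ticks into caption start and end times; the offset and duration values are hypothetical, and only the 100-nanosecond tick size comes from the text above.

```python
# One tick = 100 ns, so 10,000,000 ticks = 1 second.
TICKS_PER_SECOND = 10_000_000

def ticks_to_seconds(ticks: int) -> float:
    """Convert a tick count (100-nanosecond units) to seconds."""
    return ticks / TICKS_PER_SECOND

# Hypothetical values for one recognized utterance:
offset = 15_000_000    # utterance begins 1.5 s after recognition started
duration = 32_500_000  # utterance lasts 3.25 s (leading/trailing silence excluded)

caption_start = ticks_to_seconds(offset)           # 1.5
caption_end = ticks_to_seconds(offset + duration)  # 4.75
print(f"Show caption from {caption_start:.2f}s to {caption_end:.2f}s")
```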

@@ -119,7 +119,7 @@ RECOGNIZING: Text=welcome to applied mathematics course 201
RECOGNIZED: Text=Welcome to applied Mathematics course 201.
```

-In the previous example, the transcriptions were additive and no text was retracted. But at other times you might find that the intermediate results were inaccurate. In either case, the unstable intermediate results can be perceived as "flickering" when displayed.
+In the previous example, the transcriptions were additive and no text was retracted. But at other times you might find that the partial results were inaccurate. In either case, the unstable partial results can be perceived as "flickering" when displayed.

For this example, if the stable partial result threshold is set to `5`, no words are altered or backtracked.
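The stabilization idea can be modeled in a few lines. This is a simplified illustration of the threshold concept, not the Speech service's actual algorithm: here a word counts as stable once it has appeared unchanged at its position in the required number of consecutive partial results, and stability stops at the first unstable word.

```python
def stable_prefix(partials: list[list[str]], threshold: int) -> list[str]:
    """Toy model of stable partial results: return the prefix of the latest
    partial whose words each appeared unchanged at their position in at
    least `threshold` consecutive partial results."""
    counts: list[int] = []  # consecutive confirmations per word position
    prev: list[str] = []
    for words in partials:
        counts = [
            counts[i] + 1 if i < len(prev) and prev[i] == w else 1
            for i, w in enumerate(words)
        ]
        prev = words
    stable: list[str] = []
    for word, count in zip(prev, counts):
        if count < threshold:
            break  # everything after the first unstable word is unstable too
        stable.append(word)
    return stable

partials = [
    ["welcome"],
    ["welcome", "to"],
    ["welcome", "to", "applied"],
    ["welcome", "to", "applied", "mathematics"],
]
print(stable_prefix(partials, 3))  # ['welcome', 'to']
```

With a higher threshold, fewer (but more reliable) words would be displayed at any moment, which is the latency-versus-flicker tradeoff described above.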

@@ -138,7 +138,7 @@ You can specify whether to mask, remove, or show profanity in recognition result
138138
> Microsoft also reserves the right to mask or remove any word that is deemed inappropriate. Such words will not be returned by the Speech service, whether or not you enabled profanity filtering.
139139
140140
The profanity filter options are:
141-
- `Masked`: Replaces letters in profane words with asterisk (*) characters.
141+
- `Masked`: Replaces letters in profane words with asterisk (*) characters. This is the default option.
142142
- `Raw`: Include the profane words verbatim.
143143
- `Removed`: Removes profane words.
144144
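As a rough client-side illustration of what `Masked` output looks like (the real filtering happens inside the Speech service; the function and word list below are made up for the sketch):

```python
def mask_profanity(text: str, profane: set[str]) -> str:
    """Replace the letters of each profane word with asterisks,
    mimicking the `Masked` profanity filter option (toy sketch only)."""
    def mask(word: str) -> str:
        core = word.strip(".,!?;:")  # keep surrounding punctuation intact
        if core.lower() in profane:
            return word.replace(core, "*" * len(core))
        return word
    return " ".join(mask(w) for w in text.split())

print(mask_profanity("Well darn, that failed.", {"darn"}))  # Well ****, that failed.
```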

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/cpp.md

Lines changed: 2 additions & 3 deletions
@@ -17,10 +17,9 @@ You might want to synchronize transcriptions with an audio track, whether it's d

The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+Here's an example where the offset starts incrementing in ticks from `0` (zero):

```cpp
speechRecognizer->StartContinuousRecognitionAsync().get();

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/csharp.md

Lines changed: 2 additions & 3 deletions
@@ -17,10 +17,9 @@ You might want to synchronize transcriptions with an audio track, whether it's d

The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+Here's an example where the offset starts incrementing in ticks from `0` (zero):

```csharp
await speechRecognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/example-offset-duration.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+---
+author: eric-urban
+ms.service: cognitive-services
+ms.subservice: speech-service
+ms.topic: include
+ms.date: 04/13/2022
+ms.author: eur
+---
+
+- **Offset**: The duration offset into the stream of audio that is being recognized. Offset starts incrementing in ticks from `0` (zero) when the first audio byte is processed by the SDK. For example, the offset begins incrementing from `0` (zero) when you start continuous recognition, since that's when the SDK starts processing the audio stream. One tick represents one hundred nanoseconds or one ten-millionth of a second.
+- **Duration**: Duration of the utterance that is being recognized. The duration in ticks doesn't include trailing or leading silence.

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/go.md

Lines changed: 2 additions & 3 deletions
@@ -17,10 +17,9 @@ You might want to synchronize transcriptions with an audio track, whether it's d

The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+Here's an example where the offset starts incrementing in ticks from `0` (zero):

```go
speechRecognizer.StartContinuousRecognitionAsync()

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/java.md

Lines changed: 2 additions & 3 deletions
@@ -17,10 +17,9 @@ You might want to synchronize transcriptions with an audio track, whether it's d

The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+Here's an example where the offset starts incrementing in ticks from `0` (zero):

```java
speechRecognizer.startContinuousRecognitionAsync().get();

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/javascript.md

Lines changed: 2 additions & 3 deletions
@@ -17,10 +17,9 @@ You might want to synchronize transcriptions with an audio track, whether it's d

The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+Here's an example where the offset starts incrementing in ticks from `0` (zero):

```javascript
speechRecognizer.startContinuousRecognitionAsync();

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/objectivec.md

Lines changed: 1 addition & 4 deletions
@@ -17,10 +17,7 @@ You might want to synchronize transcriptions with an audio track, whether it's d

The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
-
-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/python.md

Lines changed: 2 additions & 3 deletions
@@ -17,10 +17,9 @@ You might want to synchronize transcriptions with an audio track, whether it's d

The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+Here's an example where the offset starts incrementing in ticks from `0` (zero):

```python
speech_recognizer.start_continuous_recognition()

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/swift.md

Lines changed: 1 addition & 4 deletions
@@ -17,10 +17,7 @@ You might want to synchronize transcriptions with an audio track, whether it's d

The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
-
-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.
