Commit 43d019b

Merge pull request #195010 from eric-urban/eur/caption-feedback
per Darren
2 parents f5ed28b + eeb5bec commit 43d019b

File tree

10 files changed: +26 −64 lines changed


articles/cognitive-services/Speech-Service/captioning-concepts.md

Lines changed: 7 additions & 8 deletions
@@ -15,7 +15,7 @@ zone_pivot_groups: programming-languages-speech-sdk

 # Captioning with speech to text

-In this guide, you learn how to create captions with speech to text. This guide covers captioning for speech, but doesn't include speaker ID or sound effects. Concepts include how to synchronize captions with your input audio, apply profanity filters, get partial results, apply customizations, and identify spoken languages for multilingual scenarios.
+In this guide, you learn how to create captions with speech to text. This guide covers captioning for speech, but doesn't include speaker ID or sound effects such as bells ringing. Concepts include how to synchronize captions with your input audio, apply profanity filters, get partial results, apply customizations, and identify spoken languages for multilingual scenarios.

 Here are some common captioning scenarios:
 - Online courses and instructional videos
@@ -46,21 +46,20 @@ You'll want to synchronize captions with the audio track, whether it's done in r

 The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition doesn't necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span doesn't include trailing or leading silence.
+[!INCLUDE [Define offset and duration](includes/how-to/recognize-speech-results/define-offset-duration.md)]

 For more information, see [Get speech recognition results](get-speech-recognition-results.md).

 ## Get partial results

-Consider when to start displaying captions, and how many words to show at a time. Speech recognition results are subject to change while an utterance is still being recognized. Partial intermediate results are returned with each `Recognizing` event. As each word is processed, the Speech service re-evaluates an utterance in the new context and again returns the best result. The new result isn't guaranteed to be the same as the previous result. The complete and final transcription of an utterance is returned with the `Recognized` event.
+Consider when to start displaying captions, and how many words to show at a time. Speech recognition results are subject to change while an utterance is still being recognized. Partial results are returned with each `Recognizing` event. As each word is processed, the Speech service re-evaluates an utterance in the new context and again returns the best result. The new result isn't guaranteed to be the same as the previous result. The complete and final transcription of an utterance is returned with the `Recognized` event.

 > [!NOTE]
-> Punctuation of intermediate results is not available.
+> Punctuation of partial results is not available.

 For captioning of prerecorded speech or wherever latency isn't a concern, you could wait for the complete transcription of each utterance before displaying any words. Given the final offset and duration of each word in an utterance, you know when to show subsequent words at pace with the soundtrack.

-Real time captioning presents tradeoffs with respect to latency versus accuracy. You could show the text from each `Recognizing` event as soon as possible. However, if you can accept some latency, you can improve the accuracy of the caption by displaying the text from the `Recognized` event. There's also some middle ground, which is referred to as "stable partial intermediate results".
+Real time captioning presents tradeoffs with respect to latency versus accuracy. You could show the text from each `Recognizing` event as soon as possible. However, if you can accept some latency, you can improve the accuracy of the caption by displaying the text from the `Recognized` event. There's also some middle ground, which is referred to as "stable partial results".

 You can request that the Speech service return fewer `Recognizing` events that are more accurate. This is done by setting the `SpeechServiceResponse_StablePartialResultThreshold` property to a value between `0` and `2147483647`. The value that you set is the number of times a word has to be recognized before the Speech service returns a `Recognizing` event. For example, if you set the `SpeechServiceResponse_StablePartialResultThreshold` value to `5`, the Speech service will affirm recognition of a word at least five times before returning the partial results to you with a `Recognizing` event.
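
For reference beyond the diff itself, a minimal C# sketch of this pattern (not part of this commit; the key and region are placeholders, and the surrounding setup is assumed): it raises the stable partial result threshold and prints both partial and final results.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;

// Placeholders: substitute your own key and region.
var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");

// Ask for fewer, more stable Recognizing events: each word must be
// recognized at least 5 times before it appears in a partial result.
speechConfig.SetProperty(PropertyId.SpeechServiceResponse_StablePartialResultThreshold, "5");

using var speechRecognizer = new SpeechRecognizer(speechConfig);

// Partial results: subject to change until the utterance completes.
speechRecognizer.Recognizing += (sender, e) =>
    Console.WriteLine($"RECOGNIZING: Text={e.Result.Text}");

// Final result: the complete transcription of the utterance.
speechRecognizer.Recognized += (sender, e) =>
{
    if (e.Result.Reason == ResultReason.RecognizedSpeech)
    {
        Console.WriteLine($"RECOGNIZED: Text={e.Result.Text}");
    }
};

await speechRecognizer.StartContinuousRecognitionAsync();
await Task.Delay(TimeSpan.FromSeconds(30)); // keep listening for a while
await speechRecognizer.StopContinuousRecognitionAsync();
```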

@@ -119,7 +118,7 @@ RECOGNIZING: Text=welcome to applied mathematics course 201
 RECOGNIZED: Text=Welcome to applied Mathematics course 201.
 ```

-In the previous example, the transcriptions were additive and no text was retracted. But at other times you might find that the intermediate results were inaccurate. In either case, the unstable intermediate results can be perceived as "flickering" when displayed.
+In the previous example, the transcriptions were additive and no text was retracted. But at other times you might find that the partial results were inaccurate. In either case, the unstable partial results can be perceived as "flickering" when displayed.

 For this example, if the stable partial result threshold is set to `5`, no words are altered or backtracked.

@@ -138,7 +137,7 @@ You can specify whether to mask, remove, or show profanity in recognition result
 > Microsoft also reserves the right to mask or remove any word that is deemed inappropriate. Such words will not be returned by the Speech service, whether or not you enabled profanity filtering.

 The profanity filter options are:
-- `Masked`: Replaces letters in profane words with asterisk (*) characters.
+- `Masked`: Replaces letters in profane words with asterisk (*) characters. This is the default option.
 - `Raw`: Include the profane words verbatim.
 - `Removed`: Removes profane words.
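
The filter is set on the speech configuration before the recognizer is created. A minimal C# sketch (not part of this commit; the key and region are placeholders, and `SetProfanity` with `ProfanityOption` are the C# Speech SDK names):

```csharp
using Microsoft.CognitiveServices.Speech;

// Placeholders: substitute your own key and region.
var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");

// Masked is the default; ProfanityOption.Raw and ProfanityOption.Removed
// are the alternatives described above.
speechConfig.SetProfanity(ProfanityOption.Masked);
```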

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/cpp.md

Lines changed: 1 addition & 8 deletions
@@ -17,14 +17,7 @@ You might want to synchronize transcriptions with an audio track, whether it's d

 The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
-
-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
-
-```cpp
-speechRecognizer->StartContinuousRecognitionAsync().get();
-```
+[!INCLUDE [Define offset and duration](define-offset-duration.md)]

 The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/csharp.md

Lines changed: 1 addition & 8 deletions
@@ -17,14 +17,7 @@ You might want to synchronize transcriptions with an audio track, whether it's d

 The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
-
-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
-
-```csharp
-await speechRecognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
-```
+[!INCLUDE [Define offset and duration](define-offset-duration.md)]

 The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/define-offset-duration.md

Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
+---
+author: eric-urban
+ms.service: cognitive-services
+ms.subservice: speech-service
+ms.topic: include
+ms.date: 04/13/2022
+ms.author: eur
+---
+
+- **Offset**: The offset into the audio stream being recognized, expressed as a duration. Offset is measured in ticks, starting from the `0` (zero) tick associated with the first audio byte processed by the SDK. For example, the offset begins when you start recognition, since that's when the SDK starts processing the audio stream. One tick represents one hundred nanoseconds or one ten-millionth of a second.
+- **Duration**: Duration of the utterance that is being recognized. The duration in ticks doesn't include trailing or leading silence.
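
Since a tick is one hundred nanoseconds, the same unit as a .NET `TimeSpan` tick, offset and duration convert directly to timestamps. A minimal C# sketch (not part of this commit; the helper name is hypothetical, and `result` is assumed to be a `SpeechRecognitionResult` from a `Recognized` event):

```csharp
using System;
using Microsoft.CognitiveServices.Speech;

// Hypothetical helper: derive caption timing from a recognized result.
static void PrintCaptionTiming(SpeechRecognitionResult result)
{
    // OffsetInTicks and Duration use 100-nanosecond ticks,
    // which is also the resolution of System.TimeSpan.
    TimeSpan start = TimeSpan.FromTicks(result.OffsetInTicks);
    TimeSpan end = start + result.Duration;

    // For example, an SRT-style caption timestamp line.
    Console.WriteLine($"{start:hh\\:mm\\:ss\\,fff} --> {end:hh\\:mm\\:ss\\,fff}");
}
```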

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/go.md

Lines changed: 1 addition & 8 deletions
@@ -17,14 +17,7 @@ You might want to synchronize transcriptions with an audio track, whether it's d

 The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
-
-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
-
-```go
-speechRecognizer.StartContinuousRecognitionAsync()
-```
+[!INCLUDE [Define offset and duration](define-offset-duration.md)]

 The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/java.md

Lines changed: 1 addition & 8 deletions
@@ -17,14 +17,7 @@ You might want to synchronize transcriptions with an audio track, whether it's d

 The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
-
-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
-
-```java
-speechRecognizer.startContinuousRecognitionAsync().get();
-```
+[!INCLUDE [Define offset and duration](define-offset-duration.md)]

 The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/javascript.md

Lines changed: 1 addition & 8 deletions
@@ -17,14 +17,7 @@ You might want to synchronize transcriptions with an audio track, whether it's d

 The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
-
-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
-
-```javascript
-speechRecognizer.startContinuousRecognitionAsync();
-```
+[!INCLUDE [Define offset and duration](define-offset-duration.md)]

 The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/objectivec.md

Lines changed: 1 addition & 4 deletions
@@ -17,10 +17,7 @@ You might want to synchronize transcriptions with an audio track, whether it's d

 The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
-
-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+[!INCLUDE [Define offset and duration](define-offset-duration.md)]

 The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/python.md

Lines changed: 1 addition & 8 deletions
@@ -17,14 +17,7 @@ You might want to synchronize transcriptions with an audio track, whether it's d

 The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
-
-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
-
-```python
-speech_recognizer.start_continuous_recognition()
-```
+[!INCLUDE [Define offset and duration](define-offset-duration.md)]

 The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/swift.md

Lines changed: 1 addition & 4 deletions
@@ -17,10 +17,7 @@ You might want to synchronize transcriptions with an audio track, whether it's d

 The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
-
-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+[!INCLUDE [Define offset and duration](define-offset-duration.md)]

 The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.
