In this guide, you learn how to create captions with speech to text. This guide covers captioning for speech, but doesn't include speaker ID or sound effects such as bells ringing. Concepts include how to synchronize captions with your input audio, apply profanity filters, get partial results, apply customizations, and identify spoken languages for multilingual scenarios.
Here are some common captioning scenarios:
- Online courses and instructional videos
You'll want to synchronize captions with the audio track, whether it's done in real time or with a prerecording.
The Speech service returns the offset and duration of the recognized speech.
[!INCLUDE [Define offset and duration](includes/how-to/recognize-speech-results/define-offset-duration.md)]
For more information, see [Get speech recognition results](get-speech-recognition-results.md).
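Offset and duration values are expressed in ticks, where one tick is one hundred nanoseconds. As a quick illustration (the helper names below are hypothetical, not part of the Speech SDK), converting a tick value into seconds or a caption timestamp might look like this:

```python
# Sketch: converting Speech service tick values (1 tick = 100 ns) into
# caption timestamps. These helpers are illustrative, not SDK APIs.

TICKS_PER_SECOND = 10_000_000  # one tick is one ten-millionth of a second


def ticks_to_seconds(ticks: int) -> float:
    """Convert an offset or duration in ticks to seconds."""
    return ticks / TICKS_PER_SECOND


def ticks_to_timestamp(ticks: int) -> str:
    """Format a tick offset as an HH:MM:SS.mmm caption timestamp."""
    total_ms = ticks // 10_000  # 10,000 ticks per millisecond
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"
```

For example, an offset of `10_000_000` ticks corresponds to one second into the audio stream.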
## Get partial results
Consider when to start displaying captions, and how many words to show at a time. Speech recognition results are subject to change while an utterance is still being recognized. Partial results are returned with each `Recognizing` event. As each word is processed, the Speech service re-evaluates an utterance in the new context and again returns the best result. The new result isn't guaranteed to be the same as the previous result. The complete and final transcription of an utterance is returned with the `Recognized` event.
> [!NOTE]
> Punctuation of partial results is not available.
For captioning of prerecorded speech or wherever latency isn't a concern, you could wait for the complete transcription of each utterance before displaying any words. Given the final offset and duration of each word in an utterance, you know when to show subsequent words at pace with the soundtrack.
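For example, given per-word timings you could emit standard SRT cues whose display times match the soundtrack. This sketch assumes hypothetical `(word, offset_ticks, duration_ticks)` tuples rather than a specific SDK result type:

```python
# Sketch: building an SRT caption cue from per-word timings taken from the
# final Recognized result. The (word, offset, duration) tuples here are
# illustrative; tick values come from the Speech service (1 tick = 100 ns).

def srt_time(ticks: int) -> str:
    """Format ticks as the HH:MM:SS,mmm timestamp SRT expects."""
    ms = ticks // 10_000
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def srt_cue(index: int, words: list[tuple[str, int, int]]) -> str:
    """words: (text, offset_ticks, duration_ticks) for each word in the utterance."""
    start = words[0][1]                      # offset of the first word
    end = words[-1][1] + words[-1][2]        # offset + duration of the last word
    text = " ".join(w for w, _, _ in words)
    return f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n"
```

The cue's end time is derived from the last word's offset plus its duration, so trailing silence isn't captioned.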
Real-time captioning presents a tradeoff between latency and accuracy. You could show the text from each `Recognizing` event as soon as possible. However, if you can accept some latency, you can improve the accuracy of the caption by displaying the text from the `Recognized` event. There's also some middle ground, referred to as "stable partial results".
You can request that the Speech service return fewer `Recognizing` events that are more accurate. This is done by setting the `SpeechServiceResponse_StablePartialResultThreshold` property to a value between `0` and `2147483647`. The value that you set is the number of times a word has to be recognized before the Speech service returns a `Recognizing` event. For example, if you set the `SpeechServiceResponse_StablePartialResultThreshold` value to `5`, the Speech service will affirm recognition of a word at least five times before returning the partial results to you with a `Recognizing` event.
RECOGNIZED: Text=Welcome to applied Mathematics course 201.
```
In the previous example, the transcriptions were additive and no text was retracted. But at other times you might find that the partial results were inaccurate. In either case, the unstable partial results can be perceived as "flickering" when displayed.
For this example, if the stable partial result threshold is set to `5`, no words are altered or backtracked.
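To illustrate the idea client-side (this is not the service's algorithm, just a hypothetical sketch of the same principle), you could display only the leading words that agree across the most recent partial results:

```python
# Sketch: client-side illustration of stabilizing partial results.
# Only the leading words that agree across the most recent `threshold`
# partial transcriptions are displayed, which suppresses flicker from
# words the recognizer later revises. Not the Speech service's algorithm.

def stable_prefix(partials: list[str], threshold: int) -> str:
    """Return the word prefix shared by the last `threshold` partial results."""
    recent = [p.split() for p in partials[-threshold:]]
    if len(recent) < threshold:
        return ""  # not enough evidence yet to show anything
    prefix = []
    for words in zip(*recent):  # walk word positions common to all partials
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break  # stop at the first position where the partials disagree
    return " ".join(prefix)
```

The larger the threshold, the less flicker you see, at the cost of words appearing later.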
## Apply profanity filters

You can specify whether to mask, remove, or show profanity in recognition results.
> [!NOTE]
> Microsoft also reserves the right to mask or remove any word that is deemed inappropriate. Such words will not be returned by the Speech service, whether or not you enabled profanity filtering.
The profanity filter options are:
- `Masked`: Replaces letters in profane words with asterisk (*) characters. This is the default option.
- `Removed`: Removes profane words from recognition results.
- `Raw`: Includes profane words in recognition results.
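As an illustration of what `Masked` output looks like (the masking is performed by the Speech service; this hypothetical helper and word list only demonstrate the format):

```python
# Sketch: what Masked profanity output looks like. The Speech service does
# this masking for you; this hypothetical helper only illustrates the format.

def mask_profanity(text: str, profane_words: set[str]) -> str:
    """Replace each letter of a profane word with an asterisk."""
    return " ".join(
        "*" * len(word) if word.lower() in profane_words else word
        for word in text.split()
    )
```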
`articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/cpp.md`
You might want to synchronize transcriptions with an audio track, whether it's done in real time or with a prerecording.
The Speech service returns the offset and duration of the recognized speech.
As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
[!INCLUDE [Define offset and duration](define-offset-duration.md)]
The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.
`articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/define-offset-duration.md`

**Offset**: The offset into the audio stream being recognized, expressed as duration. Offset is measured in ticks, starting from `0` (zero) tick, associated with the first audio byte processed by the SDK. For example, the offset begins when you start recognition, since that's when the SDK starts processing the audio stream. One tick represents one hundred nanoseconds or one ten-millionth of a second.

**Duration**: Duration of the utterance that is being recognized. The duration in ticks doesn't include trailing or leading silence.