-In this guide, you learn how to create captions with speech to text. This guide covers captioning for speech, but doesn't include speaker ID or sound effects. Concepts include how to synchronize captions with your input audio, apply profanity filters, get partial results, apply customizations, and identify spoken languages for multilingual scenarios.
+In this guide, you learn how to create captions with speech to text. This guide covers captioning for speech, but doesn't include speaker ID or sound effects such as bells ringing. Concepts include how to synchronize captions with your input audio, apply profanity filters, get partial results, apply customizations, and identify spoken languages for multilingual scenarios.

Here are some common captioning scenarios:

- Online courses and instructional videos
@@ -46,21 +46,21 @@ You'll want to synchronize captions with the audio track, whether it's done in r
The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition doesn't necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span doesn't include trailing or leading silence.
+- **Offset**: Used to measure the relative position of the speech that is currently being recognized. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
+- **Duration**: Duration of the utterance that is being recognized. The duration in ticks doesn't include trailing or leading silence.

For more information, see [Get speech recognition results](get-speech-recognition-results.md).
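Since a tick is one hundred nanoseconds, converting an offset and duration into caption timestamps is plain integer arithmetic. Here's a minimal sketch in Python; the `ticks_to_timestamp` and `caption_window` helpers are hypothetical, not part of the Speech SDK.

```python
# Each tick is 100 ns, so 10,000,000 ticks = 1 second and 10,000 ticks = 1 ms.
TICKS_PER_SECOND = 10_000_000

def ticks_to_timestamp(ticks: int) -> str:
    """Format a tick count as an SRT-style HH:MM:SS,mmm timestamp."""
    total_ms = ticks // 10_000          # 10,000 ticks per millisecond
    ms = total_ms % 1000
    seconds = (total_ms // 1000) % 60
    minutes = (total_ms // 60_000) % 60
    hours = total_ms // 3_600_000
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"

def caption_window(offset_ticks: int, duration_ticks: int) -> tuple[str, str]:
    """Start and end timestamps for a recognized utterance."""
    return (ticks_to_timestamp(offset_ticks),
            ticks_to_timestamp(offset_ticks + duration_ticks))

# An utterance starting 1.5 s into the audio and lasting 2.75 s:
start, end = caption_window(15_000_000, 27_500_000)
print(start, "-->", end)  # 00:00:01,500 --> 00:00:04,250
```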
## Get partial results
-Consider when to start displaying captions, and how many words to show at a time. Speech recognition results are subject to change while an utterance is still being recognized. Partial intermediate results are returned with each `Recognizing` event. As each word is processed, the Speech service re-evaluates an utterance in the new context and again returns the best result. The new result isn't guaranteed to be the same as the previous result. The complete and final transcription of an utterance is returned with the `Recognized` event.
+Consider when to start displaying captions, and how many words to show at a time. Speech recognition results are subject to change while an utterance is still being recognized. Partial results are returned with each `Recognizing` event. As each word is processed, the Speech service re-evaluates an utterance in the new context and again returns the best result. The new result isn't guaranteed to be the same as the previous result. The complete and final transcription of an utterance is returned with the `Recognized` event.
> [!NOTE]
-> Punctuation of intermediate results is not available.
+> Punctuation of partial results is not available.
For captioning of prerecorded speech or wherever latency isn't a concern, you could wait for the complete transcription of each utterance before displaying any words. Given the final offset and duration of each word in an utterance, you know when to show subsequent words at pace with the soundtrack.
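As a sketch of that pacing idea, suppose you already have the final per-word offsets and durations (the word list and tick values below are made up for illustration); the time at which to reveal each word is just its offset converted to seconds:

```python
TICKS_PER_SECOND = 10_000_000  # one tick = 100 ns

# Hypothetical per-word results: (word, offset_ticks, duration_ticks).
words = [
    ("Welcome", 0, 5_000_000),
    ("to", 6_000_000, 2_000_000),
    ("applied", 9_000_000, 6_000_000),
    ("mathematics", 16_000_000, 9_000_000),
]

def reveal_schedule(words):
    """Seconds into the soundtrack at which to display each word."""
    return [(w, offset / TICKS_PER_SECOND) for w, offset, _ in words]

for word, at in reveal_schedule(words):
    print(f"{at:5.2f}s  {word}")
```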
-Real time captioning presents tradeoffs with respect to latency versus accuracy. You could show the text from each `Recognizing` event as soon as possible. However, if you can accept some latency, you can improve the accuracy of the caption by displaying the text from the `Recognized` event. There's also some middle ground, which is referred to as "stable partial intermediate results".
+Real time captioning presents tradeoffs with respect to latency versus accuracy. You could show the text from each `Recognizing` event as soon as possible. However, if you can accept some latency, you can improve the accuracy of the caption by displaying the text from the `Recognized` event. There's also some middle ground, which is referred to as "stable partial results".
You can request that the Speech service return fewer `Recognizing` events that are more accurate. This is done by setting the `SpeechServiceResponse_StablePartialResultThreshold` property to a value between `0` and `2147483647`. The value that you set is the number of times a word has to be recognized before the Speech service returns a `Recognizing` event. For example, if you set the `SpeechServiceResponse_StablePartialResultThreshold` value to `5`, the Speech service will affirm recognition of a word at least five times before returning the partial results to you with a `Recognizing` event.
RECOGNIZED: Text=Welcome to applied Mathematics course 201.
```
-In the previous example, the transcriptions were additive and no text was retracted. But at other times you might find that the intermediate results were inaccurate. In either case, the unstable intermediate results can be perceived as "flickering" when displayed.
+In the previous example, the transcriptions were additive and no text was retracted. But at other times you might find that the partial results were inaccurate. In either case, the unstable partial results can be perceived as "flickering" when displayed.
For this example, if the stable partial result threshold is set to `5`, no words are altered or backtracked.
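The thresholding concept can be mimicked outside the SDK with a small simulation: treat a word as stable only once the last N partial hypotheses agree on it at the same position. This is an illustration of the idea, not the service's actual algorithm.

```python
def stable_prefix(hypotheses: list[list[str]], threshold: int) -> list[str]:
    """Longest word prefix unchanged across the last `threshold` hypotheses.

    Mimics the idea behind SpeechServiceResponse_StablePartialResultThreshold:
    a word counts as stable once enough successive partial results agree on it.
    """
    if len(hypotheses) < threshold:
        return []
    recent = hypotheses[-threshold:]
    stable = []
    for i, word in enumerate(recent[-1]):
        if all(len(h) > i and h[i] == word for h in recent):
            stable.append(word)
        else:
            break
    return stable

# Successive partial results while the speaker is mid-sentence:
partials = [
    ["welcome"],
    ["welcome", "to"],
    ["welcome", "to", "applied"],
    ["welcome", "to", "applied", "math"],
    ["welcome", "to", "applied", "mathematics"],
]
print(stable_prefix(partials, threshold=3))  # ['welcome', 'to', 'applied']
```

With a higher threshold, fewer (but steadier) words would be shown at any moment, which is the latency-versus-flicker tradeoff described above.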
@@ -138,7 +138,7 @@ You can specify whether to mask, remove, or show profanity in recognition result
> Microsoft also reserves the right to mask or remove any word that is deemed inappropriate. Such words will not be returned by the Speech service, whether or not you enabled profanity filtering.
The profanity filter options are:
-- `Masked`: Replaces letters in profane words with asterisk (*) characters.
+- `Masked`: Replaces letters in profane words with asterisk (*) characters. This is the default option.
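As a rough illustration of what `Masked` output looks like (the real masking happens server-side in the Speech service; the word list and helper below are purely hypothetical):

```python
def mask_profanity(text: str, profane: set[str]) -> str:
    """Replace each letter of a listed word with '*', like the Masked option."""
    return " ".join("*" * len(w) if w.lower() in profane else w
                    for w in text.split())

# Stand-in word list for illustration only.
print(mask_profanity("well darn it", {"darn"}))  # well **** it
```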
articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/cpp.md (+2 −3)
@@ -17,10 +17,9 @@ You might want to synchronize transcriptions with an audio track, whether it's d
The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+Here's an example where the offset starts incrementing in ticks from `0` (zero):
articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/csharp.md (+2 −3)
@@ -17,10 +17,9 @@ You might want to synchronize transcriptions with an audio track, whether it's d
The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+Here's an example where the offset starts incrementing in ticks from `0` (zero):
+- **Offset**: The duration offset into the stream of audio that is being recognized. Offset starts incrementing in ticks from `0` (zero) when the first audio byte is processed by the SDK. For example, the offset begins incrementing from `0` (zero) when you start continuous recognition, since that's when the SDK starts processing the audio stream. One tick represents one hundred nanoseconds or one ten-millionth of a second.
+- **Duration**: Duration of the utterance that is being recognized. The duration in ticks doesn't include trailing or leading silence.
articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/go.md (+2 −3)
@@ -17,10 +17,9 @@ You might want to synchronize transcriptions with an audio track, whether it's d
The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+Here's an example where the offset starts incrementing in ticks from `0` (zero):
articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/java.md (+2 −3)
@@ -17,10 +17,9 @@ You might want to synchronize transcriptions with an audio track, whether it's d
The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+Here's an example where the offset starts incrementing in ticks from `0` (zero):
articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/javascript.md (+2 −3)
@@ -17,10 +17,9 @@ You might want to synchronize transcriptions with an audio track, whether it's d
The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+Here's an example where the offset starts incrementing in ticks from `0` (zero):
articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/objectivec.md (+1 −4)
@@ -17,10 +17,7 @@ You might want to synchronize transcriptions with an audio track, whether it's d
The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
-
-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]
The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.
articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/python.md (+2 −3)
@@ -17,10 +17,9 @@ You might want to synchronize transcriptions with an audio track, whether it's d
The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+Here's an example where the offset starts incrementing in ticks from `0` (zero):
articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/swift.md (+1 −4)
@@ -17,10 +17,7 @@ You might want to synchronize transcriptions with an audio track, whether it's d
The Speech service returns the offset and duration of the recognized speech.

-- **Offset**: Used to measure the relative position of the speech that is currently being recognized, from the time that you started speech recognition. Speech recognition does not necessarily start at the beginning of the audio track. Offset is measured in ticks, where a single tick represents one hundred nanoseconds or one ten-millionth of a second.
-- **Duration**: Duration of the utterance that is being recognized. The duration time span does not include trailing or leading silence.
-
-As soon as you start continuous recognition, the offset starts incrementing in ticks from `0` (zero).
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]
The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.