The following are aspects to consider when using captioning:
* Center captions horizontally on the screen, in a large and prominent font.
* Consider whether to use partial results, when to start displaying captions, and how many words to show at a time.
* Learn about captioning protocols such as [SMPTE-TT](https://ieeexplore.ieee.org/document/7291854).
* Consider output formats such as SRT (SubRip Text) and WebVTT (Web Video Text Tracks). These can be loaded onto most video players such as VLC, automatically adding the captions to your video.
> [!TIP]
> Try the [Azure Video Indexer](/azure/azure-video-indexer/video-indexer-overview) as a demonstration of how you can get captions for videos that you upload.
Captioning can accompany real-time or prerecorded speech. Whether you're showing captions in real time or with a recording, you can use the [Speech SDK](speech-sdk.md) or [Speech CLI](spx-overview.md) to recognize speech and get transcriptions. You can also use the [Batch transcription API](batch-transcription.md) for prerecorded video.
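For example, here's a minimal sketch (an illustration, not the article's own sample) that uses the Speech SDK for Java to transcribe a single utterance from a WAV file; the key, region, and file name are placeholders:

```java
import com.microsoft.cognitiveservices.speech.*;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;

public class CaptionSketch {
    public static void main(String[] args) throws Exception {
        // Placeholders: replace with your Speech resource key and region.
        SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourKey", "YourRegion");
        AudioConfig audioConfig = AudioConfig.fromWavFileInput("caption.this.wav");

        try (SpeechRecognizer recognizer = new SpeechRecognizer(speechConfig, audioConfig)) {
            // Recognize the first utterance and print the final transcription.
            SpeechRecognitionResult result = recognizer.recognizeOnceAsync().get();
            System.out.println(result.getText());
        }
    }
}
```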
## Caption output format
The Speech service supports output formats such as SRT (SubRip Text) and WebVTT (Web Video Text Tracks). These can be loaded onto most video players such as VLC, automatically adding the captions to your video.
The [SRT](https://docs.fileformat.com/video/srt/) (SubRip Text) timespan output format is `hh:mm:ss,fff`.
```srt
1
00:00:00,180 --> 00:00:03,230
Welcome to applied Mathematics course 201.
```
The [WebVTT](https://www.w3.org/TR/webvtt1/#introduction) (Web Video Text Tracks) timespan output format is `hh:mm:ss.fff`.
```
WEBVTT

00:00:00.180 --> 00:00:03.230
Welcome to applied Mathematics course 201.
{
  "ResultId": "8e89437b4b9349088a933f8db4ccc263",
  "Duration": "00:00:03.0500000"
}
```
## Input audio to the Speech service
For captioning of prerecorded speech or wherever latency isn't a concern, you could wait for the complete transcription of each utterance before displaying any words.
Real-time captioning presents tradeoffs with respect to latency versus accuracy. You could show the text from each `Recognizing` event as soon as possible. However, if you can accept some latency, you can improve the accuracy of the caption by displaying the text from the `Recognized` event. There's also some middle ground, which is referred to as "stable partial results".
You can request that the Speech service return fewer `Recognizing` events that are more accurate. This is done by setting the `SpeechServiceResponse_StablePartialResultThreshold` property to a value between `0` and `2147483647`. The value that you set is the number of times a word has to be recognized before the Speech service returns a `Recognizing` event. For example, if you set the `SpeechServiceResponse_StablePartialResultThreshold` property value to `5`, the Speech service will affirm recognition of a word at least five times before returning the partial results to you with a `Recognizing` event.
Requesting more stable partial results reduces the "flickering" or changing text, but it can increase latency as you wait for higher-confidence results.
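Here's a minimal Java sketch (assuming an existing `speechConfig`) of how you might set the threshold:

```java
// Require each word to be affirmed five times before it appears in
// Recognizing (partial) results. Higher values mean steadier captions but more latency.
speechConfig.setProperty(PropertyId.SpeechServiceResponse_StablePartialResultThreshold, "5");
```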
The profanity filter is applied to the result `Text` and `MaskedNormalizedForm` properties. It isn't applied to the result `LexicalForm` and `NormalizedForm` properties, nor to the word-level results.
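For example, here's a minimal Java sketch (assuming an existing `speechConfig`) that masks profanity in the display text:

```java
// Mask profanity in result text; ProfanityOption.Removed and ProfanityOption.Raw are the other options.
speechConfig.setProfanity(ProfanityOption.Masked);
```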
There are some situations where [training a custom model](custom-speech-overview.md) is helpful for improving recognition accuracy.
## Next steps
* [Get started with speech to text](get-started-speech-to-text.md)
articles/cognitive-services/Speech-Service/improve-accuracy-phrase-list.md
Now try Speech Studio to see how a phrase list can improve recognition accuracy.
1. Sign in to [Speech Studio](https://speech.microsoft.com/).
1. Select **Real-time Speech-to-text**.
1. Test speech recognition by uploading an audio file or recording audio with a microphone. For example, select **record audio with a microphone** and then say "Hi Rehaan, this is Jessie from Contoso bank." Then select the red button to stop recording.
1. You should see the transcription result in the **Test results** text box. If "Rehaan", "Jessie", or "Contoso" were recognized incorrectly, you can add the terms to a phrase list in the next step.
1. Select **Show advanced options** and turn on **Phrase list**.
1. Enter "Contoso;Jessie;Rehaan" in the phrase list text box. Note that multiple phrases need to be separated by a semicolon.
:::image type="content" source="./media/custom-speech/phrase-list-after-zoom.png" alt-text="Screenshot of a phrase list applied in Speech Studio." lightbox="./media/custom-speech/phrase-list-after-full.png":::
1. Use the microphone to test recognition again. Otherwise, you can select the retry arrow next to your audio file to rerun your audio. The terms "Rehaan", "Jessie", and "Contoso" should now be recognized. If you use the Speech SDK instead of Speech Studio, you can apply the same phrase list in code, as shown in the sketch after these steps.
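Here's a minimal Speech SDK sketch in Java (an illustration, not part of the original walkthrough) that applies the same phrases; it assumes an existing `speechRecognizer`:

```java
// Bias recognition toward unusual or domain-specific terms.
PhraseListGrammar phraseList = PhraseListGrammar.fromRecognizer(speechRecognizer);
phraseList.addPhrase("Contoso");
phraseList.addPhrase("Jessie");
phraseList.addPhrase("Rehaan");
```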
You might want to synchronize transcriptions with an audio track, whether it's done in real time or with a prerecording.
The Speech service returns the offset and duration of the recognized speech.
[!INCLUDE [Define offset and duration](define-offset-duration.md)]
The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. `Recognizing` events provide intermediate results that are subject to change while an audio stream is being processed. `Recognized` events provide the final transcribed text once processing of an utterance is completed.
### Recognizing offset and duration
You'll want to synchronize captions with the audio track, whether it's done in real time or with a prerecording. With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.
For example, run the following command to get the offset and duration of the recognized speech:
```console
spx recognize --file caption.this.mp4 --format any --output each file - @output.each.detailed
```
Since the `@output.each.detailed` argument was set, the output includes detailed result columns. In the `result.json` column, you can find details that include offset and duration for the `Recognizing` and `Recognized` events:
```json
{
  "Id": "492574cd8555481a92c22f5ff757ef17",
  "RecognitionStatus": "Success",
  "DisplayText": "Welcome to applied Mathematics course 201.",
  "Offset": 1800000,
  "Duration": 30500000
}
```
For more information, see the Speech CLI [datastore configuration and output options](~/articles/cognitive-services/speech-service/spx-data-store-configuration.md).
[!INCLUDE [Example offset and duration](example-offset-duration.md)]
articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/cpp.md
The end of a single utterance is determined by listening for silence at the end.
With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.
This code snippet shows how to get the offset and duration from a `Recognizing` event.
```cpp
speechRecognizer->Recognizing.Connect([](const SpeechRecognitionEventArgs& e)
{
    // Offset and duration are reported in ticks (100-nanosecond units).
    cout << "Offset in Ticks:" << e.Result->Offset() << std::endl;
    cout << "Duration in Ticks:" << e.Result->Duration() << std::endl;
});
```
### Recognized offset and duration
Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the `Recognized` event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding `SpeechConfig` property as shown here:
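A minimal sketch of that call follows; it assumes `speechConfig` is your `SpeechConfig` instance:

```cpp
// Request word-level offset and duration in Recognized results.
speechConfig->RequestWordLevelTimestamps();
```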
articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/go.md
The end of a single utterance is determined by listening for silence at the end.
With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.
This code snippet shows how to get the offset and duration from a `Recognizing` event.
```go
func recognizingHandler(event speech.SpeechRecognitionEventArgs) {
	defer event.Close()
	// Offset and duration are reported in ticks (100-nanosecond units).
	fmt.Println("Offset in Ticks:", event.Result.Offset)
	fmt.Println("Duration in Ticks:", event.Result.Duration)
}
```
### Recognized offset and duration
Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the `Recognized` event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding `SpeechConfig` property as shown here:
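A minimal sketch of that call follows; it assumes the Go SDK exposes the same `RequestWordLevelTimestamps` helper as the other SDKs:

```go
// Request word-level offset and duration in Recognized results.
speechConfig.RequestWordLevelTimestamps()
```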
articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/java.md
The end of a single utterance is determined by listening for silence at the end.
With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.
This code snippet shows how to get the offset and duration from a `Recognizing` event.
```java
speechRecognizer.recognizing.addEventListener((s, e) -> {
    // Offset and duration are reported in ticks (100-nanosecond units).
    System.out.println("Offset in Ticks: " + e.getResult().getOffset());
    System.out.println("Duration in Ticks: " + e.getResult().getDuration());
});
```
### Recognized offset and duration
Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the `Recognized` event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding `SpeechConfig` property as shown here:
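A minimal sketch of that call follows; it assumes `speechConfig` is your `SpeechConfig` instance:

```java
// Request word-level offset and duration in Recognized results.
speechConfig.requestWordLevelTimestamps();
```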
articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/javascript.md
The end of a single utterance is determined by listening for silence at the end.
With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.
This code snippet shows how to get the offset and duration from a `Recognizing` event.
```javascript
speechRecognizer.recognizing = function (s, e) {
    // Offset and duration are reported in ticks (100-nanosecond units).
    console.log("RECOGNIZING: " + e.result.text);
    console.log("Offset in Ticks: " + e.result.offset);
    console.log("Duration in Ticks: " + e.result.duration);
};
```
### Recognized offset and duration
Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the `Recognized` event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding `SpeechConfig` property as shown here:
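A minimal sketch of that call follows; it assumes `speechConfig` is your `SpeechConfig` instance:

```javascript
// Request word-level offset and duration in Recognized results.
speechConfig.requestWordLevelTimestamps();
```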