
Commit 18b287b

Merge pull request #195964 from eric-urban/eur/caption-quickstart
init quickstart
2 parents ae25ae9 + 69ba681 commit 18b287b


42 files changed: +1134 −84 lines

articles/cognitive-services/Speech-Service/captioning-concepts.md

Lines changed: 40 additions & 5 deletions
@@ -10,7 +10,7 @@ ms.subservice: speech-service
 ms.topic: conceptual
 ms.date: 04/12/2022
 ms.author: eur
-zone_pivot_groups: programming-languages-speech-sdk
+zone_pivot_groups: programming-languages-speech-sdk-cli
 ---

 # Captioning with speech to text
@@ -27,12 +27,37 @@ The following are aspects to consider when using captioning:
 * Center captions horizontally on the screen, in a large and prominent font.
 * Consider whether to use partial results, when to start displaying captions, and how many words to show at a time.
 * Learn about captioning protocols such as [SMPTE-TT](https://ieeexplore.ieee.org/document/7291854).
-* Consider output formats such as SRT (SubRip Subtitle) and WebVTT (Web Video Text Tracks). These can be loaded onto most video players such as VLC, automatically adding the captions on to your video.
+* Consider output formats such as SRT (SubRip Text) and WebVTT (Web Video Text Tracks). These can be loaded onto most video players, such as VLC, automatically adding the captions to your video.

 > [!TIP]
 > Try the [Azure Video Indexer](/azure/azure-video-indexer/video-indexer-overview) as a demonstration of how you can get captions for videos that you upload.

-Captioning can accompany real time or pre-recorded speech. Whether you're showing captions in real time or with a recording, you can use the [Speech SDK](speech-sdk.md) to recognize speech and get transcriptions. You can also use the [Batch transcription API](batch-transcription.md) for pre-recorded video.
+Captioning can accompany real time or pre-recorded speech. Whether you're showing captions in real time or with a recording, you can use the [Speech SDK](speech-sdk.md) or [Speech CLI](spx-overview.md) to recognize speech and get transcriptions. You can also use the [Batch transcription API](batch-transcription.md) for pre-recorded video.
+
+## Caption output format
+
+The Speech service supports output formats such as SRT (SubRip Text) and WebVTT (Web Video Text Tracks). These can be loaded onto most video players, such as VLC, automatically adding the captions to your video.
+
+The [SRT](https://docs.fileformat.com/video/srt/) (SubRip Text) timespan output format is `hh:mm:ss,fff`.
+
+```srt
+1
+00:00:00,180 --> 00:00:03,230
+Welcome to applied Mathematics course 201.
+```
+
+The [WebVTT](https://www.w3.org/TR/webvtt1/#introduction) (Web Video Text Tracks) timespan output format is `hh:mm:ss.fff`.
+
+```
+WEBVTT
+
+00:00:00.180 --> 00:00:03.230
+Welcome to applied Mathematics course 201.
+{
+  "ResultId": "8e89437b4b9349088a933f8db4ccc263",
+  "Duration": "00:00:03.0500000"
+}
+```
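Both formats time cues to millisecond precision, while the Speech service reports offsets and durations in 100-nanosecond ticks. As an editorial illustration (not part of the commit; the `ToSrtTimestamp` helper is a hypothetical name), here's a minimal C# sketch that formats tick values as an SRT cue timestamp:

```csharp
using System;

public static class SrtTiming
{
    // Convert a tick count (100-nanosecond units, as reported by the Speech service)
    // to an SRT timestamp: hh:mm:ss,fff (a comma precedes the milliseconds).
    public static string ToSrtTimestamp(long ticks)
    {
        TimeSpan t = TimeSpan.FromTicks(ticks);
        return $"{(int)t.TotalHours:d2}:{t.Minutes:d2}:{t.Seconds:d2},{t.Milliseconds:d3}";
    }

    public static void Main()
    {
        long offsetTicks = 1800000;    // 0.18 seconds from the start of the audio
        long durationTicks = 30500000; // 3.05 seconds of recognized speech

        // Prints "00:00:00,180 --> 00:00:03,230", matching the SRT example above.
        Console.WriteLine($"{ToSrtTimestamp(offsetTicks)} --> {ToSrtTimestamp(offsetTicks + durationTicks)}");
    }
}
```

Swapping the comma for a period yields the equivalent WebVTT timestamp.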

 ## Input audio to the Speech service

@@ -61,7 +86,7 @@ For captioning of prerecorded speech or wherever latency isn't a concern, you co

 Real time captioning presents tradeoffs with respect to latency versus accuracy. You could show the text from each `Recognizing` event as soon as possible. However, if you can accept some latency, you can improve the accuracy of the caption by displaying the text from the `Recognized` event. There's also some middle ground, which is referred to as "stable partial results".

-You can request that the Speech service return fewer `Recognizing` events that are more accurate. This is done by setting the `SpeechServiceResponse_StablePartialResultThreshold` property to a value between `0` and `2147483647`. The value that you set is the number of times a word has to be recognized before the Speech service returns a `Recognizing` event. For example, if you set the `SpeechServiceResponse_StablePartialResultThreshold` value to `5`, the Speech service will affirm recognition of a word at least five times before returning the partial results to you with a `Recognizing` event.
+You can request that the Speech service return fewer `Recognizing` events that are more accurate. This is done by setting the `SpeechServiceResponse_StablePartialResultThreshold` property to a value between `0` and `2147483647`. The value that you set is the number of times a word has to be recognized before the Speech service returns a `Recognizing` event. For example, if you set the `SpeechServiceResponse_StablePartialResultThreshold` property value to `5`, the Speech service will affirm recognition of a word at least five times before returning the partial results to you with a `Recognizing` event.

 ::: zone pivot="programming-language-csharp"
 ```csharp
@@ -103,6 +128,11 @@ self.speechConfig!.setPropertyTo(5, by: SPXPropertyId.speechServiceResponseStabl
 speech_config.set_property(property_id = speechsdk.PropertyId.SpeechServiceResponse_StablePartialResultThreshold, value = 5)
 ```
 ::: zone-end
+::: zone pivot="programming-language-cli"
+```console
+spx recognize --file caption.this.mp4 --format any --property SpeechServiceResponse_StablePartialResultThreshold=5 --output vtt file - --output srt file -
+```
+::: zone-end

 Requesting more stable partial results will reduce the "flickering" or changing text, but it can increase latency as you wait for higher confidence results.
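The body of the `csharp` zone falls outside this hunk, so for readers of this excerpt, here's a minimal C# sketch of the same setting, by analogy with the Swift and Python calls shown above (the key and region strings are placeholders):

```csharp
using Microsoft.CognitiveServices.Speech;

class Program
{
    static void Main()
    {
        var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");

        // Require that a word be affirmed five times before it appears in
        // Recognizing (partial) results, trading some latency for stability.
        speechConfig.SetProperty(PropertyId.SpeechServiceResponse_StablePartialResultThreshold, "5");
    }
}
```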

@@ -183,6 +213,11 @@ self.speechConfig!.setProfanityOptionTo(SPXSpeechConfigProfanityOption_Profanity
 speech_config.set_profanity(speechsdk.ProfanityOption.Removed)
 ```
 ::: zone-end
+::: zone pivot="programming-language-cli"
+```console
+spx recognize --file caption.this.mp4 --format any --profanity masked --output vtt file - --output srt file -
+```
+::: zone-end

 The profanity filter is applied to the result `Text` and `MaskedNormalizedForm` properties. The profanity filter isn't applied to the result `LexicalForm` and `NormalizedForm` properties, nor is it applied to the word-level results.
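For C#, a minimal sketch by analogy with the Python `set_profanity` call above, here using the masked option to match the CLI example (`--profanity masked`); the key and region strings are placeholders:

```csharp
using Microsoft.CognitiveServices.Speech;

class Program
{
    static void Main()
    {
        var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");

        // Replace profanity with asterisks in recognition results.
        // Other options are ProfanityOption.Removed and ProfanityOption.Raw.
        speechConfig.SetProfanity(ProfanityOption.Masked);
    }
}
```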

@@ -204,5 +239,5 @@ There are some situations where [training a custom model](custom-speech-overview

 ## Next steps

-* [Get started with speech to text](get-started-speech-to-text.md)
+* [Captioning quickstart](captioning-quickstart.md)
 * [Get speech recognition results](get-speech-recognition-results.md)

articles/cognitive-services/Speech-Service/captioning-quickstart.md

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+---
+title: "Create captions with speech to text quickstart - Speech service"
+titleSuffix: Azure Cognitive Services
+description: In this quickstart, you convert speech to text as captions.
+services: cognitive-services
+author: eric-urban
+manager: nitinme
+ms.service: cognitive-services
+ms.subservice: speech-service
+ms.topic: quickstart
+ms.date: 04/23/2022
+ms.author: eur
+ms.devlang: cpp, csharp
+zone_pivot_groups: programming-languages-speech-sdk-cli
+---
+
+# Quickstart: Create captions with speech to text
+
+::: zone pivot="programming-language-csharp"
+[!INCLUDE [C# include](includes/quickstarts/captioning/csharp.md)]
+::: zone-end
+
+::: zone pivot="programming-language-cpp"
+[!INCLUDE [C++ include](includes/quickstarts/captioning/cpp.md)]
+::: zone-end
+
+::: zone pivot="programming-language-go"
+[!INCLUDE [Go include](includes/quickstarts/captioning/go.md)]
+::: zone-end
+
+::: zone pivot="programming-language-java"
+[!INCLUDE [Java include](includes/quickstarts/captioning/java.md)]
+::: zone-end
+
+::: zone pivot="programming-language-javascript"
+[!INCLUDE [JavaScript include](includes/quickstarts/captioning/javascript.md)]
+::: zone-end
+
+::: zone pivot="programming-language-objectivec"
+[!INCLUDE [ObjectiveC include](includes/quickstarts/captioning/objectivec.md)]
+::: zone-end
+
+::: zone pivot="programming-language-swift"
+[!INCLUDE [Swift include](includes/quickstarts/captioning/swift.md)]
+::: zone-end
+
+::: zone pivot="programming-language-python"
+[!INCLUDE [Python include](./includes/quickstarts/captioning/python.md)]
+::: zone-end
+
+::: zone pivot="programming-language-cli"
+[!INCLUDE [CLI include](includes/quickstarts/captioning/cli.md)]
+::: zone-end
+
+## Next steps
+
+> [!div class="nextstepaction"]
+> [Learn more about speech recognition](how-to-recognize-speech.md)

articles/cognitive-services/Speech-Service/get-speech-recognition-results.md

Lines changed: 5 additions & 1 deletion
@@ -11,7 +11,7 @@ ms.topic: how-to
 ms.date: 03/31/2022
 ms.author: eur
 ms.devlang: cpp, csharp, golang, java, javascript, objective-c, python
-zone_pivot_groups: programming-languages-speech-sdk
+zone_pivot_groups: programming-languages-speech-sdk-cli
 keywords: speech to text, speech to text software
 ---

@@ -49,6 +49,10 @@ keywords: speech to text, speech to text software
 [!INCLUDE [Python include](./includes/how-to/recognize-speech-results/python.md)]
 ::: zone-end

+::: zone pivot="programming-language-cli"
+[!INCLUDE [CLI include](./includes/how-to/recognize-speech-results/cli.md)]
+::: zone-end
+
 ## Next steps

 * [Try the speech to text quickstart](get-started-speech-to-text.md)

articles/cognitive-services/Speech-Service/improve-accuracy-phrase-list.md

Lines changed: 2 additions & 2 deletions
@@ -53,11 +53,11 @@ Now try Speech Studio to see how phrase list can improve recognition accuracy.
 1. Sign in to [Speech Studio](https://speech.microsoft.com/).
 1. Select **Real-time Speech-to-text**.
 1. You test speech recognition by uploading an audio file or recording audio with a microphone. For example, select **record audio with a microphone** and then say "Hi Rehaan, this is Jessie from Contoso bank." Then select the red button to stop recording.
-1. You should see the transcription result in the **Test results** text box. If "Rehaan", "Jesse", or "Contoso" were recognized incorrectly, you can add the terms to a phrase list in the next step.
+1. You should see the transcription result in the **Test results** text box. If "Rehaan", "Jessie", or "Contoso" were recognized incorrectly, you can add the terms to a phrase list in the next step.
 1. Select **Show advanced options** and turn on **Phrase list**.
 1. Enter "Contoso;Jessie;Rehaan" in the phrase list text box. Note that multiple phrases need to be separated by a semicolon.
 :::image type="content" source="./media/custom-speech/phrase-list-after-zoom.png" alt-text="Screenshot of a phrase list applied in Speech Studio." lightbox="./media/custom-speech/phrase-list-after-full.png":::
-1. Use the microphone to test recognition again. Otherwise you can select the retry arrow next to your audio file to re-run your audio. The terms "Rehaan", "Jesse", or "Contoso" should be recognized.
+1. Use the microphone to test recognition again. Otherwise, you can select the retry arrow next to your audio file to rerun your audio. The terms "Rehaan", "Jessie", and "Contoso" should be recognized.

 ## Implement phrase list
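The section body is outside this diff, but as a sketch of what a phrase list can look like in C# with the Speech SDK (assuming the SDK's `PhraseListGrammar` API; the key and region strings are placeholders):

```csharp
using Microsoft.CognitiveServices.Speech;

class Program
{
    static void Main()
    {
        var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
        using var speechRecognizer = new SpeechRecognizer(speechConfig);

        // Bias recognition toward the same terms used in the Speech Studio test.
        var phraseList = PhraseListGrammar.FromRecognizer(speechRecognizer);
        phraseList.AddPhrase("Contoso");
        phraseList.AddPhrase("Jessie");
        phraseList.AddPhrase("Rehaan");
    }
}
```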

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/cli.md

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+---
+author: eric-urban
+ms.service: cognitive-services
+ms.topic: include
+ms.date: 03/31/2022
+ms.author: eur
+ms.custom: devx-track-csharp
+---
+
+[!INCLUDE [Introduction](intro.md)]
+
+## Speech synchronization
+
+You might want to synchronize transcriptions with an audio track, whether it's done in real time or with a prerecording.
+
+The Speech service returns the offset and duration of the recognized speech.
+
+[!INCLUDE [Define offset and duration](define-offset-duration.md)]
+
+The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. `Recognizing` events provide intermediate results that are subject to change while an audio stream is being processed. `Recognized` events provide the final transcribed text once processing of an utterance is completed.
+
+### Recognizing offset and duration
+
+You'll want to synchronize captions with the audio track, whether it's done in real time or with a prerecording. With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word aren't available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.
+
+For example, run the following command to get the offset and duration of the recognized speech:
+
+```console
+spx recognize --file caption.this.mp4 --format any --output each file - @output.each.detailed
+```
+
+Since the `@output.each.detailed` argument was set, the output includes the following column headers:
+
+```console
+audio.input.id event event.sessionid result.reason result.latency result.text result.json
+```
+
+In the `result.json` column, you can find details that include offset and duration for the `Recognizing` and `Recognized` events. Offset and duration are given in 100-nanosecond units, so an offset of `1800000` is 0.18 seconds:
+
+```json
+{
+  "Id": "492574cd8555481a92c22f5ff757ef17",
+  "RecognitionStatus": "Success",
+  "DisplayText": "Welcome to applied Mathematics course 201.",
+  "Offset": 1800000,
+  "Duration": 30500000
+}
+```
+
+For more information, see the Speech CLI [datastore configuration](~/articles/cognitive-services/speech-service/spx-data-store-configuration.md) and [output options](~/articles/cognitive-services/speech-service/spx-output-options.md).
+
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/cpp.md

Lines changed: 11 additions & 0 deletions
@@ -25,6 +25,17 @@ The end of a single utterance is determined by listening for silence at the end.

 With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.

+This code snippet shows how to get the offset and duration from a `Recognizing` event.
+
+```cpp
+speechRecognizer->Recognizing.Connect([](const SpeechRecognitionEventArgs& e)
+{
+    cout << "Recognizing:" << e.Result->Text << std::endl;
+    cout << "Offset in Ticks:" << e.Result->Offset() << std::endl;
+    cout << "Duration in Ticks:" << e.Result->Duration() << std::endl;
+});
+```

 ### Recognized offset and duration
 Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the `Recognized` event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding `SpeechConfig` property as shown here:

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/csharp.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ speechRecognizer.Recognizing += (object sender, SpeechRecognitionEventArgs e) =>
 {
     if (e.Result.Reason == ResultReason.RecognizingSpeech)
     {
-        Console.WriteLine($"RECOGNIZING: Text={e.Result.Text}");
+        Console.WriteLine(String.Format("RECOGNIZING: {0}", e.Result.Text));
         Console.WriteLine(String.Format("Offset in Ticks: {0}", e.Result.OffsetInTicks));
         Console.WriteLine(String.Format("Duration in Ticks: {0}", e.Result.Duration.Ticks));
     }

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/go.md

Lines changed: 11 additions & 0 deletions
@@ -25,6 +25,17 @@ The end of a single utterance is determined by listening for silence at the end.

 With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.

+This code snippet shows how to get the offset and duration from a `Recognizing` event.
+
+```go
+func recognizingHandler(event speech.SpeechRecognitionEventArgs) {
+    defer event.Close()
+    fmt.Println("Recognizing:", event.Result.Text)
+    fmt.Println("Offset in Ticks:", event.Result.Offset)
+    fmt.Println("Duration in Ticks:", event.Result.Duration)
+}
+```

 ### Recognized offset and duration
 Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the `Recognized` event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding `SpeechConfig` property as shown here:

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/java.md

Lines changed: 10 additions & 0 deletions
@@ -25,6 +25,16 @@ The end of a single utterance is determined by listening for silence at the end.

 With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.

+This code snippet shows how to get the offset and duration from a `Recognizing` event.
+
+```java
+speechRecognizer.recognizing.addEventListener((s, e) -> {
+    System.out.println("RECOGNIZING: " + e.getResult().getText());
+    System.out.println("Offset in Ticks: " + e.getResult().getOffset());
+    System.out.println("Duration in Ticks: " + e.getResult().getDuration());
+});
+```

 ### Recognized offset and duration
 Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the `Recognized` event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding `SpeechConfig` property as shown here:

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/javascript.md

Lines changed: 10 additions & 0 deletions
@@ -25,6 +25,16 @@ The end of a single utterance is determined by listening for silence at the end.

 With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.

+This code snippet shows how to get the offset and duration from a `Recognizing` event.
+
+```javascript
+speechRecognizer.recognizing = function (s, e) {
+    console.log("RECOGNIZING: " + e.result.text);
+    console.log("Offset in Ticks: " + e.result.offset);
+    console.log("Duration in Ticks: " + e.result.duration);
+};
+```

 ### Recognized offset and duration
 Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the `Recognized` event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding `SpeechConfig` property as shown here:
