
Commit 18b287b

Merge pull request #195964 from eric-urban/eur/caption-quickstart
init quickstart
2 parents ae25ae9 + 69ba681 commit 18b287b


42 files changed: +1134 −84 lines

articles/cognitive-services/Speech-Service/captioning-concepts.md

Lines changed: 40 additions & 5 deletions
@@ -10,7 +10,7 @@ ms.subservice: speech-service
 ms.topic: conceptual
 ms.date: 04/12/2022
 ms.author: eur
-zone_pivot_groups: programming-languages-speech-sdk
+zone_pivot_groups: programming-languages-speech-sdk-cli
 ---

 # Captioning with speech to text
@@ -27,12 +27,37 @@ The following are aspects to consider when using captioning:
 * Center captions horizontally on the screen, in a large and prominent font.
 * Consider whether to use partial results, when to start displaying captions, and how many words to show at a time.
 * Learn about captioning protocols such as [SMPTE-TT](https://ieeexplore.ieee.org/document/7291854).
-* Consider output formats such as SRT (SubRip Subtitle) and WebVTT (Web Video Text Tracks). These can be loaded onto most video players such as VLC, automatically adding the captions on to your video.
+* Consider output formats such as SRT (SubRip Text) and WebVTT (Web Video Text Tracks). These can be loaded onto most video players, such as VLC, automatically adding the captions to your video.

 > [!TIP]
 > Try the [Azure Video Indexer](/azure/azure-video-indexer/video-indexer-overview) as a demonstration of how you can get captions for videos that you upload.

-Captioning can accompany real time or pre-recorded speech. Whether you're showing captions in real time or with a recording, you can use the [Speech SDK](speech-sdk.md) to recognize speech and get transcriptions. You can also use the [Batch transcription API](batch-transcription.md) for pre-recorded video.
+Captioning can accompany real time or pre-recorded speech. Whether you're showing captions in real time or with a recording, you can use the [Speech SDK](speech-sdk.md) or [Speech CLI](spx-overview.md) to recognize speech and get transcriptions. You can also use the [Batch transcription API](batch-transcription.md) for pre-recorded video.
+
+## Caption output format
+
+The Speech service supports output formats such as SRT (SubRip Text) and WebVTT (Web Video Text Tracks). These can be loaded onto most video players, such as VLC, automatically adding the captions to your video.
+
+The [SRT](https://docs.fileformat.com/video/srt/) (SubRip Text) timespan output format is `hh:mm:ss,fff`.
+
+```srt
+1
+00:00:00,180 --> 00:00:03,230
+Welcome to applied Mathematics course 201.
+```
+
+The [WebVTT](https://www.w3.org/TR/webvtt1/#introduction) (Web Video Text Tracks) timespan output format is `hh:mm:ss.fff`.
+
+```
+WEBVTT
+
+00:00:00.180 --> 00:00:03.230
+Welcome to applied Mathematics course 201.
+{
+  "ResultId": "8e89437b4b9349088a933f8db4ccc263",
+  "Duration": "00:00:03.0500000"
+}
+```
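Both formats time cues to millisecond precision, while the Speech service reports offsets and durations in 100-nanosecond ticks. As an editorial illustration (not part of the commit; the `ToSrtTimestamp` helper is a hypothetical name), here's a minimal C# sketch that formats tick values as an SRT cue timestamp:

```csharp
using System;

public static class SrtTiming
{
    // Convert a tick count (100-nanosecond units, as reported by the Speech service)
    // to an SRT timestamp: hh:mm:ss,fff (a comma precedes the milliseconds).
    public static string ToSrtTimestamp(long ticks)
    {
        TimeSpan t = TimeSpan.FromTicks(ticks);
        return $"{(int)t.TotalHours:d2}:{t.Minutes:d2}:{t.Seconds:d2},{t.Milliseconds:d3}";
    }

    public static void Main()
    {
        long offsetTicks = 1800000;    // 0.18 seconds from the start of the audio
        long durationTicks = 30500000; // 3.05 seconds of recognized speech

        // Prints "00:00:00,180 --> 00:00:03,230", matching the SRT example above.
        Console.WriteLine($"{ToSrtTimestamp(offsetTicks)} --> {ToSrtTimestamp(offsetTicks + durationTicks)}");
    }
}
```

Swapping the comma for a period yields the equivalent WebVTT timestamp.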

 ## Input audio to the Speech service

@@ -61,7 +86,7 @@ For captioning of prerecorded speech or wherever latency isn't a concern, you co

 Real time captioning presents tradeoffs with respect to latency versus accuracy. You could show the text from each `Recognizing` event as soon as possible. However, if you can accept some latency, you can improve the accuracy of the caption by displaying the text from the `Recognized` event. There's also some middle ground, which is referred to as "stable partial results".

-You can request that the Speech service return fewer `Recognizing` events that are more accurate. This is done by setting the `SpeechServiceResponse_StablePartialResultThreshold` property to a value between `0` and `2147483647`. The value that you set is the number of times a word has to be recognized before the Speech service returns a `Recognizing` event. For example, if you set the `SpeechServiceResponse_StablePartialResultThreshold` value to `5`, the Speech service will affirm recognition of a word at least five times before returning the partial results to you with a `Recognizing` event.
+You can request that the Speech service return fewer `Recognizing` events that are more accurate. This is done by setting the `SpeechServiceResponse_StablePartialResultThreshold` property to a value between `0` and `2147483647`. The value that you set is the number of times a word has to be recognized before the Speech service returns a `Recognizing` event. For example, if you set the `SpeechServiceResponse_StablePartialResultThreshold` property value to `5`, the Speech service will affirm recognition of a word at least five times before returning the partial results to you with a `Recognizing` event.

 ::: zone pivot="programming-language-csharp"
 ```csharp
@@ -103,6 +128,11 @@ self.speechConfig!.setPropertyTo(5, by: SPXPropertyId.speechServiceResponseStabl
 speech_config.set_property(property_id = speechsdk.PropertyId.SpeechServiceResponse_StablePartialResultThreshold, value = 5)
 ```
 ::: zone-end
+::: zone pivot="programming-language-cli"
+```console
+spx recognize --file caption.this.mp4 --format any --property SpeechServiceResponse_StablePartialResultThreshold=5 --output vtt file - --output srt file -
+```
+::: zone-end

 Requesting more stable partial results will reduce the "flickering" or changing text, but it can increase latency as you wait for higher confidence results.
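The body of the `csharp` zone falls outside this hunk, so for readers of this excerpt, here's a minimal C# sketch of the same setting, by analogy with the Swift and Python calls shown above (the key and region strings are placeholders):

```csharp
using Microsoft.CognitiveServices.Speech;

class Program
{
    static void Main()
    {
        var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");

        // Require that a word be affirmed five times before it appears in
        // Recognizing (partial) results, trading some latency for stability.
        speechConfig.SetProperty(PropertyId.SpeechServiceResponse_StablePartialResultThreshold, "5");
    }
}
```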

@@ -183,6 +213,11 @@ self.speechConfig!.setProfanityOptionTo(SPXSpeechConfigProfanityOption_Profanity
 speech_config.set_profanity(speechsdk.ProfanityOption.Removed)
 ```
 ::: zone-end
+::: zone pivot="programming-language-cli"
+```console
+spx recognize --file caption.this.mp4 --format any --profanity masked --output vtt file - --output srt file -
+```
+::: zone-end

 The profanity filter is applied to the result `Text` and `MaskedNormalizedForm` properties. The profanity filter isn't applied to the result `LexicalForm` and `NormalizedForm` properties, nor is it applied to the word-level results.
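For C#, a minimal sketch by analogy with the Python `set_profanity` call above, here using the masked option to match the CLI example (`--profanity masked`); the key and region strings are placeholders:

```csharp
using Microsoft.CognitiveServices.Speech;

class Program
{
    static void Main()
    {
        var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");

        // Replace profanity with asterisks in recognition results.
        // Other options are ProfanityOption.Removed and ProfanityOption.Raw.
        speechConfig.SetProfanity(ProfanityOption.Masked);
    }
}
```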

@@ -204,5 +239,5 @@ There are some situations where [training a custom model](custom-speech-overview

 ## Next steps

-* [Get started with speech to text](get-started-speech-to-text.md)
+* [Captioning quickstart](captioning-quickstart.md)
 * [Get speech recognition results](get-speech-recognition-results.md)

articles/cognitive-services/Speech-Service/captioning-quickstart.md

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+---
+title: "Create captions with speech to text quickstart - Speech service"
+titleSuffix: Azure Cognitive Services
+description: In this quickstart, you convert speech to text as captions.
+services: cognitive-services
+author: eric-urban
+manager: nitinme
+ms.service: cognitive-services
+ms.subservice: speech-service
+ms.topic: quickstart
+ms.date: 04/23/2022
+ms.author: eur
+ms.devlang: cpp, csharp
+zone_pivot_groups: programming-languages-speech-sdk-cli
+---
+
+# Quickstart: Create captions with speech to text
+
+::: zone pivot="programming-language-csharp"
+[!INCLUDE [C# include](includes/quickstarts/captioning/csharp.md)]
+::: zone-end
+
+::: zone pivot="programming-language-cpp"
+[!INCLUDE [C++ include](includes/quickstarts/captioning/cpp.md)]
+::: zone-end
+
+::: zone pivot="programming-language-go"
+[!INCLUDE [Go include](includes/quickstarts/captioning/go.md)]
+::: zone-end
+
+::: zone pivot="programming-language-java"
+[!INCLUDE [Java include](includes/quickstarts/captioning/java.md)]
+::: zone-end
+
+::: zone pivot="programming-language-javascript"
+[!INCLUDE [JavaScript include](includes/quickstarts/captioning/javascript.md)]
+::: zone-end
+
+::: zone pivot="programming-language-objectivec"
+[!INCLUDE [ObjectiveC include](includes/quickstarts/captioning/objectivec.md)]
+::: zone-end
+
+::: zone pivot="programming-language-swift"
+[!INCLUDE [Swift include](includes/quickstarts/captioning/swift.md)]
+::: zone-end
+
+::: zone pivot="programming-language-python"
+[!INCLUDE [Python include](./includes/quickstarts/captioning/python.md)]
+::: zone-end
+
+::: zone pivot="programming-language-cli"
+[!INCLUDE [CLI include](includes/quickstarts/captioning/cli.md)]
+::: zone-end
+
+## Next steps
+
+> [!div class="nextstepaction"]
+> [Learn more about speech recognition](how-to-recognize-speech.md)

articles/cognitive-services/Speech-Service/get-speech-recognition-results.md

Lines changed: 5 additions & 1 deletion
@@ -11,7 +11,7 @@ ms.topic: how-to
 ms.date: 03/31/2022
 ms.author: eur
 ms.devlang: cpp, csharp, golang, java, javascript, objective-c, python
-zone_pivot_groups: programming-languages-speech-sdk
+zone_pivot_groups: programming-languages-speech-sdk-cli
 keywords: speech to text, speech to text software
 ---

@@ -49,6 +49,10 @@ keywords: speech to text, speech to text software
 [!INCLUDE [Python include](./includes/how-to/recognize-speech-results/python.md)]
 ::: zone-end

+::: zone pivot="programming-language-cli"
+[!INCLUDE [CLI include](./includes/how-to/recognize-speech-results/cli.md)]
+::: zone-end
+
 ## Next steps

 * [Try the speech to text quickstart](get-started-speech-to-text.md)

articles/cognitive-services/Speech-Service/improve-accuracy-phrase-list.md

Lines changed: 2 additions & 2 deletions
@@ -53,11 +53,11 @@ Now try Speech Studio to see how phrase list can improve recognition accuracy.
 1. Sign in to [Speech Studio](https://speech.microsoft.com/).
 1. Select **Real-time Speech-to-text**.
 1. You test speech recognition by uploading an audio file or recording audio with a microphone. For example, select **record audio with a microphone** and then say "Hi Rehaan, this is Jessie from Contoso bank." Then select the red button to stop recording.
-1. You should see the transcription result in the **Test results** text box. If "Rehaan", "Jesse", or "Contoso" were recognized incorrectly, you can add the terms to a phrase list in the next step.
+1. You should see the transcription result in the **Test results** text box. If "Rehaan", "Jessie", or "Contoso" were recognized incorrectly, you can add the terms to a phrase list in the next step.
 1. Select **Show advanced options** and turn on **Phrase list**.
 1. Enter "Contoso;Jessie;Rehaan" in the phrase list text box. Note that multiple phrases need to be separated by a semicolon.
 :::image type="content" source="./media/custom-speech/phrase-list-after-zoom.png" alt-text="Screenshot of a phrase list applied in Speech Studio." lightbox="./media/custom-speech/phrase-list-after-full.png":::
-1. Use the microphone to test recognition again. Otherwise you can select the retry arrow next to your audio file to re-run your audio. The terms "Rehaan", "Jesse", or "Contoso" should be recognized.
+1. Use the microphone to test recognition again. Otherwise, you can select the retry arrow next to your audio file to rerun your audio. The terms "Rehaan", "Jessie", and "Contoso" should be recognized.

 ## Implement phrase list
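The section body is outside this diff, but as a sketch of what a phrase list can look like in C# with the Speech SDK (assuming the SDK's `PhraseListGrammar` API; the key and region strings are placeholders):

```csharp
using Microsoft.CognitiveServices.Speech;

class Program
{
    static void Main()
    {
        var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
        using var speechRecognizer = new SpeechRecognizer(speechConfig);

        // Bias recognition toward the same terms used in the Speech Studio test.
        var phraseList = PhraseListGrammar.FromRecognizer(speechRecognizer);
        phraseList.AddPhrase("Contoso");
        phraseList.AddPhrase("Jessie");
        phraseList.AddPhrase("Rehaan");
    }
}
```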

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/cli.md

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+---
+author: eric-urban
+ms.service: cognitive-services
+ms.topic: include
+ms.date: 03/31/2022
+ms.author: eur
+ms.custom: devx-track-csharp
+---
+
+[!INCLUDE [Introduction](intro.md)]
+
+## Speech synchronization
+
+You might want to synchronize transcriptions with an audio track, whether it's done in real time or with a prerecording.
+
+The Speech service returns the offset and duration of the recognized speech.
+
+[!INCLUDE [Define offset and duration](define-offset-duration.md)]
+
+The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. `Recognizing` events provide intermediate results that are subject to change while an audio stream is being processed. `Recognized` events provide the final transcribed text once processing of an utterance is completed.
+
+### Recognizing offset and duration
+
+You'll want to synchronize captions with the audio track, whether it's done in real time or with a prerecording. With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word aren't available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.
+
+For example, run the following command to get the offset and duration of the recognized speech:
+
+```console
+spx recognize --file caption.this.mp4 --format any --output each file - @output.each.detailed
+```
+
+Since the `@output.each.detailed` argument was set, the output includes the following column headers:
+
+```console
+audio.input.id event event.sessionid result.reason result.latency result.text result.json
+```
+
+In the `result.json` column, you can find details that include offset and duration for the `Recognizing` and `Recognized` events. Offset and duration are given in 100-nanosecond units, so an offset of `1800000` is 0.18 seconds:
+
+```json
+{
+  "Id": "492574cd8555481a92c22f5ff757ef17",
+  "RecognitionStatus": "Success",
+  "DisplayText": "Welcome to applied Mathematics course 201.",
+  "Offset": 1800000,
+  "Duration": 30500000
+}
+```
+
+For more information, see the Speech CLI [datastore configuration](~/articles/cognitive-services/speech-service/spx-data-store-configuration.md) and [output options](~/articles/cognitive-services/speech-service/spx-output-options.md).
+
+[!INCLUDE [Example offset and duration](example-offset-duration.md)]

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/cpp.md

Lines changed: 11 additions & 0 deletions
@@ -25,6 +25,17 @@ The end of a single utterance is determined by listening for silence at the end.

 With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.

+This code snippet shows how to get the offset and duration from a `Recognizing` event.
+
+```cpp
+speechRecognizer->Recognizing.Connect([](const SpeechRecognitionEventArgs& e)
+{
+    cout << "Recognizing:" << e.Result->Text << std::endl;
+    cout << "Offset in Ticks:" << e.Result->Offset() << std::endl;
+    cout << "Duration in Ticks:" << e.Result->Duration() << std::endl;
+});
+```

 ### Recognized offset and duration
 Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the `Recognized` event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding `SpeechConfig` property as shown here:

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/csharp.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ speechRecognizer.Recognizing += (object sender, SpeechRecognitionEventArgs e) =>
 {
     if (e.Result.Reason == ResultReason.RecognizingSpeech)
     {
-        Console.WriteLine($"RECOGNIZING: Text={e.Result.Text}");
+        Console.WriteLine(String.Format("RECOGNIZING: {0}", e.Result.Text));
         Console.WriteLine(String.Format("Offset in Ticks: {0}", e.Result.OffsetInTicks));
         Console.WriteLine(String.Format("Duration in Ticks: {0}", e.Result.Duration.Ticks));
     }

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/go.md

Lines changed: 11 additions & 0 deletions
@@ -25,6 +25,17 @@ The end of a single utterance is determined by listening for silence at the end.

 With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.

+This code snippet shows how to get the offset and duration from a `Recognizing` event.
+
+```go
+func recognizingHandler(event speech.SpeechRecognitionEventArgs) {
+    defer event.Close()
+    fmt.Println("Recognizing:", event.Result.Text)
+    fmt.Println("Offset in Ticks:", event.Result.Offset)
+    fmt.Println("Duration in Ticks:", event.Result.Duration)
+}
+```

 ### Recognized offset and duration
 Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the `Recognized` event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding `SpeechConfig` property as shown here:

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/java.md

Lines changed: 10 additions & 0 deletions
@@ -25,6 +25,16 @@ The end of a single utterance is determined by listening for silence at the end.

 With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.

+This code snippet shows how to get the offset and duration from a `Recognizing` event.
+
+```java
+speechRecognizer.recognizing.addEventListener((s, e) -> {
+    System.out.println("RECOGNIZING: " + e.getResult().getText());
+    System.out.println("Offset in Ticks: " + e.getResult().getOffset());
+    System.out.println("Duration in Ticks: " + e.getResult().getDuration());
+});
+```

 ### Recognized offset and duration
 Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the `Recognized` event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding `SpeechConfig` property as shown here:

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech-results/javascript.md

Lines changed: 10 additions & 0 deletions
@@ -25,6 +25,16 @@ The end of a single utterance is determined by listening for silence at the end.

 With the `Recognizing` event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each `Recognizing` event comes with a textual estimate of the speech recognized so far.

+This code snippet shows how to get the offset and duration from a `Recognizing` event.
+
+```javascript
+speechRecognizer.recognizing = function (s, e) {
+    console.log("RECOGNIZING: " + e.result.text);
+    console.log("Offset in Ticks: " + e.result.offset);
+    console.log("Duration in Ticks: " + e.result.duration);
+};
+```

 ### Recognized offset and duration
 Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the `Recognized` event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding `SpeechConfig` property as shown here:
