
Commit 7dd8a3b

Merge pull request #210756 from eric-urban/eur/display-text-format
Display text formatting for STT
2 parents 951dae4 + 39df9d1 commit 7dd8a3b

File tree

10 files changed

+187
-114
lines changed

articles/cognitive-services/Speech-Service/captioning-concepts.md

Lines changed: 3 additions & 62 deletions
Original file line numberDiff line numberDiff line change
@@ -40,6 +40,9 @@ Captioning can accompany real time or pre-recorded speech. Whether you're showin
4040

4141
The Speech service supports output formats such as SRT (SubRip Text) and WebVTT (Web Video Text Tracks). These can be loaded onto most video players such as VLC, automatically adding the captions on to your video.
4242

43+
> [!TIP]
44+
> The Speech service provides [profanity filter](display-text-format.md#profanity-filter) options. You can specify whether to mask, remove, or show profanity.
45+
4346
The [SRT](https://docs.fileformat.com/video/srt/) (SubRip Text) timespan output format is `hh:mm:ss,fff`.
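
As a minimal sketch of the `hh:mm:ss,fff` layout, the following Python helper formats a `datetime.timedelta` offset as an SRT timespan. The function name `format_srt_timespan` is hypothetical and isn't part of the Speech SDK or Speech CLI; it only illustrates the zero-padded fields and the comma before the milliseconds.

```python
from datetime import timedelta


def format_srt_timespan(offset: timedelta) -> str:
    """Format a non-negative offset as hh:mm:ss,fff (SRT style).

    Hypothetical helper for illustration; not a Speech SDK function.
    """
    # Work in whole milliseconds to avoid floating-point rounding.
    total_ms = offset // timedelta(milliseconds=1)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    seconds, milliseconds = divmod(remainder, 1_000)
    # SRT separates seconds and milliseconds with a comma.
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"


print(format_srt_timespan(timedelta(seconds=3.5)))
```

A caption start or end offset can be passed to the helper in the same way.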
4447

4548
```srt
@@ -161,68 +164,6 @@ RECOGNIZING: Text=welcome to applied mathematics
161164
RECOGNIZED: Text=Welcome to applied Mathematics course 201.
162165
```
163166

164-
## Profanity filter
165-
166-
You can specify whether to mask, remove, or show profanity in recognition results.
167-
168-
> [!NOTE]
169-
> Microsoft also reserves the right to mask or remove any word that is deemed inappropriate. Such words will not be returned by the Speech service, whether or not you enabled profanity filtering.
170-
171-
The profanity filter options are:
172-
- `Masked`: Replaces letters in profane words with asterisk (*) characters. This is the default option.
173-
- `Raw`: Include the profane words verbatim.
174-
- `Removed`: Removes profane words.
175-
176-
For example, to remove profane words from the speech recognition result, set the profanity filter to `Removed` as shown here:
177-
178-
::: zone pivot="programming-language-csharp"
179-
```csharp
180-
speechConfig.SetProfanity(ProfanityOption.Removed);
181-
```
182-
::: zone-end
183-
::: zone pivot="programming-language-cpp"
184-
```cpp
185-
speechConfig->SetProfanity(ProfanityOption::Removed);
186-
```
187-
::: zone-end
188-
::: zone pivot="programming-language-go"
189-
```go
190-
speechConfig.SetProfanity(common.Removed)
191-
```
192-
::: zone-end
193-
::: zone pivot="programming-language-java"
194-
```java
195-
speechConfig.setProfanity(ProfanityOption.Removed);
196-
```
197-
::: zone-end
198-
::: zone pivot="programming-language-javascript"
199-
```javascript
200-
speechConfig.setProfanity(sdk.ProfanityOption.Removed);
201-
```
202-
::: zone-end
203-
::: zone pivot="programming-language-objectivec"
204-
```objective-c
205-
[self.speechConfig setProfanityOptionTo:SPXSpeechConfigProfanityOption.SPXSpeechConfigProfanityOption_ProfanityRemoved];
206-
```
207-
::: zone-end
208-
::: zone pivot="programming-language-swift"
209-
```swift
210-
self.speechConfig!.setProfanityOptionTo(SPXSpeechConfigProfanityOption_ProfanityRemoved)
211-
```
212-
::: zone-end
213-
::: zone pivot="programming-language-python"
214-
```python
215-
speech_config.set_profanity(speechsdk.ProfanityOption.Removed)
216-
```
217-
::: zone-end
218-
::: zone pivot="programming-language-cli"
219-
```console
220-
spx recognize --file caption.this.mp4 --format any --profanity masked --output vtt file - --output srt file -
221-
```
222-
::: zone-end
223-
224-
Profanity filter is applied to the result `Text` and `MaskedNormalizedForm` properties. Profanity filter isn't applied to the result `LexicalForm` and `NormalizedForm` properties. Neither is the filter applied to the word level results.
225-
226167
## Language identification
227168

228169
If the language in the audio could change, use continuous [language identification](language-identification.md). Language identification is used to identify languages spoken in audio when compared against a list of [supported languages](language-support.md?tabs=language-identification). You provide up to 10 candidate languages, at least one of which is expected to be in the audio. The Speech service returns the most likely language in the audio.

articles/cognitive-services/Speech-Service/display-text-format.md

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
1+
---
2+
title: Display text formatting with speech to text - Speech service
3+
titleSuffix: Azure Cognitive Services
4+
description: An overview of key concepts for display text formatting with speech to text.
5+
services: cognitive-services
6+
author: eric-urban
7+
manager: nitinme
8+
ms.service: cognitive-services
9+
ms.subservice: speech-service
10+
ms.topic: conceptual
11+
ms.date: 09/19/2022
12+
ms.author: eur
13+
zone_pivot_groups: programming-languages-speech-sdk-cli
14+
---
15+
16+
# Display text formatting with speech to text
17+
18+
Speech-to-text offers an array of formatting features to ensure that the transcribed text is clear and legible. Below is an overview of these features and how each one is used to improve the overall clarity of the final text output.
19+
20+
## ITN
21+
22+
Inverse Text Normalization (ITN) is a process that converts spoken words into their written form. For example, the spoken word "four" is converted to the written form "4". This process is performed by the speech-to-text service and isn't configurable. Some of the supported text formats include dates, times, decimals, currencies, addresses, emails, and phone numbers. You can speak naturally, and the service formats text as expected. The following table shows the ITN rules that are applied to the text output.
23+
24+
|Recognized speech|Display text|
25+
|---|---|
26+
|`that will cost nine hundred dollars`|`That will cost $900.`|
27+
|`my phone number is one eight hundred, four five six, eight nine ten`|`My phone number is 1-800-456-8910.`|
28+
|`the time is six forty five p m`|`The time is 6:45 PM.`|
29+
|`I live on thirty five lexington avenue`|`I live on 35 Lexington Ave.`|
30+
|`the answer is six point five`|`The answer is 6.5.`|
31+
|`send it to support at help dot com`|`Send it to support@help.com.`|
32+
33+
## Capitalization
34+
35+
Speech-to-text models recognize words that should be capitalized to improve readability, accuracy, and grammar. For example, the Speech service will automatically capitalize proper nouns and words at the beginning of a sentence. Some examples are shown in this table.
36+
37+
|Recognized speech|Display text|
38+
|---|---|
39+
|`i got an x l t shirt`|`I got an XL t-shirt.`|
40+
|`my name is jennifer smith`|`My name is Jennifer Smith.`|
41+
|`i want to visit new york city`|`I want to visit New York City.`|
42+
43+
## Disfluency removal
44+
45+
When speaking, it's common for someone to stutter, duplicate words, and say filler words like "uhm" or "uh". Speech-to-text can recognize such disfluencies and remove them from the display text. Disfluency removal is great for transcribing live unscripted speeches to read them back later. Some examples are shown in this table.
46+
47+
|Recognized speech|Display text|
48+
|---|---|
49+
|`i uh said that we can go to the uhmm movies`|`I said that we can go to the movies.`|
50+
|`its its not that big of uhm a deal`|`It's not that big of a deal.`|
51+
|`umm i think tomorrow should work`|`I think tomorrow should work.`|
52+
53+
## Punctuation
54+
55+
Speech-to-text automatically punctuates your text to improve clarity. Punctuation is helpful for reading back call or conversation transcriptions. Some examples are shown in this table.
56+
57+
|Recognized speech|Display text|
58+
|---|---|
59+
|`how are you`|`How are you?`|
60+
|`we can go to the mall park or beach`|`We can go to the mall, park, or beach.`|
61+
62+
When you're using speech-to-text with continuous recognition, you can configure the Speech service to recognize explicit punctuation marks. Then you can speak punctuation aloud in order to make your text more legible. This is especially useful in a situation where you want to use complex punctuation without having to merge it later. Some examples are shown in this table.
63+
64+
|Recognized speech|Display text|
65+
|---|---|
66+
|`they entered the room dot dot dot`|`They entered the room...`|
67+
|`i heart emoji you period`|`I <3 you.`|
68+
|`the options are apple forward slash banana forward slash orange period`|`The options are apple/banana/orange.`|
69+
|`are you sure question mark`|`Are you sure?`|
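
To make the mapping in the preceding table concrete, here's a toy Python sketch that turns spoken punctuation phrases into symbols. It's only an illustration under stated assumptions: the `SPOKEN_PUNCTUATION` table and the `apply_spoken_punctuation` function are hypothetical, cover just a few phrases, and don't represent how the Speech service implements dictation.

```python
import re

# Hypothetical phrase table for illustration; not the Speech service's rules.
SPOKEN_PUNCTUATION = {
    "question mark": "?",
    "period": ".",
    "forward slash": "/",
    "dot dot dot": "...",
}

# Symbols that join the words on both sides (no space before or after).
JOINING_SYMBOLS = {"/"}


def apply_spoken_punctuation(lexical: str) -> str:
    """Replace spoken punctuation phrases with symbols and capitalize the result."""
    text = lexical
    # Try longer phrases first so a short phrase can't split a longer one.
    for phrase in sorted(SPOKEN_PUNCTUATION, key=len, reverse=True):
        symbol = SPOKEN_PUNCTUATION[phrase]
        # Joining symbols also consume the whitespace that follows the phrase.
        trailing = r"\s+" if symbol in JOINING_SYMBOLS else r"\b"
        text = re.sub(rf"\s+{re.escape(phrase)}{trailing}", symbol, text)
    return text[:1].upper() + text[1:]


print(apply_spoken_punctuation("are you sure question mark"))
```

In practice you don't write this mapping yourself. The service applies it when dictation mode is enabled.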
70+
71+
Use the Speech SDK to enable dictation mode when you're using speech-to-text with continuous recognition. This mode will cause the speech configuration instance to interpret word descriptions of sentence structures such as punctuation.
72+
73+
::: zone pivot="programming-language-csharp"
74+
```csharp
75+
speechConfig.EnableDictation();
76+
```
77+
::: zone-end
78+
::: zone pivot="programming-language-cpp"
79+
```cpp
80+
speechConfig->EnableDictation();
81+
```
82+
::: zone-end
83+
::: zone pivot="programming-language-go"
84+
```go
85+
speechConfig.EnableDictation()
86+
```
87+
::: zone-end
88+
::: zone pivot="programming-language-java"
89+
```java
90+
speechConfig.enableDictation();
91+
```
92+
::: zone-end
93+
::: zone pivot="programming-language-javascript"
94+
```javascript
95+
speechConfig.enableDictation();
96+
```
97+
::: zone-end
98+
::: zone pivot="programming-language-objectivec"
99+
```objective-c
100+
[self.speechConfig enableDictation];
101+
```
102+
::: zone-end
103+
::: zone pivot="programming-language-swift"
104+
```swift
105+
self.speechConfig!.enableDictation()
106+
```
107+
::: zone-end
108+
::: zone pivot="programming-language-python"
109+
```python
110+
speech_config.enable_dictation()
111+
```
112+
::: zone-end
113+
114+
## Profanity filter
115+
116+
You can specify whether to mask, remove, or show profanity in the final transcribed text. Masking replaces profane words with asterisk (*) characters so that you can keep the original sentiment of your text while making it more appropriate for certain situations.
117+
118+
> [!NOTE]
119+
> Microsoft also reserves the right to mask or remove any word that is deemed inappropriate. Such words will not be returned by the Speech service, whether or not you enabled profanity filtering.
120+
121+
The profanity filter options are:
122+
- `Masked`: Replaces letters in profane words with asterisk (*) characters. Masked is the default option.
123+
- `Raw`: Includes the profane words verbatim.
124+
- `Removed`: Removes profane words.
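
The difference between the three options can be sketched with a toy Python filter. Everything here is an assumption made for illustration: `ProfanityMode`, `filter_profanity`, and the caller-supplied `blocklist` are hypothetical names, and the Speech service's own word list and matching rules aren't shown.

```python
import re
from enum import Enum


class ProfanityMode(Enum):
    """Hypothetical stand-in for the SDK's profanity options."""
    MASKED = "masked"
    RAW = "raw"
    REMOVED = "removed"


def filter_profanity(text: str, blocklist: set, mode: ProfanityMode) -> str:
    """Apply a toy profanity filter to text using a caller-supplied blocklist."""
    if mode is ProfanityMode.RAW or not blocklist:
        # Raw: return the text verbatim.
        return text
    # Match whole words only, ignoring case.
    words = "|".join(re.escape(word) for word in sorted(blocklist))
    pattern = re.compile(rf"\b({words})\b", re.IGNORECASE)
    if mode is ProfanityMode.MASKED:
        # Masked: replace every letter of a matched word with an asterisk.
        return pattern.sub(lambda match: "*" * len(match.group(0)), text)
    # Removed: drop matched words, then collapse the leftover whitespace.
    return re.sub(r"\s{2,}", " ", pattern.sub("", text)).strip()


# "darn" is a placeholder entry used only for this example.
print(filter_profanity("That darn cat", {"darn"}, ProfanityMode.MASKED))
```

With the same input, `ProfanityMode.REMOVED` drops the placeholder word and `ProfanityMode.RAW` returns the text unchanged.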
125+
126+
For example, to remove profane words from the speech recognition result, set the profanity filter to `Removed` as shown here:
127+
128+
::: zone pivot="programming-language-csharp"
129+
```csharp
130+
speechConfig.SetProfanity(ProfanityOption.Removed);
131+
```
132+
::: zone-end
133+
::: zone pivot="programming-language-cpp"
134+
```cpp
135+
speechConfig->SetProfanity(ProfanityOption::Removed);
136+
```
137+
::: zone-end
138+
::: zone pivot="programming-language-go"
139+
```go
140+
speechConfig.SetProfanity(common.Removed)
141+
```
142+
::: zone-end
143+
::: zone pivot="programming-language-java"
144+
```java
145+
speechConfig.setProfanity(ProfanityOption.Removed);
146+
```
147+
::: zone-end
148+
::: zone pivot="programming-language-javascript"
149+
```javascript
150+
speechConfig.setProfanity(sdk.ProfanityOption.Removed);
151+
```
152+
::: zone-end
153+
::: zone pivot="programming-language-objectivec"
154+
```objective-c
155+
[self.speechConfig setProfanityOptionTo:SPXSpeechConfigProfanityOption_ProfanityRemoved];
156+
```
157+
::: zone-end
158+
::: zone pivot="programming-language-swift"
159+
```swift
160+
self.speechConfig!.setProfanityOptionTo(SPXSpeechConfigProfanityOption_ProfanityRemoved)
161+
```
162+
::: zone-end
163+
::: zone pivot="programming-language-python"
164+
```python
165+
speech_config.set_profanity(speechsdk.ProfanityOption.Removed)
166+
```
167+
::: zone-end
168+
::: zone pivot="programming-language-cli"
169+
```console
170+
spx recognize --file caption.this.mp4 --format any --profanity masked --output vtt file - --output srt file -
171+
```
172+
::: zone-end
173+
174+
The profanity filter is applied to the result `Text` and `MaskedNormalizedForm` properties. It isn't applied to the result `LexicalForm` and `NormalizedForm` properties, or to word-level results.
175+
176+
177+
## Next steps
178+
179+
* [Speech-to-text quickstart](get-started-speech-to-text.md)
180+
* [Get speech recognition results](get-speech-recognition-results.md)

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech/cpp.md

Lines changed: 0 additions & 10 deletions
@@ -183,16 +183,6 @@ recognitionEnd.get_future().get();
183183
recognizer->StopContinuousRecognitionAsync().get();
184184
```
185185

186-
## Dictation mode
187-
188-
When you're using continuous recognition, you can enable dictation processing by using the corresponding function. This mode will cause the speech configuration instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".
189-
190-
To enable dictation mode, use the [`EnableDictation`](/cpp/cognitive-services/speech/speechconfig#enabledictation) method on [`SpeechConfig`](/cpp/cognitive-services/speech/speechconfig):
191-
192-
```cpp
193-
config->EnableDictation();
194-
```
195-
196186
## Change the source language
197187

198188
A common task for speech recognition is specifying the input (or source) language. The following example shows how you would change the input language to German. In your code, find your [`SpeechConfig`](/cpp/cognitive-services/speech/speechconfig) instance and add this line directly below it:

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech/csharp.md

Lines changed: 0 additions & 10 deletions
@@ -258,16 +258,6 @@ Task.WaitAny(new[] { stopRecognition.Task });
258258
// await recognizer.StopContinuousRecognitionAsync();
259259
```
260260

261-
### Dictation mode
262-
263-
When you're using continuous recognition, you can enable dictation processing by using the corresponding function. This mode will cause the speech configuration instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".
264-
265-
To enable dictation mode, use the [`EnableDictation`](/dotnet/api/microsoft.cognitiveservices.speech.speechconfig.enabledictation) method on [`SpeechConfig`](/dotnet/api/microsoft.cognitiveservices.speech.speechconfig):
266-
267-
```csharp
268-
speechConfig.EnableDictation();
269-
```
270-
271261
## Change the source language
272262

273263
A common task for speech recognition is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, find your [`SpeechConfig`](/dotnet/api/microsoft.cognitiveservices.speech.speechconfig) instance and add this line directly below it:

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech/java.md

Lines changed: 0 additions & 10 deletions
@@ -201,16 +201,6 @@ stopTranslationWithFileSemaphore.acquire();
201201
recognizer.stopContinuousRecognitionAsync().get();
202202
```
203203

204-
### Dictation mode
205-
206-
When you're using continuous recognition, you can enable dictation processing by using the corresponding function. This mode will cause the speech configuration instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".
207-
208-
To enable dictation mode, use the [`enableDictation`](/java/api/com.microsoft.cognitiveservices.speech.speechconfig.enabledictation) method on [`SpeechConfig`](/java/api/com.microsoft.cognitiveservices.speech.speechconfig):
209-
210-
```java
211-
config.enableDictation();
212-
```
213-
214204
## Change the source language
215205

216206
A common task for speech recognition is specifying the input (or source) language. The following example shows how you would change the input language to French. In your code, find your [`SpeechConfig`](/java/api/com.microsoft.cognitiveservices.speech.speechconfig) instance and add this line directly below it:

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech/javascript.md

Lines changed: 0 additions & 10 deletions
@@ -181,16 +181,6 @@ recognizer.startContinuousRecognitionAsync();
181181
// recognizer.stopContinuousRecognitionAsync();
182182
```
183183

184-
### Dictation mode
185-
186-
When you're using continuous recognition, you can enable dictation processing by using the corresponding function. This mode will cause the speech configuration instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".
187-
188-
To enable dictation mode, use the [`enableDictation`](/javascript/api/microsoft-cognitiveservices-speech-sdk/speechconfig#enabledictation--) method on [`SpeechConfig`](/javascript/api/microsoft-cognitiveservices-speech-sdk/speechconfig):
189-
190-
```javascript
191-
speechConfig.enableDictation();
192-
```
193-
194184
## Change the source language
195185

196186
A common task for speech recognition is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, find your [`SpeechConfig`](/javascript/api/microsoft-cognitiveservices-speech-sdk/speechconfig) instance and add this line directly below it:

articles/cognitive-services/Speech-Service/includes/how-to/recognize-speech/python.md

Lines changed: 0 additions & 10 deletions
@@ -148,16 +148,6 @@ while not done:
148148
time.sleep(.5)
149149
```
150150

151-
### Dictation mode
152-
153-
When you're using continuous recognition, you can enable dictation processing by using the corresponding function. This mode will cause the speech configuration instance to interpret word descriptions of sentence structures such as punctuation. For example, the utterance "Do you live in town question mark" would be interpreted as the text "Do you live in town?".
154-
155-
To enable dictation mode, use the [`enable_dictation()`](/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.speechconfig#enable-dictation--) method on [`SpeechConfig`](/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.speechconfig):
156-
157-
```Python
158-
SpeechConfig.enable_dictation()
159-
```
160-
161151
## Change the source language
162152

163153
A common task for speech recognition is specifying the input (or source) language. The following example shows how you would change the input language to German. In your code, find your `SpeechConfig` instance and add this line directly below it:

articles/cognitive-services/Speech-Service/includes/quickstarts/captioning/cli.md

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ Here are details about the optional arguments from the previous command:
5454
- `--output vtt file -` and `--output srt file -`: Outputs WebVTT and SRT captions to standard output. For more information about SRT and WebVTT caption file formats, see [Caption output format](~/articles/cognitive-services/speech-service/captioning-concepts.md#caption-output-format). For more information about the `--output` argument, see [Speech CLI output options](~/articles/cognitive-services/speech-service/spx-output-options.md).
5555
- `@output.each.detailed`: Outputs event results with text, offset, and duration. For more information, see [Get speech recognition results](~/articles/cognitive-services/speech-service/get-speech-recognition-results.md).
5656
- `--property SpeechServiceResponse_StablePartialResultThreshold=5`: You can request that the Speech service return fewer `Recognizing` events that are more accurate. In this example, the Speech service must affirm recognition of a word at least five times before returning the partial results to you. For more information, see [Get partial results](~/articles/cognitive-services/speech-service/captioning-concepts.md#get-partial-results) concepts.
57-
- `--profanity masked`: You can specify whether to mask, remove, or show profanity in recognition results. For more information, see [Profanity filter](~/articles/cognitive-services/speech-service/captioning-concepts.md#profanity-filter) concepts.
57+
- `--profanity masked`: You can specify whether to mask, remove, or show profanity in recognition results. For more information, see [Profanity filter](~/articles/cognitive-services/speech-service/display-text-format.md#profanity-filter) concepts.
5858
- `--phrases "Contoso;Jessie;Rehaan"`: You can specify a list of phrases to be recognized, such as Contoso, Jessie, and Rehaan. For more information, see [Improve recognition with phrase list](~/articles/cognitive-services/speech-service/improve-accuracy-phrase-list.md).
5959

6060
## Clean up resources

articles/cognitive-services/Speech-Service/includes/quickstarts/captioning/usage-arguments.md

Lines changed: 1 addition & 1 deletion
@@ -34,5 +34,5 @@ Output options include:
3434
- `--output FILE`: Output captions to the specified `file`. This flag is required.
3535
- `--srt`: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see [Caption output format](~/articles/cognitive-services/speech-service/captioning-concepts.md#caption-output-format).
3636
- `--quiet`: Suppress console output, except errors.
37-
- `--profanity OPTION`: Valid values: raw, remove, mask. For more information, see [Profanity filter](~/articles/cognitive-services/speech-service/captioning-concepts.md#profanity-filter) concepts.
37+
- `--profanity OPTION`: Valid values: raw, remove, mask. For more information, see [Profanity filter](~/articles/cognitive-services/speech-service/display-text-format.md#profanity-filter) concepts.
3838
- `--threshold NUMBER`: Set stable partial result threshold. The default value with this code example is `3`. For more information, see [Get partial results](~/articles/cognitive-services/speech-service/captioning-concepts.md#get-partial-results) concepts.

articles/cognitive-services/Speech-Service/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -58,6 +58,8 @@ items:
5858
href: how-to-recognize-speech.md
5959
- name: Get speech recognition results
6060
href: get-speech-recognition-results.md
61+
- name: Display text formatting
62+
href: display-text-format.md
6163
- name: How to use batch transcription
6264
href: batch-transcription.md
6365
- name: Improve recognition with Custom Speech
