Commit dd7194b

Merge pull request #261235 from eric-urban/eur/display-text-format
custom display text format
2 parents 3e23b47 + 14de8a5 commit dd7194b

14 files changed

+344
-9
lines changed

Lines changed: 217 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,217 @@
1+
---
2+
title: "Display text format training data - Speech service"
3+
titleSuffix: Azure AI services
4+
description: Learn about how to prepare display text format training data for custom speech.
5+
author: eric-urban
6+
manager: nitinme
7+
ms.service: azure-ai-speech
8+
ms.topic: how-to
9+
ms.date: 12/14/2023
10+
ms.author: eur
11+
---
12+
13+
# How to prepare display text format training data for custom speech
14+
15+
Azure AI Speech service can be viewed as two components: speech recognition and display text formatting. Speech recognition transcribes audio to lexical text, and then the lexical text is transformed to display text.
16+
17+
:::image type="content" source="./media/custom-speech/speech-recognition-to-display-text.jpg" alt-text="Diagram of the flow of speech recognition to lexical to display text." lightbox="./media/custom-speech/speech-recognition-to-display-text.jpg":::
18+
19+
These are the locales that support the display text format feature: da-DK, de-DE, en-AU, en-CA, en-GB, en-HK, en-IE, en-IN, en-NG, en-NZ, en-PH, en-SG, en-US, es-ES, es-MX, fi-FI, fr-CA, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, nb-NO, nl-NL, pl-PL, pt-BR, pt-PT, sv-SE, tr-TR, zh-CN, zh-HK.
20+
21+
## Default display text formatting
22+
23+
The display text pipeline is composed of a sequence of display format builders. Each builder corresponds to a display format task such as ITN, capitalization, or profanity filtering.
24+
25+
- **Inverse Text Normalization (ITN)** - Converts spoken-form numbers to their display form. For example: `"I spend twenty dollars" -> "I spend $20"`
26+
- **Capitalization** - Capitalizes entity names, acronyms, and the first letter of a sentence. For example: `"she is from microsoft" -> "She is from Microsoft"`
27+
- **Profanity filtering** - Masks or removes profanity from a sentence. For example, assuming "abcd" is a profanity word, profanity masking replaces it with asterisks: `"I never say abcd" -> "I never say ****"`
28+
29+
The base builders of the display text pipeline are maintained by Microsoft for the general purpose display processing tasks. You get the base builders by default when you use the Speech service. For more information about out-of-the-box formatting, see [Display text format](./display-text-format.md).
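
To make the idea of a builder sequence concrete, here's a minimal Python sketch of three toy builders chained one after another. It's only an illustration: the lookup tables, the function names, and the order in which the builders run are assumptions made for this example, not the rules or the implementation that the Speech service uses.

```python
import re

# Toy lookup tables. These are illustrative assumptions, not the rules
# that the Speech service actually applies.
SPOKEN_NUMBERS = {"twenty dollars": "$20"}
ENTITY_NAMES = {"microsoft": "Microsoft"}
PROFANITY = {"abcd"}


def itn(text: str) -> str:
    """Replace spoken-form number phrases with their display form."""
    for spoken, display in SPOKEN_NUMBERS.items():
        text = re.sub(rf"\b{re.escape(spoken)}\b", display, text)
    return text


def capitalize(text: str) -> str:
    """Capitalize known entity names and the first letter of the sentence."""
    for lower, proper in ENTITY_NAMES.items():
        text = re.sub(rf"\b{re.escape(lower)}\b", proper, text)
    return text[:1].upper() + text[1:]


def mask_profanity(text: str) -> str:
    """Mask each profanity word with asterisks of the same length."""
    for word in PROFANITY:
        text = re.sub(rf"\b{re.escape(word)}\b", "*" * len(word), text,
                      flags=re.IGNORECASE)
    return text


def to_display(lexical: str) -> str:
    """Run the builders in sequence; each one consumes the previous output."""
    for builder in (itn, capitalize, mask_profanity):
        lexical = builder(lexical)
    return lexical


print(to_display("i spend twenty dollars"))
```

Each builder takes text and returns text, so adding a custom step to this toy pipeline means adding one more function to the sequence.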
30+
31+
## Custom display text formatting
32+
33+
Besides the base builders that Microsoft maintains for general purpose display processing tasks, you can define custom display text formatting rules to tailor the display text formatting pipeline to your specific scenarios. You define these rules in a custom display text formatting file.
34+
35+
- [Custom ITN](#custom-itn) - Extend the functionality of base ITN by applying a rule-based custom ITN model that you provide.
36+
- [Custom rewrite](#custom-rewrite) - Rewrite one phrase to another based on a rule-based model that you provide.
37+
- [Custom profanity filtering](#custom-profanity) - Perform profanity handling based on a profanity word list that you provide.
38+
39+
The order of the display text formatting pipeline is illustrated in this diagram.
40+
41+
:::image type="content" source="./media/custom-speech/display-text-pipeline.jpg" alt-text="Diagram of the display format builders." lightbox="./media/custom-speech/display-text-pipeline.jpg":::
42+
43+
## Custom ITN
44+
45+
The philosophy of pattern-based custom ITN is that you specify the final output that you want to see. The Speech service figures out how the words might be spoken and maps the predicted spoken expressions to the specified output format.
46+
47+
A custom ITN model is built from a set of ITN rules. An ITN rule is a regular-expression-like pattern string that describes:
48+
49+
* A matching pattern of the input string
50+
* The desired format of the output string
51+
52+
The default ITN rules provided by Microsoft are applied first. The output of the default ITN model is used as the input of the custom ITN model. The matching algorithm inside the custom ITN model is case-insensitive.
53+
54+
There are four categories of pattern matching with custom ITN rules.
55+
- [Patterns with literals](#patterns-with-literals)
56+
- [Patterns with wildcards](#patterns-with-wildcards)
57+
- [Patterns with regex-style notation](#patterns-with-regex-style-notation)
58+
- [Patterns with explicit replacement](#patterns-with-explicit-replacement)
59+
60+
### Patterns with literals
61+
62+
For example, a developer might have an item (such as a product) named with the alphanumeric form `JO:500`. The Speech service works out that users might say the letter part as `J O` or as `joe`, and the number part as `five hundred`, `five zero zero`, `five oh oh`, or `five double zero`. It then builds a model that maps all of these possibilities back to `JO:500` (including inserting the colon).
63+
64+
Patterns can be applied in parallel by specifying one rule per line in the display text formatting file. Here is an example of a display text formatting file that specifies two rules:
65+
66+
```text
67+
JO:500
68+
MM:760
69+
```
70+
71+
### Patterns with wildcards
72+
73+
Suppose a customer needs to refer to a whole series of alphanumeric items named `JO:500`, `JO:600`, `JO:700`, and so on. There are several ways to support this without spelling out every possibility.
74+
75+
Character ranges can be specified with the notation `[...]`, so `JO:[5-7]00` is equivalent to writing out three patterns.
76+
77+
There's also a set of wildcard items that can be used. One of these is `\d`, which means any digit. So `JO:\d00` covers `JO:000`, `JO:100`, and others up to `JO:900`.
78+
79+
Like a regular expression, there are several predefined character classes for an ITN rule:
80+
81+
* `\d` - match a digit from '0' to '9', and output it directly
82+
* `\l` - match a letter (case-insensitive) and transduce it to lower case
83+
* `\u` - match a letter (case-insensitive) and transduce it to upper case
84+
* `\a` - match a letter (case-insensitive) and output it directly
85+
86+
There are also escape expressions for referring to characters that otherwise have special syntactic meaning:
87+
88+
* `\\` - match and output the char `\`
89+
* `\(` and `\)`
90+
* `\{` and `\}`
91+
* `\|`
92+
* `\+` and `\?` and `\*`
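
To see which literal patterns a range or `\d` stands for, you can enumerate them. The following Python sketch is a hypothetical helper written for this article, not part of any Speech SDK. It assumes that only simple `[x-y]` ranges and the `\d` class appear in the rule, and it doesn't handle character sets such as `[05]` or the escape expressions listed above.

```python
import re
from itertools import product


def expand_ranges(rule: str) -> list[str]:
    """Expand every [x-y] range (and \\d) in a rule into literal patterns."""
    # Treat \d as shorthand for the range [0-9].
    rule = rule.replace(r"\d", "[0-9]")
    # Splitting on a capturing group keeps the ranges in the result.
    parts = re.split(r"(\[.-.\])", rule)
    choices = []
    for part in parts:
        match = re.fullmatch(r"\[(.)-(.)\]", part)
        if match:
            low, high = ord(match.group(1)), ord(match.group(2))
            choices.append([chr(code) for code in range(low, high + 1)])
        else:
            choices.append([part])
    return ["".join(combo) for combo in product(*choices)]


print(expand_ranges("JO:[5-7]00"))
print(len(expand_ranges(r"JO:\d00")))
```

The first call lists the three patterns that `JO:[5-7]00` is equivalent to, and the second shows that `JO:\d00` covers ten patterns, from `JO:000` to `JO:900`.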
93+
94+
### Patterns with regex-style notation
95+
96+
To enhance the flexibility of pattern writing, regular expression-like constructions of phrases with alternatives and Kleene-closure are supported.
97+
98+
* A phrase is indicated with parentheses, like `(...)` - The parentheses don't literally count as characters to be matched.
99+
* You can indicate alternatives within a phrase with the `|` character such as `(AB|CDE)`.
100+
* You can suffix a phrase with `?` to indicate that it's optional, `+` to indicate that it can be repeated, or `*` to indicate both. You can only suffix phrases with these characters and not individual characters (which is more restrictive than most regular expression implementations).
101+
102+
A pattern such as `(AB|CD)-(\d)+` would represent constructs like "AB-9" or "CD-22" and be expanded to spoken words like `A B nine` and `C D twenty two` (or `C D two two`).
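
The display-form side of this rule happens to be valid Python regular expression syntax, so you can sanity-check which written forms it covers before you train a model. This is only a rough local check under that assumption: ITN rule syntax isn't identical to Python's `re` syntax (for example, `\l`, `\u`, `\a`, and `{spoken>written}` have no direct equivalent), and the check says nothing about how the service expands a pattern to spoken words.

```python
import re

# The rule from the text, compiled as an ordinary Python regular expression.
RULE = re.compile(r"(AB|CD)-(\d)+")

for candidate in ["AB-9", "CD-22", "EF-1", "AB-"]:
    # fullmatch requires the whole string to fit the pattern.
    print(candidate, bool(RULE.fullmatch(candidate)))
```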
103+
104+
### Patterns with explicit replacement
105+
106+
The general philosophy is "you show us what the output should look like, and the Speech service figures out how people say it." But this doesn't always work because some scenarios might have quirky unpredictable ways of saying things, or the Speech service background rules might have gaps. For example, there can be colloquial pronunciations for initials and acronyms--`ZPI` might be spoken as `zippy`. In this case a pattern like `ZPI-\d\d` is unlikely to work if a user says `zippy twenty two`. For this sort of situation, there's a display text format notation `{spoken>written}`. This particular case could be written out `{zippy>ZPI}-\d\d`.
107+
108+
This can be useful for handling things that the Speech mapping rules don't yet support. For example, you might write a pattern `\d0-\d0` expecting the system to understand that "-" can mean a range and should be pronounced `to`, as in `twenty to thirty`. But perhaps it doesn't. So you can write a more explicit pattern like `\d0{to>-}\d0` and tell it how you expect the dash to be read.
109+
110+
You can also leave out the `>` and the following written form to indicate words that should be recognized but ignored. So a pattern like `{write} (\u.)+` recognizes `write A B C` and outputs `A.B.C`--dropping the `write` part.
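
If you maintain many rules, it can help to list the explicit replacements that a rule contains. Here's a small Python sketch that extracts the `{spoken>written}` parts from a rule string. The function name and the regular expression are assumptions made for this example. The sketch doesn't account for escaped braces such as `\{`.

```python
import re

# Matches {spoken>written} and {spoken}. A missing written form means the
# spoken words are recognized but dropped from the output.
REPLACEMENT = re.compile(r"\{([^{}>]*)(?:>([^{}]*))?\}")


def explicit_replacements(rule: str) -> list[tuple[str, str]]:
    """Return (spoken, written) pairs; written is '' when words are dropped."""
    return [(m.group(1), m.group(2) or "") for m in REPLACEMENT.finditer(rule)]


print(explicit_replacements(r"{zippy>ZPI}-\d\d"))
print(explicit_replacements(r"\d0{to>-}\d0"))
print(explicit_replacements(r"{write} (\u.)+"))
```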
111+
112+
### Custom ITN examples
113+
114+
#### Group digits
115+
116+
To group 6 digits into two groups and add a '-' character between them:
117+
118+
> ITN rule: `\d\d\d-\d\d\d`
119+
Sample: `"cadence one oh five one fifteen" -> "cadence 105-115"`
120+
121+
#### Format a film name
122+
123+
*Space: 1999* is a famous film. To support it:
124+
125+
> ITN rule: `Space: 1999`
126+
Sample: `"watching space nineteen ninety nine" -> "watching Space: 1999"`
127+
128+
#### Pattern with replacement
129+
130+
> ITN rule: `\d[05]{ to >-}\d[05]`
131+
Sample: `fifteen to twenty -> 15-20`
132+
133+
## Custom rewrite
134+
135+
Generally speaking, for each rewrite rule, a rewrite model tries to replace the `original phrase` in the input string with the corresponding `new phrase`. A rewrite model is a collection of rewrite rules.
136+
137+
* A rewrite rule is a pair of two phrases: the original phrase and a new phrase.
138+
* The two phrases are separated by a TAB character. For example, `original phrase`{TAB}`new phrase`.
139+
* The original phrase is matched (case-insensitive) and replaced with the new phrase (case-sensitive). [Grammar punctuation characters](#grammar-punctuation) in the original phrase are ignored during matching.
140+
* If any rewrite rules conflict, the one with the longer `original phrase` is used as the match.
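
The matching behavior in this list can be approximated locally to preview what a set of rules does. The following Python sketch is a simplified illustration with hypothetical function names, not the service's implementation. It matches case-insensitively, prefers the longer original phrase when rules conflict, and inserts the new phrase exactly as written. It doesn't ignore grammar punctuation in the original phrase.

```python
import re


def parse_rewrite_rules(lines: list[str]) -> dict[str, str]:
    """Map each lowercased original phrase to its new phrase (tab-separated)."""
    rules = {}
    for line in lines:
        original, new = line.split("\t", 1)
        rules[original.lower()] = new
    return rules


def apply_rewrite(text: str, rules: dict[str, str]) -> str:
    """Replace original phrases case-insensitively, preferring longer ones."""
    if not rules:
        return text
    # Longest-first alternation makes the regex engine pick the longer phrase
    # when two rules could match at the same position.
    ordered = sorted(rules, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: rules[m.group(0).lower()], text)


rules = parse_rewrite_rules(["covered 19\tCOVID-19", "covered\tcovered up"])
print(apply_rewrite("Covered 19 is a virus", rules))
```

In this example both rules could match at the start of the sentence, and the longer original phrase `covered 19` is the one that's applied.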
141+
142+
The rewrite model supports grammar capitalization by default, which capitalizes the first letter of a sentence for locales like `en-US`. It's turned off if the capitalization feature of display text formatting is turned off in a speech recognition request.
143+
144+
### Grammar punctuation
145+
146+
Grammar punctuation characters are used to separate a sentence or phrase, and clarify how a sentence or phrase should be read.
147+
148+
> `. , ? 、 ! : ; ? 。 , ¿ ¡ । ؟ ، `
149+
150+
Here are the grammar punctuation rules:
151+
- The supported punctuation characters are grammar punctuation if they're followed by a space or are at the beginning or end of a sentence or phrase. For example, the `.` in `x. y` (with a space between `.` and `y`) is grammar punctuation.
152+
- Punctuation characters that are in the middle of a word (except for `zh-CN` and `ja-JP`) aren't grammar punctuation. In that case, they're ordinary characters. For example, the `.` in `x.y` isn't grammar punctuation.
153+
- For `zh-CN` and `ja-JP` (nonspacing locales), punctuation characters are always used as grammar punctuation even if they're between characters. For example, the `.` in `中.文` is grammar punctuation.
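
The three rules above can be written as a single predicate. This Python sketch is one possible reading of those rules, with a hypothetical function name. The character set is copied from the list of punctuation characters shown earlier in this section.

```python
GRAMMAR_PUNCTUATION = set(".,?、!:;。¿¡।؟،")
NONSPACING_LOCALES = {"zh-cn", "ja-jp"}


def is_grammar_punctuation(text: str, index: int, locale: str) -> bool:
    """Decide whether the character at index acts as grammar punctuation."""
    if text[index] not in GRAMMAR_PUNCTUATION:
        return False
    # Nonspacing locales always treat these characters as grammar punctuation.
    if locale.lower() in NONSPACING_LOCALES:
        return True
    at_start = index == 0
    at_end = index == len(text) - 1
    followed_by_space = not at_end and text[index + 1].isspace()
    return at_start or at_end or followed_by_space


print(is_grammar_punctuation("x. y", 1, "en-US"))
print(is_grammar_punctuation("x.y", 1, "en-US"))
print(is_grammar_punctuation("中.文", 1, "zh-CN"))
```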
154+
155+
### Custom rewrite examples
156+
157+
#### Spelling correction
158+
159+
The name `COVID-19` might be recognized as `covered 19`. To make sure that `COVID-19 is a virus` is displayed instead of `covered 19 is a virus`, use the following rewrite rule:
160+
161+
```text
162+
#rewrite
163+
covered 19{TAB}COVID-19
164+
```
165+
166+
#### Name capitalization
167+
168+
Gottfried Wilhelm Leibniz was a German mathematician. To make sure that `Gottfried Wilhelm Leibniz` is capitalized, use the following rewrite rule:
169+
170+
```text
171+
#rewrite
172+
gottfried wilhelm leibniz{TAB}Gottfried Wilhelm Leibniz
173+
```
174+
175+
## Custom profanity
176+
177+
A custom profanity model acts the same as the base profanity model, except that it uses a custom profanity phrase list. The custom profanity model tries to match all the profanity phrases defined in the display text formatting file.
178+
- The profanity phrases are matched (case-insensitive).
179+
- If any profanity phrase rules conflict, the longest phrase is used as the match.
180+
- These punctuation characters aren't supported in a profanity phrase: `. , ? 、 ! : ; ? 。 , ¿ ¡ । ؟ ، `.
181+
- For `zh-CN` and `ja-JP` locales, multi-word English profanity phrases aren't supported, but single English profanity words are. Profanity phrases for `zh-CN` and `ja-JP` locales are supported.
182+
183+
The profanity is removed or masked depending on your speech recognition request settings.
184+
185+
Once profanity is added in the display text format rule file and the custom model is trained, it's used for the default output in batch speech to text and real-time speech to text.
186+
187+
### Custom profanity examples
188+
189+
Here are some examples of how to mask profanity words and phrases in the display text formatting file.
190+
191+
#### Mask a single profanity word
192+
193+
Assume `xyz` is a profanity word. To add it:
194+
195+
```text
196+
#profanity
197+
xyz
198+
```
199+
200+
Here's a test sample: `Turned on profanity masking to mask xyz -> Turned on profanity masking to mask ***`
201+
202+
#### Mask profanity phrase
203+
204+
Assume `abc lmn` is a profanity phrase. To add it:
205+
206+
```text
207+
#profanity
208+
abc lmn
209+
```
210+
211+
Here's a test sample: `Turned on profanity masking to mask abc lmn -> Turned on profanity masking to mask *** ***`
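
To preview the masked output for a phrase list before you train a model, you can approximate the matching locally. The following Python sketch is an assumption-based illustration with a hypothetical function name, not the service's implementation. It matches case-insensitively, prefers the longest phrase, and replaces every non-space character of a match with an asterisk.

```python
import re


def mask_profanity(text: str, phrases: list[str]) -> str:
    """Mask listed phrases case-insensitively, keeping spaces between words."""
    if not phrases:
        return text
    ordered = sorted(phrases, key=len, reverse=True)  # longest phrase wins
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    # Replace every non-space character of the match with an asterisk.
    return pattern.sub(lambda m: re.sub(r"\S", "*", m.group(0)), text)


print(mask_profanity("Turned on profanity masking to mask xyz", ["xyz"]))
print(mask_profanity("Turned on profanity masking to mask abc lmn", ["abc lmn"]))
```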
212+
213+
## Next steps
214+
215+
- [Test model quantitatively](how-to-custom-speech-evaluate-data.md)
216+
- [Test recognition quality](how-to-custom-speech-inspect-data.md)
217+
- [Train your model](how-to-custom-speech-train-model.md)

articles/ai-services/speech-service/how-to-custom-speech-test-and-train.md

Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -267,6 +267,53 @@ Use <a href="http://sox.sourceforge.net" target="_blank" rel="noopener">SoX</a>
267267
| Check the audio file format. | `sox --i <filename>` |
268268
| Convert the audio file to single channel, 16-bit, 16 KHz. | `sox <input> -b 16 -e signed-integer -c 1 -r 16k -t wav <output>.wav` |
269269

270+
### Custom display text formatting data for training
271+
272+
Learn more about [display text formatting with speech to text](./display-text-format.md).
273+
274+
The display format of automatic speech recognition output is critical to downstream tasks, and one size doesn't fit all. Adding custom display format rules lets you define your own lexical-to-display format rules to improve speech recognition quality on top of the Microsoft Azure custom speech service.
275+
276+
It allows you to fully customize the display output. For example, you can add rewrite rules to capitalize and reformulate certain words, add profanity words to mask from the output, define advanced ITN rules for patterns such as numbers, dates, and email addresses, or preserve some phrases and keep them from any display processing.
277+
278+
For example:
279+
280+
| Custom formatting | Display text |
281+
|-------------------|--------------|
282+
|None|My financial number from contoso is 8BEV3|
283+
|Capitalize "Contoso" (via `#rewrite` rule)<br/>Format financial number (via `#itn` rule)|My financial number from Contoso is 8B-EV-3|
284+
285+
For a list of supported base models and locales for training with structured text, see Language support.
286+
The Display Format file should have an .md extension. The maximum file size is 10 MB, and the text encoding must be UTF-8 with BOM. For more information about customizing Display Format rules, see Display Formatting Rules Best Practice.
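
As a sketch of how you might produce and check such a file locally, the following Python example writes text with the `utf-8-sig` codec (which prepends the byte order mark) and verifies the three constraints before upload. The function names are hypothetical, and reading 10 MB as 10 × 1024 × 1024 bytes is an assumption.

```python
import codecs
from pathlib import Path

# Reading "10 MB" as binary megabytes is an assumption for this sketch.
MAX_BYTES = 10 * 1024 * 1024


def write_display_format_file(path: Path, content: str) -> None:
    """Write the rules with a UTF-8 byte order mark at the start of the file."""
    path.write_text(content, encoding="utf-8-sig")


def check_display_format_file(path: Path) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    data = path.read_bytes()
    if path.suffix != ".md":
        problems.append("file extension should be .md")
    if not data.startswith(codecs.BOM_UTF8):
        problems.append("file should start with a UTF-8 BOM")
    if len(data) > MAX_BYTES:
        problems.append("file is larger than 10 MB")
    return problems
```

For example, calling `write_display_format_file(Path("rules.md"), "#rewrite\n")` and then `check_display_format_file(Path("rules.md"))` returns an empty list.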
287+
288+
|Property|Description|Limits|
289+
|--------|-----------|------|
290+
|#itn|A list of inverse text normalization rules to define certain display patterns such as numbers, addresses, and dates.|Maximum of 200 lines|
291+
|#rewrite|A list of rewrite pairs to replace certain words for reasons such as capitalization and spelling correction.|Maximum of 1,000 lines|
292+
|#profanity|A list of unwanted words that are masked as `******` in the Display and Masked output, in addition to the Microsoft built-in profanity lists.|Maximum of 1,000 lines|
293+
|#test|A list of unit test cases to validate if the display rules work as expected, including the lexical format input and the expected display format output.|Maximum file size of 10 MB|
294+
295+
Here's an example display format file:
296+
297+
```text in .md file
298+
// this is a comment line
299+
// each section must start with a '#' character
300+
#itn
301+
// list of ITN pattern rules, one rule for each line
302+
\d-\d-\d
303+
\d-\l-\l-\d
304+
#rewrite
305+
// list of rewrite rules, each rule has two phrases, separated by a tab character
306+
old phrase	new phrase
307+
#profanity
308+
// list of profanity phrases to be tagged/removed/masked, one line one phrase
309+
fakeprofanity
310+
#test
311+
// list of test cases, each test case has two sentences, input lexical and expected display output
312+
// the two sentences are separated by a tab character
313+
// the expected sentence is the display output of DPP+CDPP models
314+
Mask the fakeprofanity word	Mask the ************* word
315+
```
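
A file in this layout can be split into its sections and checked against the line limits in the table before you upload it. The following Python sketch is a hypothetical local check, not a tool that the Speech service provides. It assumes that comment lines and blank lines don't count toward the limits, and that a header may have whitespace after the `#` character. A rule line that itself starts with `#` would be misread as a section header.

```python
# Line limits from the table above, keyed by lowercased section name.
LIMITS = {"itn": 200, "rewrite": 1000, "profanity": 1000}


def parse_sections(text: str) -> dict[str, list[str]]:
    """Group rule lines under their '#section' header, skipping '//' comments."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue
        if line.startswith("#"):
            current = line[1:].strip().lower()
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


def over_limit(sections: dict[str, list[str]]) -> list[str]:
    """Return the names of sections that have more rule lines than allowed."""
    return [
        name
        for name, rules in sections.items()
        if len(rules) > LIMITS.get(name, float("inf"))
    ]
```

Rule lines are kept unmodified, so the tab characters that separate the two phrases in `#rewrite` and `#test` entries are preserved for later processing.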
316+
270317
## Next steps
271318

272319
- [Upload your data](how-to-custom-speech-upload-data.md)

articles/ai-services/speech-service/how-to-use-audio-input-streams.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ The Speech SDK provides a way to stream audio into the recognizer as an alternat
1818

1919
This guide describes how to use audio input streams. It also describes some of the requirements and limitations of the audio input stream.
2020

21-
See more examples of speech-to-text recognition with audio input stream on [GitHub](https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/csharp/sharedcontent/console/speech_recognition_samples.cs).
21+
See more examples of speech to text recognition with audio input stream on [GitHub](https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/csharp/sharedcontent/console/speech_recognition_samples.cs).
2222

2323
## Identify the format of the audio stream
2424

articles/ai-services/speech-service/includes/how-to/custom-speech/cli-api-kind.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@ Although you don't indicate whether the dataset is for testing or training, you
1717
|Language |Training data: Plain text |
1818
|LanguageMarkdown |Training data: Structured text in markdown format |
1919
|Pronunciation |Training data: Pronunciation |
20+
|OutputFormatting |Training data: Output format |
2021

2122
> [!NOTE]
2223
> Structured text in markdown format training datasets are not supported by version 3.0 of the Speech to text REST API. You must use the [Speech to text REST API v3.1](~/articles/ai-services/speech-service/rest-speech-to-text.md). For more information, see [Migrate code from v3.0 to v3.1 of the REST API](~/articles/ai-services/speech-service/migrate-v3-0-to-v3-1.md).

articles/ai-services/speech-service/includes/how-to/translate-speech/intro.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,4 +12,4 @@ See the speech translation [overview](../../../speech-translation.md) for more i
1212

1313
* Translating speech to text
1414
* Translating speech to multiple target languages
15-
* Performing direct speech-to-speech translation
15+
* Performing direct speech to speech translation

articles/ai-services/speech-service/includes/quickstarts/stt-diarization/intro.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,4 +14,4 @@ In this quickstart, you run an application for speech to text transcription with
1414
The speaker information is included in the result in the speaker ID field. The speaker ID is a generic identifier assigned to each conversation participant by the service during the recognition as different speakers are being identified from the provided audio content.
1515

1616
> [!TIP]
17-
> You can try real-time speech-to-text in [Speech Studio](https://aka.ms/speechstudio/speechtotexttool) without signing up or writing any code. However, the Speech Studio doesn't yet support diarization.
17+
> You can try real-time speech to text in [Speech Studio](https://aka.ms/speechstudio/speechtotexttool) without signing up or writing any code. However, the Speech Studio doesn't yet support diarization.
