articles/ai-services/speech-service/rest-speech-to-text-short.md (84 additions, 34 deletions)
@@ -15,7 +15,7 @@ ms.custom: devx-track-csharp
# Speech to text REST API for short audio

-Use cases for the Speech to text REST API for short audio are limited. Use it only in cases where you can't use the [Speech SDK](speech-sdk.md).
+Use cases for the Speech to text REST API for short audio are limited. Use it only in cases where you can't use the [Speech SDK](speech-sdk.md).

Before you use the Speech to text REST API for short audio, consider the following limitations:
@@ -90,6 +90,7 @@ This table lists required and optional parameters for pronunciation assessment:
|`Granularity`| The evaluation granularity. Accepted values are:<br><br> `Phoneme`, which shows the score on the full-text, word, and phoneme levels.<br>`Word`, which shows the score on the full-text and word levels. <br>`FullText`, which shows the score on the full-text level only.<br><br> The default setting is `Phoneme`. | Optional |
|`Dimension`| Defines the output criteria. Accepted values are:<br><br> `Basic`, which shows the accuracy score only. <br>`Comprehensive`, which shows scores on more dimensions (for example, fluency score and completeness score on the full-text level, and error type on the word level).<br><br> To see definitions of different score dimensions and word error types, see [Response properties](#response-properties). The default setting is `Basic`. | Optional |
|`EnableMiscue`| Enables miscue calculation. With this parameter enabled, the pronounced words are compared to the reference text. They are marked with omission or insertion based on the comparison. Accepted values are `False` and `True`. The default setting is `False`. | Optional |
+|`EnableProsodyAssessment`| Enables prosody assessment for your pronunciation evaluation. This feature assesses aspects like stress, intonation, speaking speed, and rhythm, and provides insights into the naturalness and expressiveness of your speech.<br><br> When this parameter is enabled, the `ProsodyScore` result value is returned. | Optional |
|`ScenarioId`| A GUID that indicates a customized point system. | Optional |

Here's example JSON that contains the pronunciation assessment parameters:
@@ -98,23 +99,24 @@ Here's example JSON that contains the pronunciation assessment parameters:
```json
{
    "ReferenceText": "Good morning.",
    "GradingSystem": "HundredMark",
-    "Granularity": "FullText",
-    "Dimension": "Comprehensive"
+    "Granularity": "Word",
+    "Dimension": "Comprehensive",
+    "EnableProsodyAssessment": "True"
}
```
The following sample code shows how to build the pronunciation assessment parameters into the `Pronunciation-Assessment` header:
We strongly recommend streaming ([chunked transfer](#chunked-transfer)) uploading while you're posting the audio data, which can significantly reduce the latency. To learn how to enable streaming, see the [sample code in various programming languages](https://github.com/Azure-Samples/Cognitive-Speech-TTS/tree/master/PronunciationAssessment).
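That C# sample isn't reproduced in this diff view. As a minimal, hedged sketch of the same step, assuming the `Pronunciation-Assessment` header value is the base64-encoded UTF-8 JSON of the parameters shown above (the `HttpClient` line in the trailing comment is illustrative, not the article's own sample):

```csharp
// Minimal sketch: encode the pronunciation assessment parameters for the
// Pronunciation-Assessment header. Assumes the header carries the
// base64-encoded UTF-8 JSON of the parameters shown above.
using System;
using System.Text;

class PronunciationAssessmentHeaderSketch
{
    static void Main()
    {
        string parametersJson =
            "{\"ReferenceText\":\"Good morning.\"," +
            "\"GradingSystem\":\"HundredMark\"," +
            "\"Granularity\":\"Word\"," +
            "\"Dimension\":\"Comprehensive\"," +
            "\"EnableProsodyAssessment\":\"True\"}";

        // Base64-encode the UTF-8 bytes of the JSON to get the header value.
        string headerValue = Convert.ToBase64String(Encoding.UTF8.GetBytes(parametersJson));
        Console.WriteLine(headerValue);

        // With HttpClient (illustrative):
        // client.DefaultRequestHeaders.Add("Pronunciation-Assessment", headerValue);
    }
}
```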
> [!NOTE]
-> For more information, see [pronunciation assessment](how-to-pronunciation-assessment.md).
+> For more information, see [pronunciation assessment](how-to-pronunciation-assessment.md).

## Sample request
@@ -193,36 +195,83 @@ Here's a typical response for recognition with pronunciation assessment:
```json
{
  "RecognitionStatus": "Success",
-  "Offset": "400000",
-  "Duration": "11000000",
+  "Offset": 700000,
+  "Duration": 8400000,
+  "DisplayText": "Good morning.",
+  "SNR": 38.76819,
  "NBest": [
-    {
-      "Confidence" : "0.87",
-      "Lexical" : "good morning",
-      "ITN" : "good morning",
-      "MaskedITN" : "good morning",
-      "Display" : "Good morning.",
-      "PronScore" : 84.4,
-      "AccuracyScore" : 100.0,
-      "FluencyScore" : 74.0,
-      "CompletenessScore" : 100.0,
-      "Words": [
-        {
-          "Word" : "Good",
-          "AccuracyScore" : 100.0,
-          "ErrorType" : "None",
-          "Offset" : 500000,
-          "Duration" : 2700000
-        },
-        {
-          "Word" : "morning",
-          "AccuracyScore" : 100.0,
-          "ErrorType" : "None",
-          "Offset" : 5300000,
-          "Duration" : 900000
+    {
+      "Confidence": 0.98503506,
+      "Lexical": "good morning",
+      "ITN": "good morning",
+      "MaskedITN": "good morning",
+      "Display": "Good morning.",
+      "AccuracyScore": 100.0,
+      "FluencyScore": 100.0,
+      "ProsodyScore": 87.8,
+      "CompletenessScore": 100.0,
+      "PronScore": 95.1,
+      "Words": [
+        {
+          "Word": "good",
+          "Offset": 700000,
+          "Duration": 2600000,
+          "Confidence": 0.0,
+          "AccuracyScore": 100.0,
+          "ErrorType": "None",
+          "Feedback": {
+            "Prosody": {
+              "Break": {
+                "ErrorTypes": [
+                  "None"
+                ],
+                "BreakLength": 0
+              },
+              "Intonation": {
+                "ErrorTypes": [],
+                "Monotone": {
+                  "Confidence": 0.0,
+                  "WordPitchSlopeConfidence": 0.0,
+                  "SyllablePitchDeltaConfidence": 0.91385907
+                }
+              }
+            }
+          }
+        },
+        {
+          "Word": "morning",
+          "Offset": 3400000,
+          "Duration": 5700000,
+          "Confidence": 0.0,
+          "AccuracyScore": 100.0,
+          "ErrorType": "None",
+          "Feedback": {
+            "Prosody": {
+              "Break": {
+                "ErrorTypes": [
+                  "None"
+                ],
+                "UnexpectedBreak": {
+                  "Confidence": 3.5294118e-08
+                },
+                "MissingBreak": {
+                  "Confidence": 1.0
+                },
+                "BreakLength": 0
+              },
+              "Intonation": {
+                "ErrorTypes": [],
+                "Monotone": {
+                  "Confidence": 0.0,
+                  "WordPitchSlopeConfidence": 0.0,
+                  "SyllablePitchDeltaConfidence": 0.91385907
+                }
+              }
            }
-      ]
-    }
+          }
+        }
+      ]
+    }
  ]
}
```
@@ -237,6 +286,7 @@ Results are provided as JSON. The `simple` format includes the following top-level fields:
|`DisplayText`|The recognized text after capitalization, punctuation, inverse text normalization, and profanity masking. Present only on success. Inverse text normalization is conversion of spoken text to shorter forms, such as 200 for "two hundred" or "Dr. Smith" for "doctor smith."|
|`Offset`|The time (in 100-nanosecond units) at which the recognized speech begins in the audio stream.|
|`Duration`|The duration (in 100-nanosecond units) of the recognized speech in the audio stream.|
+|`SNR`|The signal-to-noise ratio (SNR) of the recognized speech in the audio stream.|
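A 100-nanosecond unit is the same size as a .NET `TimeSpan` tick, so a small, hedged sketch of converting `Offset` and `Duration` into readable times (the values come from the sample response above):

```csharp
// Minimal sketch: convert Offset and Duration (100-nanosecond units) from the
// sample response into TimeSpan values. One .NET tick is also 100 nanoseconds.
using System;

class OffsetDurationSketch
{
    static void Main()
    {
        long offset = 700000;     // from the sample response
        long duration = 8400000;

        TimeSpan start = TimeSpan.FromTicks(offset);
        TimeSpan length = TimeSpan.FromTicks(duration);

        Console.WriteLine($"Speech starts at {start.TotalSeconds:F2} s and lasts {length.TotalSeconds:F2} s.");
        // Output: Speech starts at 0.07 s and lasts 0.84 s.
    }
}
```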
The `RecognitionStatus` field might contain these values:
@@ -265,6 +315,7 @@ The object in the `NBest` list can include:
|`Display`| The display form of the recognized text, with punctuation and capitalization added. This parameter is the same as what `DisplayText` provides when the format is set to `simple`. |
|`AccuracyScore`| Pronunciation accuracy of the speech. Accuracy indicates how closely the phonemes match a native speaker's pronunciation. The accuracy score at the word and full-text levels is aggregated from the accuracy score at the phoneme level. |
|`FluencyScore`| Fluency of the provided speech. Fluency indicates how closely the speech matches a native speaker's use of silent breaks between words. |
+|`ProsodyScore`| Prosody of the provided speech. Prosody indicates how natural the speech sounds, including stress, intonation, speaking speed, and rhythm.<br><br> For detailed definitions of the prosody assessment results, see [Result parameters](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-pronunciation-assessment?pivots=programming-language-csharp#result-parameters). |
|`CompletenessScore`| Completeness of the speech, determined by calculating the ratio of pronounced words to reference text input. |
|`PronScore`| Overall score that indicates the pronunciation quality of the provided speech. This score is aggregated from `AccuracyScore`, `FluencyScore`, and `CompletenessScore` with weight. |
|`ErrorType`| Value that indicates whether a word is omitted, inserted, or badly pronounced, compared to `ReferenceText`. Possible values are `None` (meaning no error on this word), `Omission`, `Insertion`, and `Mispronunciation`. |
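As a quick, hedged illustration of consuming these fields, here's a minimal sketch that reads the overall and per-word scores out of the detailed JSON shown in the sample response; the file name is a placeholder for wherever you saved the response body:

```csharp
// Minimal sketch: pull the overall and per-word scores out of a saved copy of
// the detailed JSON response shown earlier. "response.json" is a placeholder.
using System;
using System.IO;
using System.Text.Json;

class ReadScoresSketch
{
    static void Main()
    {
        using JsonDocument doc = JsonDocument.Parse(File.ReadAllText("response.json"));
        JsonElement best = doc.RootElement.GetProperty("NBest")[0];

        Console.WriteLine($"PronScore: {best.GetProperty("PronScore").GetDouble()}");
        Console.WriteLine($"ProsodyScore: {best.GetProperty("ProsodyScore").GetDouble()}");

        foreach (JsonElement word in best.GetProperty("Words").EnumerateArray())
        {
            Console.WriteLine(
                $"{word.GetProperty("Word").GetString()}: " +
                $"accuracy {word.GetProperty("AccuracyScore").GetDouble()}, " +
                $"error type {word.GetProperty("ErrorType").GetString()}");
        }
    }
}
```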
@@ -313,4 +364,3 @@ using (var fs = new FileStream(audioFile, FileMode.Open, FileAccess.Read))
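The final hunk touches the article's C# upload sample (`using (var fs = new FileStream(audioFile, FileMode.Open, FileAccess.Read))`), which isn't reproduced in this view. As a rough, hedged sketch of the chunked-transfer upload recommended earlier, using `HttpClient` rather than the article's own code (region, subscription key, and file name are placeholders):

```csharp
// Rough sketch: POST a WAV file to the short-audio endpoint with chunked
// transfer encoding. Region, subscription key, and file name are placeholders.
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class ChunkedUploadSketch
{
    static async Task Main()
    {
        var endpoint = "https://westus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US&format=detailed";

        using var client = new HttpClient();
        using var fs = new FileStream("whatstheweatherlike.wav", FileMode.Open, FileAccess.Read);

        using var request = new HttpRequestMessage(HttpMethod.Post, endpoint);
        request.Headers.Add("Ocp-Apim-Subscription-Key", "YourSubscriptionKey");
        request.Headers.TransferEncodingChunked = true; // stream the audio instead of buffering it
        request.Content = new StreamContent(fs);
        request.Content.Headers.TryAddWithoutValidation(
            "Content-Type", "audio/wav; codecs=audio/pcm; samplerate=16000");

        HttpResponseMessage response = await client.SendAsync(request);
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}
```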