Commit 6e8fd37

author
Ke WANG
committed
feat(pronscore): REST API for prosody and SNR
1 parent 6fa1034 commit 6e8fd37

1 file changed

+84
-34
lines changed

articles/ai-services/speech-service/rest-speech-to-text-short.md

@@ -15,7 +15,7 @@ ms.custom: devx-track-csharp
 
 # Speech to text REST API for short audio
 
-Use cases for the Speech to text REST API for short audio are limited. Use it only in cases where you can't use the [Speech SDK](speech-sdk.md).
+Use cases for the Speech to text REST API for short audio are limited. Use it only in cases where you can't use the [Speech SDK](speech-sdk.md).
 
 Before you use the Speech to text REST API for short audio, consider the following limitations:
 
@@ -90,6 +90,7 @@ This table lists required and optional parameters for pronunciation assessment:
 | `Granularity` | The evaluation granularity. Accepted values are:<br><br> `Phoneme`, which shows the score on the full-text, word, and phoneme levels.<br>`Word`, which shows the score on the full-text and word levels. <br>`FullText`, which shows the score on the full-text level only.<br><br> The default setting is `Phoneme`. | Optional |
 | `Dimension` | Defines the output criteria. Accepted values are:<br><br> `Basic`, which shows the accuracy score only. <br>`Comprehensive`, which shows scores on more dimensions (for example, fluency score and completeness score on the full-text level, and error type on the word level).<br><br> To see definitions of different score dimensions and word error types, see [Response properties](#response-properties). The default setting is `Basic`. | Optional |
 | `EnableMiscue` | Enables miscue calculation. With this parameter enabled, the pronounced words are compared to the reference text. They are marked with omission or insertion based on the comparison. Accepted values are `False` and `True`. The default setting is `False`. | Optional |
+| `EnableProsodyAssessment` | Enables prosody assessment for your pronunciation evaluation. This feature assesses aspects such as stress, intonation, speaking speed, and rhythm, and provides insight into the naturalness and expressiveness of your speech.<br><br> If this parameter is enabled, the `ProsodyScore` result value is returned. | Optional |
 | `ScenarioId` | A GUID that indicates a customized point system. | Optional |
 
 Here's example JSON that contains the pronunciation assessment parameters:
@@ -98,23 +99,24 @@ Here's example JSON that contains the pronunciation assessment parameters:
 {
 "ReferenceText": "Good morning.",
 "GradingSystem": "HundredMark",
-"Granularity": "FullText",
-"Dimension": "Comprehensive"
+"Granularity": "Word",
+"Dimension": "Comprehensive",
+"EnableProsodyAssessment": "True"
 }
 ```
 
 The following sample code shows how to build the pronunciation assessment parameters into the `Pronunciation-Assessment` header:
 
 ```csharp
-var pronAssessmentParamsJson = $"{{\"ReferenceText\":\"Good morning.\",\"GradingSystem\":\"HundredMark\",\"Granularity\":\"FullText\",\"Dimension\":\"Comprehensive\"}}";
+var pronAssessmentParamsJson = $"{{\"ReferenceText\":\"Good morning.\",\"GradingSystem\":\"HundredMark\",\"Granularity\":\"Word\",\"Dimension\":\"Comprehensive\",\"EnableProsodyAssessment\":\"True\"}}";
 var pronAssessmentParamsBytes = Encoding.UTF8.GetBytes(pronAssessmentParamsJson);
 var pronAssessmentHeader = Convert.ToBase64String(pronAssessmentParamsBytes);
 ```
 
 We strongly recommend streaming ([chunked transfer](#chunked-transfer)) uploading while you're posting the audio data, which can significantly reduce the latency. To learn how to enable streaming, see the [sample code in various programming languages](https://github.com/Azure-Samples/Cognitive-Speech-TTS/tree/master/PronunciationAssessment).
 
 > [!NOTE]
-> For more information, see [pronunciation assessment](how-to-pronunciation-assessment.md).
+> For more information, see [pronunciation assessment](how-to-pronunciation-assessment.md).
 
 ## Sample request
 
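For readers who are not working in C#, the header construction in the hunk above can be sketched in Python. This is a minimal sketch, not part of the commit: the function name `build_pron_assessment_header` is hypothetical, and the parameter values simply mirror the JSON example in this diff.

```python
import base64
import json


def build_pron_assessment_header(reference_text: str) -> str:
    """Return a Base64 value suitable for the Pronunciation-Assessment header.

    Hypothetical helper; the parameter names and values mirror the JSON
    example shown in the diff above.
    """
    params = {
        "ReferenceText": reference_text,
        "GradingSystem": "HundredMark",
        "Granularity": "Word",
        "Dimension": "Comprehensive",
        # Sent as the string "True", as in the documented example.
        "EnableProsodyAssessment": "True",
    }
    # Serialize to JSON, encode as UTF-8 bytes, then Base64-encode,
    # following the same three steps as the C# sample.
    params_json = json.dumps(params)
    return base64.b64encode(params_json.encode("utf-8")).decode("ascii")


header_value = build_pron_assessment_header("Good morning.")
```

The resulting string would be sent as the value of the `Pronunciation-Assessment` request header alongside the audio data.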
@@ -193,36 +195,83 @@ Here's a typical response for recognition with pronunciation assessment:
 ```json
 {
 "RecognitionStatus": "Success",
-"Offset": "400000",
-"Duration": "11000000",
+"Offset": 700000,
+"Duration": 8400000,
+"DisplayText": "Good morning.",
+"SNR": 38.76819,
 "NBest": [
-{
-"Confidence" : "0.87",
-"Lexical" : "good morning",
-"ITN" : "good morning",
-"MaskedITN" : "good morning",
-"Display" : "Good morning.",
-"PronScore" : 84.4,
-"AccuracyScore" : 100.0,
-"FluencyScore" : 74.0,
-"CompletenessScore" : 100.0,
-"Words": [
-{
-"Word" : "Good",
-"AccuracyScore" : 100.0,
-"ErrorType" : "None",
-"Offset" : 500000,
-"Duration" : 2700000
-},
-{
-"Word" : "morning",
-"AccuracyScore" : 100.0,
-"ErrorType" : "None",
-"Offset" : 5300000,
-"Duration" : 900000
+{
+"Confidence": 0.98503506,
+"Lexical": "good morning",
+"ITN": "good morning",
+"MaskedITN": "good morning",
+"Display": "Good morning.",
+"AccuracyScore": 100.0,
+"FluencyScore": 100.0,
+"ProsodyScore": 87.8,
+"CompletenessScore": 100.0,
+"PronScore": 95.1,
+"Words": [
+{
+"Word": "good",
+"Offset": 700000,
+"Duration": 2600000,
+"Confidence": 0.0,
+"AccuracyScore": 100.0,
+"ErrorType": "None",
+"Feedback": {
+"Prosody": {
+"Break": {
+"ErrorTypes": [
+"None"
+],
+"BreakLength": 0
+},
+"Intonation": {
+"ErrorTypes": [],
+"Monotone": {
+"Confidence": 0.0,
+"WordPitchSlopeConfidence": 0.0,
+"SyllablePitchDeltaConfidence": 0.91385907
+}
+}
+}
+}
+},
+{
+"Word": "morning",
+"Offset": 3400000,
+"Duration": 5700000,
+"Confidence": 0.0,
+"AccuracyScore": 100.0,
+"ErrorType": "None",
+"Feedback": {
+"Prosody": {
+"Break": {
+"ErrorTypes": [
+"None"
+],
+"UnexpectedBreak": {
+"Confidence": 3.5294118e-08
+},
+"MissingBreak": {
+"Confidence": 1.0
+},
+"BreakLength": 0
+},
+"Intonation": {
+"ErrorTypes": [],
+"Monotone": {
+"Confidence": 0.0,
+"WordPitchSlopeConfidence": 0.0,
+"SyllablePitchDeltaConfidence": 0.91385907
+}
+}
 }
-]
-}
+}
+}
+]
+}
 ]
 }
 ```
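To illustrate how a client might consume the new `SNR` and `ProsodyScore` fields, here is a minimal Python sketch. It is not part of the commit: the function name `summarize` is hypothetical, the embedded JSON is a trimmed copy of the sample response above (word-level `Feedback` omitted), and taking the first `NBest` entry is an assumption made for the example.

```python
import json

# Trimmed copy of the sample response from the diff above.
sample_response = """
{
  "RecognitionStatus": "Success",
  "Offset": 700000,
  "Duration": 8400000,
  "DisplayText": "Good morning.",
  "SNR": 38.76819,
  "NBest": [
    {
      "Confidence": 0.98503506,
      "Display": "Good morning.",
      "AccuracyScore": 100.0,
      "FluencyScore": 100.0,
      "ProsodyScore": 87.8,
      "CompletenessScore": 100.0,
      "PronScore": 95.1,
      "Words": [
        {"Word": "good", "AccuracyScore": 100.0, "ErrorType": "None"},
        {"Word": "morning", "AccuracyScore": 100.0, "ErrorType": "None"}
      ]
    }
  ]
}
"""


def summarize(response_text: str):
    """Extract SNR, prosody, and per-word error types from a response body.

    Hypothetical helper. Returns None when recognition did not succeed
    or when no NBest entry is present.
    """
    body = json.loads(response_text)
    if body.get("RecognitionStatus") != "Success":
        return None
    nbest = body.get("NBest") or []
    if not nbest:
        return None
    best = nbest[0]  # Assumption: use the first candidate.
    return {
        "snr": body.get("SNR"),
        # ProsodyScore is absent unless prosody assessment was enabled,
        # so .get() is used to tolerate a missing key.
        "prosody_score": best.get("ProsodyScore"),
        "pron_score": best.get("PronScore"),
        "words": [(w["Word"], w["ErrorType"]) for w in best.get("Words", [])],
    }


summary = summarize(sample_response)
```

Using `.get()` for `SNR` and `ProsodyScore` keeps the sketch tolerant of responses produced without prosody assessment, where those keys may be missing.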
@@ -237,6 +286,7 @@ Results are provided as JSON. The `simple` format includes the following top-lev
 |`DisplayText`|The recognized text after capitalization, punctuation, inverse text normalization, and profanity masking. Present only on success. Inverse text normalization is conversion of spoken text to shorter forms, such as 200 for "two hundred" or "Dr. Smith" for "doctor smith."|
 |`Offset`|The time (in 100-nanosecond units) at which the recognized speech begins in the audio stream.|
 |`Duration`|The duration (in 100-nanosecond units) of the recognized speech in the audio stream.|
+|`SNR`|The signal-to-noise ratio (SNR) of the recognized speech in the audio stream.|
 
 The `RecognitionStatus` field might contain these values:
 
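Because `Offset` and `Duration` are expressed in 100-nanosecond units, a small conversion helps when reading the sample values. The following Python sketch is illustrative only; the names `TICKS_PER_SECOND`, `ticks_to_seconds`, and `span_seconds` are hypothetical and not part of the API.

```python
# One second contains 10,000,000 units of 100 nanoseconds each.
TICKS_PER_SECOND = 10_000_000


def ticks_to_seconds(ticks: int) -> float:
    """Convert a value in 100-nanosecond units to seconds."""
    return ticks / TICKS_PER_SECOND


def span_seconds(offset: int, duration: int) -> tuple:
    """Return (start, end) in seconds for an Offset/Duration pair."""
    return ticks_to_seconds(offset), ticks_to_seconds(offset + duration)


# Values taken from the word "morning" in the sample response in this diff.
start, end = span_seconds(3400000, 5700000)
```

With the sample values, the top-level `Offset` of 700000 corresponds to 0.07 seconds and the `Duration` of 8400000 to 0.84 seconds.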
@@ -265,6 +315,7 @@ The object in the `NBest` list can include:
 | `Display` | The display form of the recognized text, with punctuation and capitalization added. This parameter is the same as what `DisplayText` provides when the format is set to `simple`. |
 | `AccuracyScore` | Pronunciation accuracy of the speech. Accuracy indicates how closely the phonemes match a native speaker's pronunciation. The accuracy score at the word and full-text levels is aggregated from the accuracy score at the phoneme level. |
 | `FluencyScore` | Fluency of the provided speech. Fluency indicates how closely the speech matches a native speaker's use of silent breaks between words. |
+| `ProsodyScore` | Prosody of the given speech. Prosody indicates how natural the given speech is, including stress, intonation, speaking speed, and rhythm.<br><br> For detailed definitions of the prosody assessment results, see [Result parameters](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-pronunciation-assessment?pivots=programming-language-csharp#result-parameters). |
 | `CompletenessScore` | Completeness of the speech, determined by calculating the ratio of pronounced words to reference text input. |
 | `PronScore` | Overall score that indicates the pronunciation quality of the provided speech. This score is aggregated from `AccuracyScore`, `FluencyScore`, and `CompletenessScore` with weight. |
 | `ErrorType` | Value that indicates whether a word is omitted, inserted, or badly pronounced, compared to `ReferenceText`. Possible values are `None` (meaning no error on this word), `Omission`, `Insertion`, and `Mispronunciation`. |
@@ -313,4 +364,3 @@ using (var fs = new FileStream(audioFile, FileMode.Open, FileAccess.Read))
 
 - [Customize speech models](./how-to-custom-speech-train-model.md)
 - [Get familiar with batch transcription](batch-transcription.md)
-