articles/ai-services/speech-service/rest-speech-to-text-short.md (84 additions, 34 deletions)
@@ -15,7 +15,7 @@ ms.custom: devx-track-csharp
# Speech to text REST API for short audio

-Use cases for the Speech to text REST API for short audio are limited. Use it only in cases where you can't use the [Speech SDK](speech-sdk.md).
+Use cases for the Speech to text REST API for short audio are limited. Use it only in cases where you can't use the [Speech SDK](speech-sdk.md).

Before you use the Speech to text REST API for short audio, consider the following limitations:
@@ -90,6 +90,7 @@ This table lists required and optional parameters for pronunciation assessment:
|`Granularity`| The evaluation granularity. Accepted values are:<br><br> `Phoneme`, which shows the score on the full-text, word, and phoneme levels.<br>`Word`, which shows the score on the full-text and word levels. <br>`FullText`, which shows the score on the full-text level only.<br><br> The default setting is `Phoneme`. | Optional |
|`Dimension`| Defines the output criteria. Accepted values are:<br><br> `Basic`, which shows the accuracy score only. <br>`Comprehensive`, which shows scores on more dimensions (for example, fluency score and completeness score on the full-text level, and error type on the word level).<br><br> To see definitions of different score dimensions and word error types, see [Response properties](#response-properties). The default setting is `Basic`. | Optional |
|`EnableMiscue`| Enables miscue calculation. With this parameter enabled, the pronounced words are compared to the reference text. They are marked with omission or insertion based on the comparison. Accepted values are `False` and `True`. The default setting is `False`. | Optional |
+|`EnableProsodyAssessment`| Enables prosody assessment for your pronunciation evaluation. This feature assesses aspects like stress, intonation, speaking speed, and rhythm, and provides insights into the naturalness and expressiveness of your speech.<br><br> When this parameter is enabled, the `ProsodyScore` result value is returned. | Optional |
|`ScenarioId`| A GUID that indicates a customized point system. | Optional |

Here's example JSON that contains the pronunciation assessment parameters:
@@ -98,23 +99,24 @@ Here's example JSON that contains the pronunciation assessment parameters:
```json
{
    "ReferenceText": "Good morning.",
    "GradingSystem": "HundredMark",
-    "Granularity": "FullText",
-    "Dimension": "Comprehensive"
+    "Granularity": "Word",
+    "Dimension": "Comprehensive",
+    "EnableProsodyAssessment": "True"
}
```
The following sample code shows how to build the pronunciation assessment parameters into the `Pronunciation-Assessment` header:
We strongly recommend streaming ([chunked transfer](#chunked-transfer)) uploading while you're posting the audio data, which can significantly reduce the latency. To learn how to enable streaming, see the [sample code in various programming languages](https://github.com/Azure-Samples/Cognitive-Speech-TTS/tree/master/PronunciationAssessment).
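That C# sample isn't reproduced in this diff view. As a minimal, hedged sketch of the same step, assuming the `Pronunciation-Assessment` header value is the base64-encoded UTF-8 JSON of the parameters shown above (the `HttpClient` line in the trailing comment is illustrative, not the article's own sample):

```csharp
// Minimal sketch: encode the pronunciation assessment parameters for the
// Pronunciation-Assessment header. Assumes the header carries the
// base64-encoded UTF-8 JSON of the parameters shown above.
using System;
using System.Text;

class PronunciationAssessmentHeaderSketch
{
    static void Main()
    {
        string parametersJson =
            "{\"ReferenceText\":\"Good morning.\"," +
            "\"GradingSystem\":\"HundredMark\"," +
            "\"Granularity\":\"Word\"," +
            "\"Dimension\":\"Comprehensive\"," +
            "\"EnableProsodyAssessment\":\"True\"}";

        // Base64-encode the UTF-8 bytes of the JSON to get the header value.
        string headerValue = Convert.ToBase64String(Encoding.UTF8.GetBytes(parametersJson));
        Console.WriteLine(headerValue);

        // With HttpClient (illustrative):
        // client.DefaultRequestHeaders.Add("Pronunciation-Assessment", headerValue);
    }
}
```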
> [!NOTE]
-> For more information, see [pronunciation assessment](how-to-pronunciation-assessment.md).
+> For more information, see [pronunciation assessment](how-to-pronunciation-assessment.md).

## Sample request
@@ -193,36 +195,83 @@ Here's a typical response for recognition with pronunciation assessment:
```json
{
  "RecognitionStatus": "Success",
-  "Offset": "400000",
-  "Duration": "11000000",
+  "Offset": 700000,
+  "Duration": 8400000,
+  "DisplayText": "Good morning.",
+  "SNR": 38.76819,
  "NBest": [
-    {
-      "Confidence" : "0.87",
-      "Lexical" : "good morning",
-      "ITN" : "good morning",
-      "MaskedITN" : "good morning",
-      "Display" : "Good morning.",
-      "PronScore" : 84.4,
-      "AccuracyScore" : 100.0,
-      "FluencyScore" : 74.0,
-      "CompletenessScore" : 100.0,
-      "Words": [
-        {
-          "Word" : "Good",
-          "AccuracyScore" : 100.0,
-          "ErrorType" : "None",
-          "Offset" : 500000,
-          "Duration" : 2700000
-        },
-        {
-          "Word" : "morning",
-          "AccuracyScore" : 100.0,
-          "ErrorType" : "None",
-          "Offset" : 5300000,
-          "Duration" : 900000
+    {
+      "Confidence": 0.98503506,
+      "Lexical": "good morning",
+      "ITN": "good morning",
+      "MaskedITN": "good morning",
+      "Display": "Good morning.",
+      "AccuracyScore": 100.0,
+      "FluencyScore": 100.0,
+      "ProsodyScore": 87.8,
+      "CompletenessScore": 100.0,
+      "PronScore": 95.1,
+      "Words": [
+        {
+          "Word": "good",
+          "Offset": 700000,
+          "Duration": 2600000,
+          "Confidence": 0.0,
+          "AccuracyScore": 100.0,
+          "ErrorType": "None",
+          "Feedback": {
+            "Prosody": {
+              "Break": {
+                "ErrorTypes": [
+                  "None"
+                ],
+                "BreakLength": 0
+              },
+              "Intonation": {
+                "ErrorTypes": [],
+                "Monotone": {
+                  "Confidence": 0.0,
+                  "WordPitchSlopeConfidence": 0.0,
+                  "SyllablePitchDeltaConfidence": 0.91385907
+                }
+              }
+            }
+          }
+        },
+        {
+          "Word": "morning",
+          "Offset": 3400000,
+          "Duration": 5700000,
+          "Confidence": 0.0,
+          "AccuracyScore": 100.0,
+          "ErrorType": "None",
+          "Feedback": {
+            "Prosody": {
+              "Break": {
+                "ErrorTypes": [
+                  "None"
+                ],
+                "UnexpectedBreak": {
+                  "Confidence": 3.5294118e-08
+                },
+                "MissingBreak": {
+                  "Confidence": 1.0
+                },
+                "BreakLength": 0
+              },
+              "Intonation": {
+                "ErrorTypes": [],
+                "Monotone": {
+                  "Confidence": 0.0,
+                  "WordPitchSlopeConfidence": 0.0,
+                  "SyllablePitchDeltaConfidence": 0.91385907
+                }
+              }
            }
-      ]
-    }
+          }
+        }
+      ]
+    }
  ]
}
```
@@ -237,6 +286,7 @@ Results are provided as JSON. The `simple` format includes the following top-level fields:
|`DisplayText`|The recognized text after capitalization, punctuation, inverse text normalization, and profanity masking. Present only on success. Inverse text normalization is conversion of spoken text to shorter forms, such as 200 for "two hundred" or "Dr. Smith" for "doctor smith."|
|`Offset`|The time (in 100-nanosecond units) at which the recognized speech begins in the audio stream.|
|`Duration`|The duration (in 100-nanosecond units) of the recognized speech in the audio stream.|
+|`SNR`|The signal-to-noise ratio (SNR) of the recognized speech in the audio stream.|
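A 100-nanosecond unit is the same size as a .NET `TimeSpan` tick, so a small, hedged sketch of converting `Offset` and `Duration` into readable times (the values come from the sample response above):

```csharp
// Minimal sketch: convert Offset and Duration (100-nanosecond units) from the
// sample response into TimeSpan values. One .NET tick is also 100 nanoseconds.
using System;

class OffsetDurationSketch
{
    static void Main()
    {
        long offset = 700000;     // from the sample response
        long duration = 8400000;

        TimeSpan start = TimeSpan.FromTicks(offset);
        TimeSpan length = TimeSpan.FromTicks(duration);

        Console.WriteLine($"Speech starts at {start.TotalSeconds:F2} s and lasts {length.TotalSeconds:F2} s.");
        // Output: Speech starts at 0.07 s and lasts 0.84 s.
    }
}
```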
The `RecognitionStatus` field might contain these values:
@@ -265,6 +315,7 @@ The object in the `NBest` list can include:
|`Display`| The display form of the recognized text, with punctuation and capitalization added. This parameter is the same as what `DisplayText` provides when the format is set to `simple`. |
|`AccuracyScore`| Pronunciation accuracy of the speech. Accuracy indicates how closely the phonemes match a native speaker's pronunciation. The accuracy score at the word and full-text levels is aggregated from the accuracy score at the phoneme level. |
|`FluencyScore`| Fluency of the provided speech. Fluency indicates how closely the speech matches a native speaker's use of silent breaks between words. |
+|`ProsodyScore`| Prosody of the provided speech. Prosody indicates how natural the speech sounds, including stress, intonation, speaking speed, and rhythm.<br><br> For detailed definitions of the prosody assessment results, see [Result parameters](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-pronunciation-assessment?pivots=programming-language-csharp#result-parameters). |
|`CompletenessScore`| Completeness of the speech, determined by calculating the ratio of pronounced words to reference text input. |
|`PronScore`| Overall score that indicates the pronunciation quality of the provided speech. This score is aggregated from `AccuracyScore`, `FluencyScore`, and `CompletenessScore` with weight. |
|`ErrorType`| Value that indicates whether a word is omitted, inserted, or badly pronounced, compared to `ReferenceText`. Possible values are `None` (meaning no error on this word), `Omission`, `Insertion`, and `Mispronunciation`. |
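As a quick, hedged illustration of consuming these fields, here's a minimal sketch that reads the overall and per-word scores out of the detailed JSON shown in the sample response; the file name is a placeholder for wherever you saved the response body:

```csharp
// Minimal sketch: pull the overall and per-word scores out of a saved copy of
// the detailed JSON response shown earlier. "response.json" is a placeholder.
using System;
using System.IO;
using System.Text.Json;

class ReadScoresSketch
{
    static void Main()
    {
        using JsonDocument doc = JsonDocument.Parse(File.ReadAllText("response.json"));
        JsonElement best = doc.RootElement.GetProperty("NBest")[0];

        Console.WriteLine($"PronScore: {best.GetProperty("PronScore").GetDouble()}");
        Console.WriteLine($"ProsodyScore: {best.GetProperty("ProsodyScore").GetDouble()}");

        foreach (JsonElement word in best.GetProperty("Words").EnumerateArray())
        {
            Console.WriteLine(
                $"{word.GetProperty("Word").GetString()}: " +
                $"accuracy {word.GetProperty("AccuracyScore").GetDouble()}, " +
                $"error type {word.GetProperty("ErrorType").GetString()}");
        }
    }
}
```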
@@ -313,4 +364,3 @@ using (var fs = new FileStream(audioFile, FileMode.Open, FileAccess.Read))
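The final hunk touches the article's C# upload sample (`using (var fs = new FileStream(audioFile, FileMode.Open, FileAccess.Read))`), which isn't reproduced in this view. As a rough, hedged sketch of the chunked-transfer upload recommended earlier, using `HttpClient` rather than the article's own code (region, subscription key, and file name are placeholders):

```csharp
// Rough sketch: POST a WAV file to the short-audio endpoint with chunked
// transfer encoding. Region, subscription key, and file name are placeholders.
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class ChunkedUploadSketch
{
    static async Task Main()
    {
        var endpoint = "https://westus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US&format=detailed";

        using var client = new HttpClient();
        using var fs = new FileStream("whatstheweatherlike.wav", FileMode.Open, FileAccess.Read);

        using var request = new HttpRequestMessage(HttpMethod.Post, endpoint);
        request.Headers.Add("Ocp-Apim-Subscription-Key", "YourSubscriptionKey");
        request.Headers.TransferEncodingChunked = true; // stream the audio instead of buffering it
        request.Content = new StreamContent(fs);
        request.Content.Headers.TryAddWithoutValidation(
            "Content-Type", "audio/wav; codecs=audio/pcm; samplerate=16000");

        HttpResponseMessage response = await client.SendAsync(request);
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}
```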