articles/cognitive-services/Speech-Service/rest-speech-to-text.md
+80 -3 (80 additions & 3 deletions)
@@ -3,13 +3,13 @@ title: Speech-to-text API reference (REST) - Speech service
 titleSuffix: Azure Cognitive Services
 description: Learn how to use the speech-to-text REST API. In this article, you'll learn about authorization options, query options, how to structure a request and receive a response.
 services: cognitive-services
-author: trevorbye
+author: yinhew
 manager: nitinme
 ms.service: cognitive-services
 ms.subservice: speech-service
 ms.topic: conceptual
-ms.date: 03/16/2020
-ms.author: trbye
+ms.date: 04/23/2020
+ms.author: yinhew
 ---
 
 # Speech-to-text REST API
@@ -49,6 +49,7 @@ These parameters may be included in the query string of the REST request.
 |`language`| Identifies the spoken language that is being recognized. See [Supported languages](language-support.md#speech-to-text). | Required |
 |`format`| Specifies the result format. Accepted values are `simple` and `detailed`. Simple results include `RecognitionStatus`, `DisplayText`, `Offset`, and `Duration`. Detailed responses include multiple results with confidence values and four different representations. The default setting is `simple`. | Optional |
 |`profanity`| Specifies how to handle profanity in recognition results. Accepted values are `masked`, which replaces profanity with asterisks, `removed`, which removes all profanity from the result, or `raw`, which includes the profanity in the result. The default setting is `masked`. | Optional |
+|`pronunciationScoreParams`| Specifies the parameters for showing pronunciation scores in recognition results. These scores assess the pronunciation quality of the speech input, with indicators such as accuracy, fluency, and completeness. This parameter is a Base64-encoded JSON string containing multiple detailed parameters. See [Pronunciation assessment parameters](#pronunciation-assessment-parameters) for how to build this parameter. | Optional |
 |`cid`| When using the [Custom Speech portal](how-to-custom-speech.md) to create custom models, you can use custom models via their **Endpoint ID** found on the **Deployment** page. Use the **Endpoint ID** as the argument to the `cid` query string parameter. | Optional |
 
 ## Request headers
@@ -76,6 +77,38 @@ Audio is sent in the body of the HTTP `POST` request. It must be in one of the f
 >[!NOTE]
 >The above formats are supported through REST API and WebSocket in the Speech service. The [Speech SDK](speech-sdk.md) currently supports the WAV format with PCM codec as well as [other formats](how-to-use-codec-compressed-audio-input-streams.md).
 
+## Pronunciation assessment parameters
+
+This table lists required and optional parameters for pronunciation assessment.
+
+| Parameter | Description | Required / Optional |
+|-----------|-------------|---------------------|
+| ReferenceText | The text that the pronunciation will be evaluated against. | Required |
+| GradingSystem | The point system for score calibration. Accepted values are `FivePoint` and `HundredMark`. The default setting is `FivePoint`. | Optional |
+| Granularity | The evaluation granularity. Accepted values are `Phoneme`, which shows the score at the full-text, word, and phoneme levels; `Word`, which shows the score at the full-text and word levels; and `FullText`, which shows the score at the full-text level only. The default setting is `Phoneme`. | Optional |
+| Dimension | Defines the output criteria. Accepted values are `Basic`, which shows the accuracy score only, and `Comprehensive`, which shows scores on more dimensions (for example, fluency and completeness scores at the full-text level, and error type at the word level). See [Response parameters](#response-parameters) for definitions of the score dimensions and word error types. The default setting is `Basic`. | Optional |
+| EnableMiscue | Enables miscue calculation. When enabled, the pronounced words are compared to the reference text and marked as omissions or insertions based on the comparison. Accepted values are `False` and `True`. The default setting is `False`. | Optional |
+| ScenarioId | A GUID indicating a customized point system. | Optional |
+
+Below is an example JSON containing the pronunciation assessment parameters:
+
+```json
+{
+  "ReferenceText": "Good morning.",
+  "GradingSystem": "HundredMark",
+  "Granularity": "FullText",
+  "Dimension": "Comprehensive"
+}
+```
+
+The following sample code shows how to build the pronunciation assessment parameters into the URL query parameter:
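A minimal Python sketch of this step: it serializes the parameters to JSON, Base64-encodes the result, and appends it to the query string as `pronunciationScoreParams`. The base URL (region `westus` and path), the `en-US` language, the choice of `format=detailed`, and the helper names are assumptions for illustration and are not taken from this diff.

```python
import base64
import json
from urllib.parse import urlencode

# Assumed endpoint for illustration only; substitute your own region and path.
BASE_URL = (
    "https://westus.stt.speech.microsoft.com"
    "/speech/recognition/conversation/cognitiveservices/v1"
)


def encode_pronunciation_params(params: dict) -> str:
    """Serialize the assessment parameters to JSON, then Base64-encode them."""
    params_json = json.dumps(params)
    return base64.b64encode(params_json.encode("utf-8")).decode("ascii")


def build_request_url(base_url: str, language: str, params: dict) -> str:
    """Build the request URL with the encoded parameters in the query string."""
    query = urlencode({
        "language": language,
        # Assumption: detailed output is requested so per-result scores appear.
        "format": "detailed",
        # urlencode percent-escapes Base64 characters such as '+', '/' and '='.
        "pronunciationScoreParams": encode_pronunciation_params(params),
    })
    return f"{base_url}?{query}"


assessment_params = {
    "ReferenceText": "Good morning.",
    "GradingSystem": "HundredMark",
    "Granularity": "FullText",
    "Dimension": "Comprehensive",
}

url = build_request_url(BASE_URL, "en-US", assessment_params)
print(url)
```

Letting `urlencode` do the escaping matters here because standard Base64 output can contain `+`, `/`, and `=`, all of which have special meaning in a query string.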
The sample below includes the hostname and required headers. Note that the service also expects audio data, which is not included in this sample. As mentioned earlier, chunking is recommended but not required.
@@ -173,6 +206,11 @@ Each object in the `NBest` list includes:
 |`ITN`| The inverse-text-normalized ("canonical") form of the recognized text, with phone numbers, numbers, abbreviations ("doctor smith" to "dr smith"), and other transformations applied. |
 |`MaskedITN`| The ITN form with profanity masking applied, if requested. |
 |`Display`| The display form of the recognized text, with punctuation and capitalization added. This parameter is the same as `DisplayText` provided when format is set to `simple`. |
+|`AccuracyScore`| The score indicating the pronunciation accuracy of the given speech. |
+|`FluencyScore`| The score indicating the fluency of the given speech. |
+|`CompletenessScore`| The score indicating the completeness of the given speech, calculated as the ratio of pronounced words to the entire input. |
+|`PronScore`| The overall score indicating the pronunciation quality of the given speech. This is a weighted combination of `AccuracyScore`, `FluencyScore`, and `CompletenessScore`. |
+|`ErrorType`| Indicates whether a word is omitted, inserted, or badly pronounced compared to `ReferenceText`. Possible values are `None` (meaning no error on this word), `Omission`, `Insertion`, and `Mispronunciation`. |
 
 ## Sample responses
 
@@ -213,6 +251,45 @@ A typical response for `detailed` recognition:
 }
 ```
 
+A typical response for recognition with pronunciation assessment:
+
+```json
+{
+  "RecognitionStatus": "Success",
+  "Offset": "400000",
+  "Duration": "11000000",
+  "NBest": [
+    {
+      "Confidence" : "0.87",
+      "Lexical" : "good morning",
+      "ITN" : "good morning",
+      "MaskedITN" : "good morning",
+      "Display" : "Good morning.",
+      "PronScore" : 84.4,
+      "AccuracyScore" : 100.0,
+      "FluencyScore" : 74.0,
+      "CompletenessScore" : 100.0,
+      "Words": [
+        {
+          "Word" : "Good",
+          "AccuracyScore" : 100.0,
+          "ErrorType" : "None",
+          "Offset" : 500000,
+          "Duration" : 2700000
+        },
+        {
+          "Word" : "morning",
+          "AccuracyScore" : 100.0,
+          "ErrorType" : "None",
+          "Offset" : 5300000,
+          "Duration" : 900000
+        }
+      ]
+    }
+  ]
+}
+```
+
 ## Next steps
 
 - [Get your Speech trial subscription](https://azure.microsoft.com/try/cognitive-services/)
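
As a rough illustration of how a client might read the pronunciation assessment response added in this change, the sketch below parses a trimmed copy of the sample JSON and collects the overall scores plus any words whose `ErrorType` is not `None`. Treating the first `NBest` entry as the result to report, and the helper name `summarize_assessment`, are assumptions made for this example.

```python
import json

# Trimmed copy of the sample response from the diff above.
SAMPLE_RESPONSE = """
{
  "RecognitionStatus": "Success",
  "NBest": [
    {
      "Display": "Good morning.",
      "PronScore": 84.4,
      "AccuracyScore": 100.0,
      "FluencyScore": 74.0,
      "CompletenessScore": 100.0,
      "Words": [
        {"Word": "Good", "AccuracyScore": 100.0, "ErrorType": "None"},
        {"Word": "morning", "AccuracyScore": 100.0, "ErrorType": "None"}
      ]
    }
  ]
}
"""


def summarize_assessment(response_text: str):
    """Return overall scores and flagged words, or None if recognition failed."""
    response = json.loads(response_text)
    if response.get("RecognitionStatus") != "Success":
        return None
    # Assumption: the first NBest entry is the one to report.
    best = response["NBest"][0]
    flagged = [
        (word["Word"], word["ErrorType"])
        for word in best.get("Words", [])
        if word.get("ErrorType", "None") != "None"
    ]
    return {
        "display": best["Display"],
        "pron_score": best["PronScore"],
        "accuracy": best["AccuracyScore"],
        # Fluency and completeness are only present with the Comprehensive dimension.
        "fluency": best.get("FluencyScore"),
        "completeness": best.get("CompletenessScore"),
        "flagged_words": flagged,
    }


summary = summarize_assessment(SAMPLE_RESPONSE)
print(summary)
```

Using `.get()` for `FluencyScore`, `CompletenessScore`, and `ErrorType` keeps the helper usable when `Dimension` is `Basic`, where the parameter table says only the accuracy score is returned.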