Commit cb7c8be ("h2 clarity etc")
1 parent: b5d219c

1 file changed (+108, -108 lines)

articles/cognitive-services/Speech-Service/rest-speech-to-text-short.md

@@ -8,28 +8,24 @@ manager: nitinme
 ms.service: cognitive-services
 ms.subservice: speech-service
 ms.topic: reference
-ms.date: 05/16/2022
+ms.date: 09/25/2022
 ms.author: eur
 ms.devlang: csharp
 ms.custom: devx-track-csharp
 ---
 
 # Speech-to-text REST API for short audio
 
-Use cases for the speech-to-text REST API for short audio are limited. Use it only in cases where you can't use the [Speech SDK](speech-sdk.md). For [Batch transcription](batch-transcription.md) and [Custom Speech](custom-speech-overview.md), you should always use the [Speech to Text REST API](rest-speech-to-text.md).
+Use cases for the speech-to-text REST API for short audio are limited. Use it only in cases where you can't use the [Speech SDK](speech-sdk.md).
 
 Before you use the speech-to-text REST API for short audio, consider the following limitations:
 
-* Requests that use the REST API for short audio and transmit audio directly can contain no more than 60 seconds of audio.
+* Requests that use the REST API for short audio and transmit audio directly can contain no more than 60 seconds of audio. The supported input [audio formats](#audio-formats) are more limited than those of the [Speech SDK](speech-sdk.md).
 * The REST API for short audio returns only final results. It doesn't provide partial results.
 * [Speech translation](speech-translation.md) is not supported via the REST API for short audio. You need to use the [Speech SDK](speech-sdk.md).
+* [Batch transcription](batch-transcription.md) and [Custom Speech](custom-speech-overview.md) are not supported via the REST API for short audio. You should always use the [Speech to Text REST API](rest-speech-to-text.md) for batch transcription and Custom Speech.
 
-> [!TIP]
-> For Azure Government and Azure China endpoints, see [this article about sovereign clouds](sovereign-clouds.md).
-
-[!INCLUDE [](includes/cognitive-services-speech-service-rest-auth.md)]
-
-### Regions and endpoints
+## Regions and endpoints
 
 The endpoint for the REST API for short audio has this format:
 
@@ -39,10 +35,25 @@ https://<REGION_IDENTIFIER>.stt.speech.microsoft.com/speech/recognition/conversa
 
 Replace `<REGION_IDENTIFIER>` with the identifier that matches the [region](regions.md) of your Speech resource.
 
+> [!NOTE]
+> For Azure Government and Azure China endpoints, see [this article about sovereign clouds](sovereign-clouds.md).
+
 > [!NOTE]
 > You must append the language parameter to the URL to avoid receiving a 4xx HTTP error. For example, the language set to US English via the West US endpoint is: `https://westus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US`.
 
-### Query parameters
+## Audio formats
+
+Audio is sent in the body of the HTTP `POST` request. It must be in one of the formats in this table:
+
+| Format | Codec | Bit rate | Sample rate  |
+|--------|-------|----------|--------------|
+| WAV    | PCM   | 256 kbps | 16 kHz, mono |
+| OGG    | OPUS  | 256 kbps | 16 kHz, mono |
+
+> [!NOTE]
+> The preceding formats are supported through the REST API for short audio and WebSocket in the Speech service. The [Speech SDK](speech-sdk.md) supports the WAV format with PCM codec as well as [other formats](how-to-use-codec-compressed-audio-input-streams.md).
+
+## Query parameters
 
 These parameters might be included in the query string of the REST request:
 
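As a quick sanity check of the endpoint format and the required language parameter described above, here's a minimal Python sketch that assembles the full request URL (the region `westus` and the query values are example placeholders):

```python
from urllib.parse import urlencode

# Region identifier from your Speech resource (example value).
region = "westus"

# Query parameters; `language` is required to avoid a 4xx HTTP error.
params = {"language": "en-US", "format": "simple"}

endpoint = (
    f"https://{region}.stt.speech.microsoft.com"
    "/speech/recognition/conversation/cognitiveservices/v1"
    f"?{urlencode(params)}"
)

print(endpoint)
```

Swap in your own region identifier from the [regions](regions.md) list before sending a request.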
@@ -53,7 +64,7 @@ These parameters might be included in the query string of the REST request:
 | `profanity` | Specifies how to handle profanity in recognition results. Accepted values are: <br><br>`masked`, which replaces profanity with asterisks. <br>`removed`, which removes all profanity from the result. <br>`raw`, which includes profanity in the result. <br><br>The default setting is `masked`. | Optional |
 | `cid` | When you're using the [Speech Studio](speech-studio-overview.md) to create [custom models](./custom-speech-overview.md), you can take advantage of the **Endpoint ID** value from the **Deployment** page. Use the **Endpoint ID** value as the argument to the `cid` query string parameter. | Optional |
 
-### Request headers
+## Request headers
 
 This table lists required and optional headers for speech-to-text requests:
 
@@ -67,19 +78,7 @@ This table lists required and optional headers for speech-to-text requests:
 | `Expect` | If you're using chunked transfer, send `Expect: 100-continue`. The Speech service acknowledges the initial request and awaits additional data. | Required if you're sending chunked audio data. |
 | `Accept` | If provided, it must be `application/json`. The Speech service provides results in JSON. Some request frameworks provide an incompatible default value. It's good practice to always include `Accept`. | Optional, but recommended. |
 
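To make the header table concrete, here's a small Python sketch that assembles the common request headers (the `build_headers` helper and the key value are hypothetical; the header names come from the table above):

```python
def build_headers(resource_key: str, chunked: bool = True) -> dict:
    """Assemble request headers for a speech-to-text call."""
    headers = {
        # Required: authenticate with your Speech resource key.
        "Ocp-Apim-Subscription-Key": resource_key,
        # Required: describe the audio codec and sample rate being sent.
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        # Recommended: ask for JSON results explicitly.
        "Accept": "application/json",
    }
    if chunked:
        # For chunked transfer: the service acknowledges, then awaits audio.
        headers["Transfer-Encoding"] = "chunked"
        headers["Expect"] = "100-continue"
    return headers

print(build_headers("YOUR_RESOURCE_KEY"))
```

See the full table for the alternative `Authorization` bearer-token header and the remaining options.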
-### Audio formats
-
-Audio is sent in the body of the HTTP `POST` request. It must be in one of the formats in this table:
-
-| Format | Codec | Bit rate | Sample rate  |
-|--------|-------|----------|--------------|
-| WAV    | PCM   | 256 kbps | 16 kHz, mono |
-| OGG    | OPUS  | 256 kbps | 16 kHz, mono |
-
->[!NOTE]
->The preceding formats are supported through the REST API for short audio and WebSocket in the Speech service. The [Speech SDK](speech-sdk.md) supports the WAV format with PCM codec as well as [other formats](how-to-use-codec-compressed-audio-input-streams.md).
-
-### Pronunciation assessment parameters
+## Pronunciation assessment parameters
 
 This table lists required and optional parameters for pronunciation assessment:
 
@@ -111,12 +110,12 @@ var pronAssessmentParamsBytes = Encoding.UTF8.GetBytes(pronAssessmentParamsJson)
 var pronAssessmentHeader = Convert.ToBase64String(pronAssessmentParamsBytes);
 ```
 
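For comparison with the C# snippet above, the same serialize-then-Base64-encode step can be sketched in Python (the parameter values here are a hypothetical example):

```python
import base64
import json

# Hypothetical pronunciation assessment parameters.
pron_assessment_params = {
    "ReferenceText": "Good morning.",
    "GradingSystem": "HundredMark",
    "Granularity": "FullText",
    "Dimension": "Comprehensive",
}

# Serialize to JSON, then Base64-encode the UTF-8 bytes so the result
# can be sent as the pronunciation assessment request header value.
params_json = json.dumps(pron_assessment_params)
pron_assessment_header = base64.b64encode(params_json.encode("utf-8")).decode("ascii")

print(pron_assessment_header)
```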
-We strongly recommend streaming (chunked) uploading while you're posting the audio data, which can significantly reduce the latency. To learn how to enable streaming, see the [sample code in various programming languages](https://github.com/Azure-Samples/Cognitive-Speech-TTS/tree/master/PronunciationAssessment).
+We strongly recommend streaming ([chunked transfer](#chunked-transfer)) uploading while you're posting the audio data, which can significantly reduce the latency. To learn how to enable streaming, see the [sample code in various programming languages](https://github.com/Azure-Samples/Cognitive-Speech-TTS/tree/master/PronunciationAssessment).
 
->[!NOTE]
+> [!NOTE]
 > For more information, see [pronunciation assessment](how-to-pronunciation-assessment.md).
 
-### Sample request
+## Sample request
 
 The following sample includes the host name and required headers. Note that the service also expects audio data, which isn't included in this sample. As mentioned earlier, chunking is recommended but not required.
 
@@ -148,85 +147,8 @@ The HTTP status code for each response indicates success or common errors.
 | 401 | Unauthorized | A resource key or an authorization token is invalid in the specified region, or an endpoint is invalid. |
 | 403 | Forbidden | A resource key or authorization token is missing. |
 
-### Chunked transfer
-
-Chunked transfer (`Transfer-Encoding: chunked`) can help reduce recognition latency. It allows the Speech service to begin processing the audio file while it's transmitted. The REST API for short audio does not provide partial or interim results.
-
-The following code sample shows how to send audio in chunks. Only the first chunk should contain the audio file's header. `request` is an `HttpWebRequest` object that's connected to the appropriate REST endpoint. `audioFile` is the path to an audio file on disk.
-
-```csharp
-var request = (HttpWebRequest)HttpWebRequest.Create(requestUri);
-request.SendChunked = true;
-request.Accept = @"application/json;text/xml";
-request.Method = "POST";
-request.ProtocolVersion = HttpVersion.Version11;
-request.Host = host;
-request.ContentType = @"audio/wav; codecs=audio/pcm; samplerate=16000";
-request.Headers["Ocp-Apim-Subscription-Key"] = "YOUR_RESOURCE_KEY";
-request.AllowWriteStreamBuffering = false;
-
-using (var fs = new FileStream(audioFile, FileMode.Open, FileAccess.Read))
-{
-    // Open a request stream and write 1,024-byte chunks in the stream one at a time.
-    byte[] buffer = null;
-    int bytesRead = 0;
-    using (var requestStream = request.GetRequestStream())
-    {
-        // Read 1,024 raw bytes from the input audio file.
-        buffer = new Byte[checked((uint)Math.Min(1024, (int)fs.Length))];
-        while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) != 0)
-        {
-            requestStream.Write(buffer, 0, bytesRead);
-        }
-
-        requestStream.Flush();
-    }
-}
-```
-
-### Response parameters
-
-Results are provided as JSON. The `simple` format includes the following top-level fields:
-
-| Parameter | Description |
-|-----------|-------------|
-| `RecognitionStatus` | Status, such as `Success` for successful recognition. See the next table. |
-| `DisplayText` | The recognized text after capitalization, punctuation, inverse text normalization, and profanity masking. Present only on success. Inverse text normalization is conversion of spoken text to shorter forms, such as 200 for "two hundred" or "Dr. Smith" for "doctor smith." |
-| `Offset` | The time (in 100-nanosecond units) at which the recognized speech begins in the audio stream. |
-| `Duration` | The duration (in 100-nanosecond units) of the recognized speech in the audio stream. |
-
-The `RecognitionStatus` field might contain these values:
-
-| Status | Description |
-|--------|-------------|
-| `Success` | The recognition was successful, and the `DisplayText` field is present. |
-| `NoMatch` | Speech was detected in the audio stream, but no words from the target language were matched. This status usually means that the recognition language is different from the language that the user is speaking. |
-| `InitialSilenceTimeout` | The start of the audio stream contained only silence, and the service timed out while waiting for speech. |
-| `BabbleTimeout` | The start of the audio stream contained only noise, and the service timed out while waiting for speech. |
-| `Error` | The recognition service encountered an internal error and could not continue. Try again if possible. |
-
-> [!NOTE]
-> If the audio consists only of profanity, and the `profanity` query parameter is set to `removed`, the service does not return a speech result.
-
-The `detailed` format includes additional forms of recognized results.
-When you're using the `detailed` format, `DisplayText` is provided as `Display` for each result in the `NBest` list.
-
-The object in the `NBest` list can include:
-
-| Parameter | Description |
-|-----------|-------------|
-| `Confidence` | The confidence score of the entry, from 0.0 (no confidence) to 1.0 (full confidence). |
-| `Lexical` | The lexical form of the recognized text: the actual words recognized. |
-| `ITN` | The inverse-text-normalized (ITN) or canonical form of the recognized text, with phone numbers, numbers, abbreviations ("doctor smith" to "dr smith"), and other transformations applied. |
-| `MaskedITN` | The ITN form with profanity masking applied, if requested. |
-| `Display` | The display form of the recognized text, with punctuation and capitalization added. This parameter is the same as what `DisplayText` provides when the format is set to `simple`. |
-| `AccuracyScore` | Pronunciation accuracy of the speech. Accuracy indicates how closely the phonemes match a native speaker's pronunciation. The accuracy score at the word and full-text levels is aggregated from the accuracy score at the phoneme level. |
-| `FluencyScore` | Fluency of the provided speech. Fluency indicates how closely the speech matches a native speaker's use of silent breaks between words. |
-| `CompletenessScore` | Completeness of the speech, determined by calculating the ratio of pronounced words to reference text input. |
-| `PronScore` | Overall score that indicates the pronunciation quality of the provided speech. This score is aggregated from `AccuracyScore`, `FluencyScore`, and `CompletenessScore` with weight. |
-| `ErrorType` | Value that indicates whether a word is omitted, inserted, or badly pronounced, compared to `ReferenceText`. Possible values are `None` (meaning no error on this word), `Omission`, `Insertion`, and `Mispronunciation`. |
 
-### Sample responses
+## Sample responses
 
 Here's a typical response for `simple` recognition:
 
@@ -304,10 +226,88 @@ Here's a typical response for recognition with pronunciation assessment:
 }
 ```
 
+### Response properties
+
+Results are provided as JSON. The `simple` format includes the following top-level fields:
+
+| Property | Description |
+|----------|-------------|
+| `RecognitionStatus` | Status, such as `Success` for successful recognition. See the next table. |
+| `DisplayText` | The recognized text after capitalization, punctuation, inverse text normalization, and profanity masking. Present only on success. Inverse text normalization is conversion of spoken text to shorter forms, such as 200 for "two hundred" or "Dr. Smith" for "doctor smith." |
+| `Offset` | The time (in 100-nanosecond units) at which the recognized speech begins in the audio stream. |
+| `Duration` | The duration (in 100-nanosecond units) of the recognized speech in the audio stream. |
+
+The `RecognitionStatus` field might contain these values:
+
+| Status | Description |
+|--------|-------------|
+| `Success` | The recognition was successful, and the `DisplayText` field is present. |
+| `NoMatch` | Speech was detected in the audio stream, but no words from the target language were matched. This status usually means that the recognition language is different from the language that the user is speaking. |
+| `InitialSilenceTimeout` | The start of the audio stream contained only silence, and the service timed out while waiting for speech. |
+| `BabbleTimeout` | The start of the audio stream contained only noise, and the service timed out while waiting for speech. |
+| `Error` | The recognition service encountered an internal error and could not continue. Try again if possible. |
+
+> [!NOTE]
+> If the audio consists only of profanity, and the `profanity` query parameter is set to `removed`, the service does not return a speech result.
+
+The `detailed` format includes additional forms of recognized results.
+When you're using the `detailed` format, `DisplayText` is provided as `Display` for each result in the `NBest` list.
+
+The object in the `NBest` list can include:
+
+| Property | Description |
+|----------|-------------|
+| `Confidence` | The confidence score of the entry, from 0.0 (no confidence) to 1.0 (full confidence). |
+| `Lexical` | The lexical form of the recognized text: the actual words recognized. |
+| `ITN` | The inverse-text-normalized (ITN) or canonical form of the recognized text, with phone numbers, numbers, abbreviations ("doctor smith" to "dr smith"), and other transformations applied. |
+| `MaskedITN` | The ITN form with profanity masking applied, if requested. |
+| `Display` | The display form of the recognized text, with punctuation and capitalization added. This property is the same as what `DisplayText` provides when the format is set to `simple`. |
+| `AccuracyScore` | Pronunciation accuracy of the speech. Accuracy indicates how closely the phonemes match a native speaker's pronunciation. The accuracy score at the word and full-text levels is aggregated from the accuracy score at the phoneme level. |
+| `FluencyScore` | Fluency of the provided speech. Fluency indicates how closely the speech matches a native speaker's use of silent breaks between words. |
+| `CompletenessScore` | Completeness of the speech, determined by calculating the ratio of pronounced words to reference text input. |
+| `PronScore` | Overall score that indicates the pronunciation quality of the provided speech. This score is aggregated from `AccuracyScore`, `FluencyScore`, and `CompletenessScore` with weight. |
+| `ErrorType` | Value that indicates whether a word is omitted, inserted, or badly pronounced, compared to `ReferenceText`. Possible values are `None` (meaning no error on this word), `Omission`, `Insertion`, and `Mispronunciation`. |
+
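As an illustration of the `simple`-format fields described above, here's a short Python sketch that reads them from a response body (the JSON is a hypothetical example, not captured from the service):

```python
import json

# Hypothetical `simple`-format response body.
body = """
{
  "RecognitionStatus": "Success",
  "DisplayText": "Remind me to buy 5 pencils.",
  "Offset": 1800000,
  "Duration": 49600000
}
"""

result = json.loads(body)

if result["RecognitionStatus"] == "Success":
    # Offset and Duration are in 100-nanosecond ticks; convert to seconds.
    offset_s = result["Offset"] / 10_000_000
    duration_s = result["Duration"] / 10_000_000
    print(f"{result['DisplayText']} (starts at {offset_s:.2f}s, lasts {duration_s:.2f}s)")
```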
+## Chunked transfer
+
+Chunked transfer (`Transfer-Encoding: chunked`) can help reduce recognition latency. It allows the Speech service to begin processing the audio file while it's transmitted. The REST API for short audio does not provide partial or interim results.
+
+The following code sample shows how to send audio in chunks. Only the first chunk should contain the audio file's header. `request` is an `HttpWebRequest` object that's connected to the appropriate REST endpoint. `audioFile` is the path to an audio file on disk.
+
+```csharp
+var request = (HttpWebRequest)HttpWebRequest.Create(requestUri);
+request.SendChunked = true;
+request.Accept = @"application/json;text/xml";
+request.Method = "POST";
+request.ProtocolVersion = HttpVersion.Version11;
+request.Host = host;
+request.ContentType = @"audio/wav; codecs=audio/pcm; samplerate=16000";
+request.Headers["Ocp-Apim-Subscription-Key"] = "YOUR_RESOURCE_KEY";
+request.AllowWriteStreamBuffering = false;
+
+using (var fs = new FileStream(audioFile, FileMode.Open, FileAccess.Read))
+{
+    // Open a request stream and write 1,024-byte chunks in the stream one at a time.
+    byte[] buffer = null;
+    int bytesRead = 0;
+    using (var requestStream = request.GetRequestStream())
+    {
+        // Read 1,024 raw bytes from the input audio file.
+        buffer = new Byte[checked((uint)Math.Min(1024, (int)fs.Length))];
+        while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) != 0)
+        {
+            requestStream.Write(buffer, 0, bytesRead);
+        }
+
+        requestStream.Flush();
+    }
+}
+```
+
+[!INCLUDE [](includes/cognitive-services-speech-service-rest-auth.md)]
+
 ## Next steps
 
-- [Create a free Azure account](https://azure.microsoft.com/free/cognitive-services/)
-- [Customize acoustic models](./how-to-custom-speech-train-model.md)
-- [Customize language models](./how-to-custom-speech-train-model.md)
+- [Customize speech models](./how-to-custom-speech-train-model.md)
 - [Get familiar with batch transcription](batch-transcription.md)