Commit 832e11c

committed
fast transcription API preview
1 parent 252efdd commit 832e11c

6 files changed

+286
-2
lines changed

Lines changed: 244 additions & 0 deletions
@@ -0,0 +1,244 @@
1+
---
2+
title: Use the fast transcription API - Speech service
3+
titleSuffix: Azure AI services
4+
description: Learn how to use Azure AI Speech for fast transcriptions, where you submit audio and get the transcription results much faster than real-time audio.
5+
manager: nitinme
6+
author: eric-urban
7+
ms.author: eur
8+
ms.service: azure-ai-speech
9+
ms.topic: how-to
10+
ms.date: 4/15/2024
11+
# Customer intent: As a user who implements audio transcription, I want to create transcriptions as quickly as possible.
12+
---
13+
14+
# Use the fast transcription API (preview) with Azure AI Speech
15+
16+
[!INCLUDE [Feature preview](./includes/previews/preview-feature.md)]
17+
18+
The fast transcription API transcribes audio files and returns results synchronously, much faster than real-time audio. Use fast transcription in scenarios where you need the transcript of an audio recording as quickly as possible with predictable latency, such as:
19+
20+
- Quick audio or video transcription, subtitles, and editing.
21+
- Video dubbing
22+
23+
> [!NOTE]
24+
> Fast transcription API is only available via the speech to text REST API version 3.3.
25+
26+
## Prerequisites
27+
28+
- A Speech resource in one of the regions where the fast transcription API is available. The supported regions are: Australia East, Brazil South, Central India, East US, East US 2, Japan East, North Central US, North Europe, South Central US, Southeast Asia, Sweden Central, West Europe, West US, and West US 2. For more information about regions supported for other Speech service features, see [Speech service regions](./speech-service-regions.md).
29+
- An audio file (less than 2 hours long and less than 200 MB in size) in one of the supported formats and codecs: WAV, MP3, OPUS/OGG, FLAC, WMA, AAC, ALAW in WAV container, MULAW in WAV container, AMR, WebM, M4A, and SPEEX.
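
The size and format limits above can be checked locally before uploading. The following Python sketch is illustrative only: the helper name `check_audio` is hypothetical, the extension list is an assumed mapping from the format names above, and "200 MB" is read as 200 × 1024 × 1024 bytes, which this article doesn't specify. It doesn't check the 2-hour duration limit, because that requires decoding the audio.

```python
from pathlib import PurePath

# Assumption: file extensions commonly used for the formats listed above.
SUPPORTED_EXTENSIONS = {
    ".wav", ".mp3", ".opus", ".ogg", ".flac", ".wma",
    ".aac", ".amr", ".webm", ".m4a", ".spx",
}
# Assumption: "200 MB" is interpreted as 200 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 200 * 1024 * 1024


def check_audio(file_name: str, size_bytes: int) -> list[str]:
    """Return the problems found; an empty list means none were detected."""
    problems = []
    extension = PurePath(file_name).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unrecognized extension: {extension or '(none)'}")
    if size_bytes >= MAX_SIZE_BYTES:
        problems.append(
            f"file is {size_bytes} bytes; it must be smaller than {MAX_SIZE_BYTES}"
        )
    return problems


if __name__ == "__main__":
    print(check_audio("call.wav", 5_000_000))   # no problems expected
    print(check_audio("notes.txt", 5_000_000))  # flags the extension
```

A check like this only catches obvious mistakes early; the service remains the authority on what it accepts.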
30+
31+
## Use the fast transcription API
32+
33+
Construct the request body according to the following instructions:
34+
35+
- Set the required `inputLocales` property. This value should match the expected locale of the audio data to transcribe. The supported locales are: en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.
36+
- Optionally, set the `wordLevelTimestampsEnabled` property to `true` to enable word-level timestamps in the transcription results. The default value is `false`.
37+
- Optionally, set the `profanityFilterMode` property to specify how to handle profanity in recognition results. Accepted values are `None` to disable profanity filtering, `Masked` to replace profanity with asterisks, `Removed` to remove all profanity from the result, or `Tags` to add profanity tags. The default value is `Masked`.
38+
- Optionally, set the `channels` property to specify which audio channels to transcribe. The value is a list of zero-based channel indexes. The possible values in the `channels` list are 0 and 1. The default value is `[0,1]`. If the audio file contains multiple channels, list each channel that you want transcribed.
39+
40+
> [!NOTE]
41+
> The `wordLevelTimestampsEnabled`, `profanityFilterMode`, and `channels` properties work the same way as in the [batch transcription API](./batch-transcription.md).
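
The property rules above can be captured in a small helper that builds the JSON for the `definition` form field. This is a minimal Python sketch, not an official client: the function name `build_definition` and its parameter names are hypothetical, and the allowed values are copied from the list above.

```python
import json

# Allowed values copied from the property descriptions above.
SUPPORTED_LOCALES = {
    "en-US", "es-ES", "es-MX", "fr-FR", "hi-IN",
    "it-IT", "ja-JP", "ko-KR", "pt-BR", "zh-CN",
}
PROFANITY_MODES = {"None", "Masked", "Removed", "Tags"}


def build_definition(input_locales, word_level_timestamps=False,
                     profanity_filter_mode="Masked", channels=(0, 1)) -> str:
    """Return the JSON string for the `definition` form field (hypothetical helper)."""
    unknown = [loc for loc in input_locales if loc not in SUPPORTED_LOCALES]
    if not input_locales or unknown:
        raise ValueError(f"inputLocales must list supported locales; got {list(input_locales)}")
    if profanity_filter_mode not in PROFANITY_MODES:
        raise ValueError(f"unsupported profanityFilterMode: {profanity_filter_mode}")
    if any(channel not in (0, 1) for channel in channels):
        raise ValueError(f"channels may only contain 0 and 1; got {list(channels)}")
    return json.dumps({
        "inputLocales": list(input_locales),
        "wordLevelTimestampsEnabled": word_level_timestamps,
        "profanityFilterMode": profanity_filter_mode,
        "channels": list(channels),
    })


if __name__ == "__main__":
    print(build_definition(["en-US"], word_level_timestamps=True))
```

Validating on the client side is a convenience; the defaults used here mirror the documented defaults (`false`, `Masked`, and `[0,1]`).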
42+
43+
Make a multipart/form-data POST request to the `syncTranscriptions` endpoint with the audio file and the request body properties. The following example shows how to create a transcription using the fast transcription API.
44+
45+
- Replace `YourSubscriptionKey` with your Speech resource key.
46+
- Replace `YourServiceRegion` with your Speech resource region.
47+
- Replace `YourAudioFile` with the path to your audio file.
48+
- Set the form definition properties as previously described.
49+
50+
```azurecli-interactive
51+
curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/v3.3/syncTranscriptions' \
52+
--header 'Content-Type: multipart/form-data' \
53+
--header 'Accept: application/json' \
54+
--header 'Ocp-Apim-Subscription-Key: YourSubscriptionKey' \
55+
--form 'audio=@"YourAudioFile"' \
56+
--form 'definition="{
57+
\"inputLocales\":[\"en-US\"],
58+
\"wordLevelTimestampsEnabled\":true,
59+
\"profanityFilterMode\": \"Masked\",
60+
\"channels\": [0,1]}"'
61+
```
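
For callers who prefer not to shell out to curl, the same request can be assembled with the Python standard library. The sketch below mirrors the URL, headers, and form field names from the curl command. The helper names are hypothetical, and sending the audio part as `application/octet-stream` is an assumption based on curl's default for file fields.

```python
import json
import urllib.request
import uuid


def build_multipart(audio_name: str, audio_bytes: bytes, definition: dict):
    """Encode the `audio` and `definition` fields as multipart/form-data."""
    boundary = uuid.uuid4().hex
    audio_head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio"; filename="{audio_name}"\r\n'
        # Assumption: same part type that curl uses by default for files.
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    definition_head = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="definition"\r\n\r\n'
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = (audio_head + audio_bytes + definition_head
            + json.dumps(definition).encode("utf-8") + tail)
    return body, f"multipart/form-data; boundary={boundary}"


def build_request(region: str, key: str, audio_name: str,
                  audio_bytes: bytes, definition: dict) -> urllib.request.Request:
    """Build (but don't send) the POST request shown in the curl example."""
    url = (f"https://{region}.api.cognitive.microsoft.com"
           "/speechtotext/v3.3/syncTranscriptions")
    body, content_type = build_multipart(audio_name, audio_bytes, definition)
    return urllib.request.Request(url, data=body, method="POST", headers={
        "Content-Type": content_type,
        "Accept": "application/json",
        "Ocp-Apim-Subscription-Key": key,
    })
```

To send the request, pass the result to `urllib.request.urlopen` and decode the response body with `json.load`. Building the request separately from sending it keeps the encoding easy to inspect and test without network access.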
62+
63+
The response will include `timestamp`, `durationInTicks`, `duration`, and more.
64+
- The `combinedRecognizedPhrases` property contains the full transcriptions for each channel separately. For example, everything the first speaker said is in the first element of the `combinedRecognizedPhrases` array, and everything the second speaker said is in the second element of the array.
65+
- Because `wordLevelTimestampsEnabled` is set to `true` in the request, the response includes word-level timestamps.
66+
67+
68+
```json
69+
{
70+
"timestamp": "2024-04-26T06:14:26.3605217Z",
71+
"durationInTicks": 1850790625,
72+
"duration": "PT3M5.0790625S",
73+
"combinedRecognizedPhrases": [
74+
{
75+
"channel": 0,
76+
"display": "Hello. Thank you for calling Contoso. Who am I speaking with today? Hi, Mary. Are you calling because you need health insurance? Great. If you can answer a few questions, we can get you signed up in the Jiffy. So what's your full name? Got it. And what's the best callback number in case we get disconnected? Yep, that'll be fine. Got it. So to confirm, it's 234-554-9312. Excellent. Let's get some additional information for your application. Do you have a job? OK, so then you have a Social Security number as well. OK, and what is your Social Security number please? Sorry, what was that, a 25 or a 225? You cut out for a bit. Alright, thank you so much. And could I have your e-mail address please? Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Uh Yes, of course. So the default is a digital membership card, but we can send you a physical card if you prefer. Uh, yeah. Absolutely. I've made a note on your file. You're very welcome. Thank you for calling Contoso and have a great day."
77+
},
78+
{
79+
"channel": 1,
80+
"display": "Hi, my name is Mary Rondo. I'm trying to enroll myself with Contuso. Yes, yeah, I'm calling to sign up for insurance. Okay. So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. I only have a cell phone so I can give you that. Sure, so it's 234-554 and then 9312. Yep, that's right. Uh Yes, I am self-employed. Yes, I do. Uh Sure, so it's 412256789. It's double two, so 412, then another two, then five. Yeah, it's [email protected]. So my first and last name at gmail.com. No periods, no dashes. That was quick. Thank you. Actually, so I have one more question. I'm curious, will I be getting a physical card as proof of coverage? uh Yes. Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? So it's 2660 Unit A on Maple Avenue SE, Lansing, and then zip code is 48823. Awesome. Thanks so much."
81+
}
82+
],
83+
"recognizedPhrases": [
84+
{
85+
"recognitionStatus": "Success",
86+
"channel": 0,
87+
"offset": "PT0.72S",
88+
"duration": "PT0.48S",
89+
"offsetInTicks": 7200000,
90+
"durationInTicks": 4800000,
91+
"nBest": [
92+
{
93+
"confidence": 0.9177142,
94+
"display": "Hello.",
95+
"displayWords": [
96+
{
97+
"displayText": "Hello.",
98+
"offset": "PT0.72S",
99+
"duration": "PT0.48S",
100+
"offsetInTicks": 7200000,
101+
"durationInTicks": 4800000
102+
}
103+
]
104+
}
105+
]
106+
},
107+
{
108+
"recognitionStatus": "Success",
109+
"channel": 0,
110+
"offset": "PT1.2S",
111+
"duration": "PT1.12S",
112+
"offsetInTicks": 12000000,
113+
"durationInTicks": 11200000,
114+
"nBest": [
115+
{
116+
"confidence": 0.9177142,
117+
"display": "Thank you for calling Contoso.",
118+
"displayWords": [
119+
{
120+
"displayText": "Thank",
121+
"offset": "PT1.2S",
122+
"duration": "PT0.2S",
123+
"offsetInTicks": 12000000,
124+
"durationInTicks": 2000000
125+
},
126+
{
127+
"displayText": "you",
128+
"offset": "PT1.4S",
129+
"duration": "PT0.08S",
130+
"offsetInTicks": 14000000,
131+
"durationInTicks": 800000
132+
},
133+
{
134+
"displayText": "for",
135+
"offset": "PT1.48S",
136+
"duration": "PT0.12S",
137+
"offsetInTicks": 14800000,
138+
"durationInTicks": 1200000
139+
},
140+
{
141+
"displayText": "calling",
142+
"offset": "PT1.6S",
143+
"duration": "PT0.24S",
144+
"offsetInTicks": 16000000,
145+
"durationInTicks": 2400000
146+
},
147+
{
148+
"displayText": "Contoso.",
149+
"offset": "PT1.84S",
150+
"duration": "PT0.48S",
151+
"offsetInTicks": 18400000,
152+
"durationInTicks": 4800000
153+
}
154+
]
155+
}
156+
]
157+
},
158+
// More transcription results removed for brevity
159+
// {...},
160+
{
161+
"recognitionStatus": "Success",
162+
"channel": 1,
163+
"offset": "PT2M59.88S",
164+
"duration": "PT0.6S",
165+
"offsetInTicks": 1798800000,
166+
"durationInTicks": 6000000,
167+
"nBest": [
168+
{
169+
"confidence": 0.90407056,
170+
"display": "Thanks so much.",
171+
"displayWords": [
172+
{
173+
"displayText": "Thanks",
174+
"offset": "PT2M59.88S",
175+
"duration": "PT0.2S",
176+
"offsetInTicks": 1798800000,
177+
"durationInTicks": 2000000
178+
},
179+
{
180+
"displayText": "so",
181+
"offset": "PT3M0.08S",
182+
"duration": "PT0.08S",
183+
"offsetInTicks": 1800800000,
184+
"durationInTicks": 800000
185+
},
186+
{
187+
"displayText": "much.",
188+
"offset": "PT3M0.16S",
189+
"duration": "PT0.32S",
190+
"offsetInTicks": 1801600000,
191+
"durationInTicks": 3200000
192+
}
193+
]
194+
}
195+
]
196+
}
197+
]
198+
}
199+
```
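
A response shaped like the sample above can be reduced to per-channel text and word timings with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the first `nBest` entry is treated as the preferred hypothesis, and one tick is taken to be 100 nanoseconds, which is inferred from the sample (`offsetInTicks` of 7200000 alongside an `offset` of `PT0.72S`). The sample JSON contains `//` comments, so remove those before parsing it.

```python
TICKS_PER_SECOND = 10_000_000  # inferred from the sample: 7200000 ticks == 0.72 s


def ticks_to_seconds(ticks: int) -> float:
    """Convert 100-nanosecond ticks to seconds."""
    return ticks / TICKS_PER_SECOND


def text_by_channel(response: dict) -> dict:
    """Map each channel number to its combined display text."""
    return {
        item["channel"]: item["display"]
        for item in response.get("combinedRecognizedPhrases", [])
    }


def word_timings(response: dict) -> list:
    """Return (channel, word, start_seconds, end_seconds) tuples."""
    timings = []
    for phrase in response.get("recognizedPhrases", []):
        if phrase.get("recognitionStatus") != "Success" or not phrase.get("nBest"):
            continue
        best = phrase["nBest"][0]  # assumption: first entry is the preferred one
        for word in best.get("displayWords", []):
            start = ticks_to_seconds(word["offsetInTicks"])
            end = start + ticks_to_seconds(word["durationInTicks"])
            timings.append((phrase["channel"], word["displayText"], start, end))
    return timings


if __name__ == "__main__":
    # A trimmed-down response with the same shape as the sample above.
    sample = {
        "combinedRecognizedPhrases": [{"channel": 0, "display": "Hello."}],
        "recognizedPhrases": [{
            "recognitionStatus": "Success",
            "channel": 0,
            "nBest": [{"display": "Hello.", "displayWords": [{
                "displayText": "Hello.",
                "offsetInTicks": 7200000,
                "durationInTicks": 4800000,
            }]}],
        }],
    }
    print(text_by_channel(sample))
    print(word_timings(sample))
```

Tuples of start and end times in seconds map directly onto subtitle cues, which is one of the scenarios this API targets.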
200+
201+
## Compare with the real-time API
202+
203+
You can compare transcription results with the [speech to text real-time API](./rest-speech-to-text-short.md).
204+
- The real-time API is limited to 60 seconds of audio. The fast transcription API is designed for longer audio files and returns results much faster than real-time audio.
205+
- The real-time API doesn't support channel separation. The fast transcription API supports channel separation and returns results for each channel separately.
206+
207+
Here's an example request:
208+
209+
- Replace `YourSubscriptionKey` with your Speech resource key.
210+
- Replace `YourServiceRegion` with your Speech resource region.
211+
- Replace `YourAudioFile` with the path to your audio file.
212+
213+
```azurecli-interactive
214+
curl --location --request POST \
215+
"https://YourServiceRegion.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US&format=detailed" \
216+
--header "Ocp-Apim-Subscription-Key: YourSubscriptionKey" \
217+
--header "Content-Type: audio/wav" \
218+
--data-binary @YourAudioFile
219+
```
220+
221+
Here's an example transcription response using the [speech to text real-time API](./rest-speech-to-text-short.md). Only the first 60 seconds of the provided audio file are transcribed to text.
222+
223+
```json
224+
{
225+
"RecognitionStatus": "Success",
226+
"Offset": 7500000,
227+
"Duration": 538000000,
228+
"NBest": [
229+
{
230+
"Confidence": 0.8452396,
231+
"Lexical": "hello thank you for calling contoso who am i speaking with today hi my name is mary rondo i'm trying to enroll myself with contoso hi mary uh are you calling because you need health insurance yes yeah i'm calling to sign up for insurance great uh if you can answer a few questions we can get you signed up in the jiffy OK so what's your full name so mary beth rondo last name is R like romeo O like ocean N like nancy DD like dog and O like ocean again rondo got it and what's the best callback number in case we get disconnected i only have a cell phone so i can give you that yeah that'll be fine sure so it's two three four five five four and then nine three one two got it so to confirm it's two three four five five four nine three one two",
232+
"ITN": "hello thank you for calling contoso who am i speaking with today hi my name is mary rondo i'm trying to enroll myself with contoso hi mary uh are you calling because you need health insurance yes yeah i'm calling to sign up for insurance great uh if you can answer a few questions we can get you signed up in the jiffy OK so what's your full name so mary beth rondo last name is R like romeo O like ocean N like nancy DD like dog and O like ocean again rondo got it and what's the best callback number in case we get disconnected i only have a cell phone so i can give you that yeah that'll be fine sure so it's 234554 and then 9312 got it so to confirm it's 234-554-9312",
233+
"MaskedITN": "hello thank you for calling contoso who am i speaking with today hi my name is mary rondo i'm trying to enroll myself with contoso hi mary uh are you calling because you need health insurance yes yeah i'm calling to sign up for insurance great uh if you can answer a few questions we can get you signed up in the jiffy ok so what's your full name so mary beth rondo last name is r like romeo o like ocean n like nancy dd like dog and o like ocean again rondo got it and what's the best callback number in case we get disconnected i only have a cell phone so i can give you that yeah that'll be fine sure so it's 234554 and then 9312 got it so to confirm it's 234-554-9312",
234+
"Display": "Hello. Thank you for calling Contoso. Who am I speaking with today? Hi, my name is Mary Rondo. I'm trying to enroll myself with Contoso. Hi, Mary. Uh, are you calling because you need health insurance? Yes. Yeah, I'm calling to sign up for insurance. Great. Uh, if you can answer a few questions, we can get you signed up in the jiffy. OK. So what's your full name? So Mary Beth Rondo last name is R like Romeo, O like Ocean, N like Nancy, DD like Dog, and O like Ocean. Again, Rondo got it. And what's the best callback number in case we get disconnected? I only have a cell phone, so I can give you that. Yeah, that'll be fine. Sure. So it's 234554 and then 9312. Got it. So to confirm, it's 234-554-9312."
235+
}
236+
],
237+
"DisplayText": "Hello. Thank you for calling Contoso. Who am I speaking with today? Hi, my name is Mary Rondo. I'm trying to enroll myself with Contoso. Hi, Mary. Uh, are you calling because you need health insurance? Yes. Yeah, I'm calling to sign up for insurance. Great. Uh, if you can answer a few questions, we can get you signed up in the jiffy. OK.So what's your full name?So Mary Beth Rondo last name is R like Romeo, O like Ocean, N like Nancy, DD like Dog, and O like Ocean. Again, Rondo got it. And what's the best callback number in case we get disconnected?I only have a cell phone, so I can give you that. Yeah, that'll be fine. Sure. So it's 234554 and then 9312. Got it. So to confirm, it's 234-554-9312."
238+
}
239+
```
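
To see how much of the audio the real-time response actually covers, its `Offset` and `Duration` values (expressed in 100-nanosecond units) can be converted to seconds. The following Python sketch uses a hypothetical helper name and picks the `NBest` entry with the highest `Confidence`. For the sample above it yields an end time of about 54.55 seconds, which is within the 60-second limit.

```python
TICKS_PER_SECOND = 10_000_000  # Offset and Duration are in 100-nanosecond units


def summarize_short_audio_result(result: dict) -> dict:
    """Summarize a real-time (short audio) response like the sample above."""
    start = result["Offset"] / TICKS_PER_SECOND
    end = (result["Offset"] + result["Duration"]) / TICKS_PER_SECOND
    # Pick the hypothesis with the highest confidence, if any are present.
    best = max(result.get("NBest", []), key=lambda h: h["Confidence"], default=None)
    return {
        "status": result["RecognitionStatus"],
        "start_seconds": start,
        "end_seconds": end,
        "display": best["Display"] if best else result.get("DisplayText", ""),
    }


if __name__ == "__main__":
    # Values copied from the sample response above; the text is shortened.
    sample = {
        "RecognitionStatus": "Success",
        "Offset": 7500000,
        "Duration": 538000000,
        "NBest": [{"Confidence": 0.8452396, "Display": "Hello. Thank you for calling Contoso."}],
        "DisplayText": "Hello. Thank you for calling Contoso.",
    }
    print(summarize_short_audio_result(sample))
```

Comparing `end_seconds` here with the roughly 185-second `duration` in the fast transcription response makes the truncation of the real-time result explicit.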
240+
241+
## Related content
242+
243+
- [Speech to text real-time API for short audio](./rest-speech-to-text-short.md)
244+
- [Batch transcription API](./batch-transcription.md)
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
---
2+
title: include file
3+
description: include file
4+
author: eric-urban
5+
ms.author: eur
6+
ms.service: azure-ai-speech
7+
ms.topic: include
8+
ms.date: 1/10/2024
9+
ms.custom: include
10+
---
11+
12+
> [!NOTE]
13+
> This feature is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

articles/ai-services/speech-service/includes/release-notes/release-notes-stt.md

Lines changed: 6 additions & 0 deletions
@@ -6,6 +6,12 @@ ms.date: 3/13/2024
66
ms.author: eur
77
---
88

9+
### May 2024 release
10+
11+
#### Fast transcription API (Preview)
12+
13+
The fast transcription API is now available in preview. It transcribes audio files and returns results synchronously, much faster than real-time audio. For more information, see the [fast transcription API guide](../../fast-transcription.md).
14+
915
### April 2024 release
1016

1117
#### Real-time speech to text with diarization (GA)

articles/ai-services/speech-service/speech-services-quotas-and-limits.md

Lines changed: 7 additions & 0 deletions
@@ -42,6 +42,13 @@ You can use real-time speech to text with the [Speech SDK](speech-sdk.md) or the
4242
| Concurrent request limit - custom endpoint | 1 <br/><br/>This limit isn't adjustable. | 100 (default value)<br/><br/>The rate is adjustable for Standard (S0) resources. See [more explanations](#detailed-description-quota-adjustment-and-best-practices), [best practices](#general-best-practices-to-mitigate-throttling-during-autoscaling), and [adjustment instructions](#speech-to-text-increase-real-time-speech-to-text-concurrent-request-limit). |
4343
| Max audio length for [real-time diarization](./get-started-stt-diarization.md). | N/A | 240 minutes per file |
4444

45+
#### Fast transcription
46+
47+
| Quota | Free (F0) | Standard (S0) |
48+
|-----|-----|-----|
49+
| Max audio input file size | N/A | 200 MB |
50+
| Max audio length | N/A | 120 minutes per file |
51+
4552
#### Batch transcription
4653

4754
| Quota | Free (F0) | Standard (S0) |

articles/ai-services/speech-service/speech-to-text.md

Lines changed: 13 additions & 1 deletion
@@ -31,7 +31,19 @@ With real-time speech to text, the audio is transcribed as speech is recognized
3131

3232
Real-time speech to text is available via the [Speech SDK](speech-sdk.md) and the [Speech CLI](spx-overview.md).
3333

34-
## Batch transcription
34+
## Fast transcription API (Preview)
35+
36+
The fast transcription API transcribes audio files and returns results synchronously, much faster than real-time audio. Use fast transcription in scenarios where you need the transcript of an audio recording as quickly as possible with predictable latency, such as:
37+
38+
- Quick audio or video transcription, subtitles, and editing.
39+
- Video dubbing
40+
41+
> [!NOTE]
42+
> Fast transcription API is only available via the speech to text REST API version 3.3.
43+
44+
To get started with fast transcription, see [use the fast transcription API (preview)](fast-transcription-create.md).
45+
46+
## Batch transcription API
3547

3648
[Batch transcription](batch-transcription.md) is used to transcribe a large amount of audio in storage. You can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcription results. Use batch transcription for applications that need to transcribe audio in bulk such as:
3749
- Transcriptions, captions, or subtitles for prerecorded audio

articles/ai-services/speech-service/toc.yml

Lines changed: 3 additions & 1 deletion
@@ -65,7 +65,9 @@ items:
6565
href: get-speech-recognition-results.md
6666
- name: Real-time diarization quickstart
6767
href: get-started-stt-diarization.md
68-
- name: Batch transcription
68+
- name: Use the fast transcription API (Preview)
69+
href: fast-transcription-create.md
70+
- name: Batch transcription API
6971
items:
7072
- name: What is batch transcription?
7173
href: batch-transcription.md
