
Commit 5cebfe0

llm-speech-transcription sdk (#47172)

* checkout from speech-transcription folder
* update after renaming
* refactoring
* Updated dependency versions
* fix spell check
* Add spell check dictionary entries and version configuration
* test customization
* renamed setmodels to clarify
* map from int to duration
* add ci and pom.xml and retest customization
* fix enable field
* fix enable field
* move AudioFileDetails into TranscriptionOptions and add 2 constructor overloads
* update test
* change constructor to transcribe(TranscriptionOptions options)
* update test
* fix linting
* add codeowner
* add codeowner
* add release date
* update changelog
* add response
* update sample, readme, tests
* update tsp files
* update version_client
* regenerate sdk from typespec
* adding javadoc for customized function
* update samples
* fix cspell
* fix cspell
* fix codeowner lint
* fix cspell
* fix codeowner lint
* update tests
* update broken links
* checkout cspell
* update readme and modify enabled property for EnhancedModeOptions
* modify enhanced mode customization
* modify enhanced mode customization and update readme
* created a new service directory to put all the transcription SDKs under
* update tsp commit
* fetch previous recording
* redo recording test
* undo changes to pom.xml in previous package service
1 parent: e7023eb

57 files changed (+6673, −0 lines)


.github/CODEOWNERS (7 additions, 0 deletions)

```diff
@@ -262,6 +262,13 @@
 # ServiceLabel: %Cognitive - Speech
 # ServiceOwners: @rhurey

+# PRLabel: %Speech Transcription
+/sdk/transcription/azure-ai-speech-transcription/ @amber-yujueWang @rhurey @xitzhang @Azure/azure-java-sdk
+
+# ServiceLabel: %Speech Transcription
+# AzureSdkOwners: @amber-yujueWang @rhurey @xitzhang
+# ServiceOwners: @rhurey @xitzhang @amber-yujueWang
+
 # PRLabel: %Cognitive - Text Analytics
 /sdk/textanalytics/ @samvaity @quentinRobinson @Azure/azure-java-sdk
```
eng/versioning/version_client.txt (1 addition, 0 deletions)

```diff
@@ -53,6 +53,7 @@ com.azure:azure-ai-openai-realtime;1.0.0-beta.1;1.0.0-beta.1
 com.azure:azure-ai-openai-stainless;1.0.0-beta.1;1.0.0-beta.1
 com.azure:azure-ai-personalizer;1.0.0-beta.1;1.0.0-beta.2
 com.azure:azure-ai-projects;1.0.0-beta.3;1.0.0-beta.4
+com.azure:azure-ai-speech-transcription;1.0.0-beta.1;1.0.0-beta.1
 com.azure:azure-ai-textanalytics;5.5.11;5.6.0-beta.1
 com.azure:azure-ai-textanalytics-perf;1.0.0-beta.1;1.0.0-beta.1
 com.azure:azure-ai-translation-text;1.1.7;2.0.0-beta.1
```

pom.xml (1 addition, 0 deletions)

```diff
@@ -265,6 +265,7 @@
 <module>sdk/timeseriesinsights</module>
 <module>sdk/tools</module>
 <module>sdk/trafficmanager</module>
+<module>sdk/transcription</module>
 <module>sdk/translation</module>
 <module>sdk/trustedsigning</module>
 <module>sdk/vision</module>
```
New file: 7 additions, 0 deletions

```markdown
# Release History

## 1.0.0-beta.1 (2025-12-19)

### Features Added

- Initial release of Azure AI Speech Transcription client library for Java.
```
New file: 305 additions, 0 deletions

# Azure AI Speech Transcription client library for Java

The Azure AI Speech Transcription client library provides a simple and efficient way to convert audio to text using Azure Cognitive Services. This library enables you to transcribe audio with features such as speaker diarization, profanity filtering, and phrase hints for improved accuracy.

## Documentation

Various documentation is available to help you get started:

- [API reference documentation][docs]
- [Product documentation][product_documentation]
- [Azure Speech Service documentation](https://learn.microsoft.com/azure/ai-services/speech-service/)

## Getting started

### Prerequisites

- [Java Development Kit (JDK)][jdk] version 8 or above
- An [Azure subscription][azure_subscription]
- An [Azure Speech resource](https://learn.microsoft.com/azure/ai-services/speech-service/overview#try-the-speech-service-for-free) or a [Cognitive Services multi-service resource](https://learn.microsoft.com/azure/ai-services/multi-service-resource)

### Adding the package to your product

[//]: # ({x-version-update-start;com.azure:azure-ai-speech-transcription;current})
```xml
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-ai-speech-transcription</artifactId>
    <version>1.0.0-beta.1</version>
</dependency>
```
[//]: # ({x-version-update-end})

#### Optional: Entra ID Authentication

If you plan to use Entra ID authentication (recommended for production), also add the `azure-identity` dependency:

```xml
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-identity</artifactId>
    <version>1.18.1</version>
</dependency>
```

### Authentication

Azure Speech Transcription supports two authentication methods:

#### Option 1: API Key Authentication (Subscription Key)

You can find your Speech resource's API key in the [Azure Portal](https://portal.azure.com) or by using the Azure CLI:

```bash
az cognitiveservices account keys list --name <your-resource-name> --resource-group <your-resource-group>
```

Once you have an API key, you can authenticate using `KeyCredential`:

```java
import com.azure.core.credential.KeyCredential;

TranscriptionClient client = new TranscriptionClientBuilder()
    .endpoint("https://<your-resource-name>.cognitiveservices.azure.com/")
    .credential(new KeyCredential("<your-api-key>"))
    .buildClient();
```

#### Option 2: Entra ID OAuth2 Authentication (Recommended for Production)

For production scenarios, it is recommended to use Entra ID authentication with managed identities or service principals. This provides better security and easier credential management.

```java
import com.azure.identity.DefaultAzureCredential;
import com.azure.identity.DefaultAzureCredentialBuilder;

// DefaultAzureCredential works with managed identities, service principals, the Azure CLI, and more.
DefaultAzureCredential credential = new DefaultAzureCredentialBuilder().build();

TranscriptionClient client = new TranscriptionClientBuilder()
    .endpoint("https://<your-resource-name>.cognitiveservices.azure.com/")
    .credential(credential)
    .buildClient();
```

**Note:** To use Entra ID authentication, you need to:

1. Add the `azure-identity` dependency to your project
2. Assign an appropriate role (e.g., "Cognitive Services User") to your managed identity or service principal
3. Ensure your Cognitive Services resource has Entra ID authentication enabled

For more information on Entra ID authentication, see:

- [Authenticate with Azure Identity](https://learn.microsoft.com/azure/developer/java/sdk/identity)
- [Azure Cognitive Services authentication](https://learn.microsoft.com/azure/ai-services/authentication)

## Key concepts

### TranscriptionClient

The `TranscriptionClient` is the primary interface for interacting with the Speech Transcription service. It provides synchronous methods to transcribe audio to text.

### TranscriptionAsyncClient

The `TranscriptionAsyncClient` provides asynchronous methods for transcribing audio, allowing non-blocking operations that return reactive types.

### Audio Formats

The service supports various audio formats, including WAV, MP3, and OGG. Audio must be:

- Shorter than 2 hours in duration
- Smaller than 250 MB in size
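These two limits can be checked client-side before uploading to fail fast on oversized input. A minimal sketch; the `AudioPrecheck` class and its method are hypothetical helpers, not part of the SDK:

```java
import java.time.Duration;

// Hypothetical pre-check for the documented service limits:
// audio must be under 2 hours long and under 250 MB. Not part of the SDK.
final class AudioPrecheck {
    static final long MAX_BYTES = 250L * 1024 * 1024;      // 250 MB
    static final Duration MAX_DURATION = Duration.ofHours(2);

    static boolean withinLimits(long sizeBytes, Duration duration) {
        return sizeBytes > 0
            && sizeBytes < MAX_BYTES
            && duration.compareTo(MAX_DURATION) < 0;
    }

    public static void main(String[] args) {
        // A 10 MB, 30-minute file passes; a 300 MB file does not.
        System.out.println(withinLimits(10L * 1024 * 1024, Duration.ofMinutes(30)));
        System.out.println(withinLimits(300L * 1024 * 1024, Duration.ofMinutes(5)));
    }
}
```

Determining the real duration of an audio file requires decoding its container format; in practice you would read it from your media pipeline rather than hard-code it.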
### Transcription Options

You can customize transcription with options such as:

- **Profanity filtering**: Control how profanity is handled in transcriptions
- **Speaker diarization**: Identify different speakers in multi-speaker audio
- **Phrase lists**: Provide domain-specific phrases to improve accuracy
- **Language detection**: Automatically detect the spoken language
- **Enhanced mode**: Improve transcription quality with custom prompts, translation, and task-specific configurations

## Examples

### Transcribe an audio file

```java com.azure.ai.speech.transcription.readme
TranscriptionClient client = new TranscriptionClientBuilder()
    .endpoint("https://<your-resource-name>.cognitiveservices.azure.com/")
    .credential(new KeyCredential("<your-api-key>"))
    .buildClient();

try {
    // Read the audio file into memory
    byte[] audioData = Files.readAllBytes(Paths.get("path/to/audio.wav"));

    // Wrap the audio bytes in AudioFileDetails
    AudioFileDetails audioFileDetails = new AudioFileDetails(BinaryData.fromBytes(audioData));

    // Create transcription options
    TranscriptionOptions options = new TranscriptionOptions(audioFileDetails);

    // Transcribe the audio
    TranscriptionResult result = client.transcribe(options);

    // Process the results
    System.out.println("Duration: " + result.getDuration() + " ms");
    result.getCombinedPhrases().forEach(phrase -> {
        System.out.println("Channel " + phrase.getChannel() + ": " + phrase.getText());
    });
} catch (Exception e) {
    System.err.println("Error during transcription: " + e.getMessage());
}
```

### Transcribe using an audio URL

You can transcribe audio directly from a URL without downloading the file first:

```java readme-sample-transcribeWithAudioUrl
TranscriptionClient client = new TranscriptionClientBuilder()
    .endpoint("https://<your-resource-name>.cognitiveservices.azure.com/")
    .credential(new KeyCredential("<your-api-key>"))
    .buildClient();

// Create transcription options with an audio URL
TranscriptionOptions options = new TranscriptionOptions("https://example.com/audio.wav");

// Transcribe the audio
TranscriptionResult result = client.transcribe(options);

// Process the results
result.getCombinedPhrases().forEach(phrase -> {
    System.out.println(phrase.getText());
});
```

### Transcribe with multi-language support

The service can automatically detect and transcribe multiple languages within the same audio file.

```java com.azure.ai.speech.transcription.transcriptionoptions.multilanguage
byte[] audioData = Files.readAllBytes(Paths.get("path/to/audio.wav"));

AudioFileDetails audioFileDetails = new AudioFileDetails(BinaryData.fromBytes(audioData));

// Configure transcription WITHOUT specifying locales.
// This allows the service to auto-detect and transcribe multiple languages.
TranscriptionOptions options = new TranscriptionOptions(audioFileDetails);

TranscriptionResult result = client.transcribe(options);

result.getPhrases().forEach(phrase -> {
    System.out.println("Language: " + phrase.getLocale());
    System.out.println("Text: " + phrase.getText());
});
```

### Transcribe with enhanced mode

Enhanced mode provides advanced features to improve transcription accuracy with custom prompts. Enhanced mode is automatically enabled when you create an `EnhancedModeOptions` instance.

```java com.azure.ai.speech.transcription.transcriptionoptions.enhancedmode
byte[] audioData = Files.readAllBytes(Paths.get("path/to/audio.wav"));

AudioFileDetails audioFileDetails = new AudioFileDetails(BinaryData.fromBytes(audioData));

// Enhanced mode is automatically enabled
EnhancedModeOptions enhancedMode = new EnhancedModeOptions()
    .setTask("transcribe")
    .setPrompts(java.util.Arrays.asList("Output must be in lexical format."));

TranscriptionOptions options = new TranscriptionOptions(audioFileDetails)
    .setEnhancedModeOptions(enhancedMode);

TranscriptionResult result = client.transcribe(options);

System.out.println("Transcription: " + result.getCombinedPhrases().get(0).getText());
```

### Transcribe with phrase list

You can use a phrase list to improve recognition accuracy for specific terms.

```java com.azure.ai.speech.transcription.transcriptionoptions.phraselist
byte[] audioData = Files.readAllBytes(Paths.get("path/to/audio.wav"));

AudioFileDetails audioFileDetails = new AudioFileDetails(BinaryData.fromBytes(audioData));

PhraseListOptions phraseListOptions = new PhraseListOptions()
    .setPhrases(java.util.Arrays.asList("Azure", "Cognitive Services"))
    .setBiasingWeight(5.0);

TranscriptionOptions options = new TranscriptionOptions(audioFileDetails)
    .setPhraseListOptions(phraseListOptions);

TranscriptionResult result = client.transcribe(options);

result.getCombinedPhrases().forEach(phrase -> {
    System.out.println(phrase.getText());
});
```

### Service API versions

The client library targets the latest service API version by default. The service client builder accepts an optional service API version parameter to specify which API version to communicate with.

#### Select a service API version

You can explicitly select a supported service API version when initializing a service client via the service client builder. This ensures that the client communicates with the service using the specified API version.

When selecting an API version, verify that there are no breaking changes compared to the latest API version. If there are significant differences, API calls may fail due to incompatibility.

Always ensure that the chosen API version is fully supported and operational for your use case and that it aligns with the service's versioning policy.

## Troubleshooting

### Enable client logging

You can enable logging to debug issues with the client library. The Azure client libraries for Java use the SLF4J logging facade. You can configure logging by adding a logging dependency and a configuration file. For more information, see the [logging documentation](https://learn.microsoft.com/azure/developer/java/sdk/logging-overview).
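For example, with Logback as the SLF4J binding, a minimal `logback.xml` on the classpath turns on DEBUG output for the Azure client libraries. The pattern and levels below are illustrative, not required values:

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Verbose output for Azure SDK clients only -->
  <logger name="com.azure" level="DEBUG"/>

  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```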
### Common issues

#### Authentication errors

- Verify that your API key is correct
- Ensure your endpoint URL matches your Azure resource region
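A quick sanity check on the endpoint string can catch the most common mistakes (wrong scheme, typo in the host) before a request is ever sent. This sketch is a hypothetical helper, not part of the SDK; the host suffixes are the common Cognitive Services endpoint forms, and custom domains will differ:

```java
import java.net.URI;

// Hypothetical sanity check for the endpoint passed to the client builder.
// Accepts the two common Cognitive Services host suffixes over HTTPS.
final class EndpointCheck {
    static boolean looksValid(String endpoint) {
        final URI uri;
        try {
            uri = URI.create(endpoint);
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URI at all
        }
        String host = uri.getHost();
        return "https".equals(uri.getScheme()) && host != null
            && (host.endsWith(".cognitiveservices.azure.com")
                || host.endsWith(".api.cognitive.microsoft.com"));
    }
}
```

A failing check means the request would not reach a Cognitive Services endpoint; a passing check does not guarantee the resource exists or the key is valid.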
269+
270+
#### Audio format errors
271+
272+
- Verify your audio file is in a supported format
273+
- Ensure the audio file size is under 250 MB and duration is under 2 hours
274+
275+
### Getting help
276+
277+
If you encounter issues:
278+
279+
- Check the [troubleshooting guide](https://learn.microsoft.com/azure/ai-services/speech-service/troubleshooting)
280+
- Search for existing issues or create a new one on [GitHub](https://github.com/Azure/azure-sdk-for-java/issues)
281+
- Ask questions on [Stack Overflow](https://stackoverflow.com/questions/tagged/azure-java-sdk) with the `azure-java-sdk` tag
282+
283+
## Next steps
284+
285+
- Explore the [samples](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/transcription/azure-ai-speech-transcription/src/samples) for more examples
286+
- Learn more about [Azure Speech Service](https://learn.microsoft.com/azure/ai-services/speech-service/)
287+
- Review the [API reference documentation][docs] for detailed information about classes and methods
288+
289+
## Contributing
290+
291+
292+
For details on contributing to this repository, see the [contributing guide](https://github.com/Azure/azure-sdk-for-java/blob/main/CONTRIBUTING.md).
293+
294+
1. Fork it
295+
1. Create your feature branch (`git checkout -b my-new-feature`)
296+
1. Commit your changes (`git commit -am 'Add some feature'`)
297+
1. Push to the branch (`git push origin my-new-feature`)
298+
1. Create new Pull Request
299+
300+
<!-- LINKS -->
301+
[product_documentation]: https://learn.microsoft.com/azure/ai-services/speech-service/
302+
[docs]: https://azure.github.io/azure-sdk-for-java/
303+
[jdk]: https://learn.microsoft.com/azure/developer/java/fundamentals/
304+
[azure_subscription]: https://azure.microsoft.com/free/
305+
New file: 1 addition, 0 deletions

```json
{"AssetsRepo":"Azure/azure-sdk-assets","AssetsRepoPrefixPath":"java","TagPrefix":"java/transcription/azure-ai-speech-transcription","Tag":"java/transcription/azure-ai-speech-transcription_c82ca4aec0"}
```
New file: 16 additions, 0 deletions

```json
{
  "version": "0.2",
  "language": "en",
  "words": [
    "azuread",
    "BYOD",
    "BYOS",
    "dexec",
    "diarization",
    "doméstica",
    "empleada",
    "habitación",
    "misrecognized",
    "Mundo"
  ]
}
```
