---
title: Embedded Speech - Speech service
titleSuffix: Azure Cognitive Services
description: Embedded Speech is designed for on-device scenarios where cloud connectivity is intermittent or unavailable.
services: cognitive-services
author: eric-urban
manager: nitinme
ms.service: cognitive-services
ms.subservice: speech-service
ms.topic: how-to
ms.date: 10/31/2022
ms.author: eur
zone_pivot_groups: programming-languages-set-thirteen
---

# Embedded Speech (preview)

Embedded Speech is designed for on-device [speech-to-text](speech-to-text.md) and [text-to-speech](text-to-speech.md) scenarios where cloud connectivity is intermittent or unavailable. For example, you can use embedded speech in medical equipment, a voice-enabled air conditioning unit, or a car that might travel out of range. You can also develop hybrid cloud and offline solutions. For scenarios where your devices must be in a secure environment, such as a bank or government entity, first consider [disconnected containers](/azure/cognitive-services/containers/disconnected-containers).

> [!IMPORTANT]
> Microsoft limits access to embedded speech. You can apply for access through the Azure Cognitive Services [embedded speech limited access review](https://aka.ms/csgate-embedded-speech). For more information, see [Limited access for embedded speech](/legal/cognitive-services/speech-service/embedded-speech/limited-access-embedded-speech?context=/azure/cognitive-services/speech-service/context/context).
## Platform requirements

Embedded speech is included with the Speech SDK (version 1.24.1 and higher) for C#, C++, and Java. Refer to the general [Speech SDK installation requirements](quickstarts/setup-platform.md) for programming language and target platform-specific details.

**Choose your target environment**

# [Android](#tab/android)

Requires Android 7.0 (API level 24) or higher on ARM64 (`arm64-v8a`) or ARM32 (`armeabi-v7a`) hardware.

Embedded TTS with neural voices is only supported on ARM64.

# [Linux](#tab/linux)

Requires Linux on x64, ARM64, or ARM32 hardware with [supported Linux distributions](quickstarts/setup-platform.md?tabs=linux).

Embedded speech isn't supported on RHEL/CentOS 7.

Embedded TTS with neural voices isn't supported on ARM32.

# [macOS](#tab/macos)

Requires macOS 10.14 or newer on x64 or ARM64 hardware.

# [Windows](#tab/windows)

Requires Windows 10 or newer on x64 or ARM64 hardware.

The latest [Microsoft Visual C++ Redistributable for Visual Studio 2015-2022](/cpp/windows/latest-supported-vc-redist?view=msvc-170&preserve-view=true) must be installed regardless of the programming language used with the Speech SDK.

The Speech SDK for Java doesn't support Windows on ARM64.

---

## Limitations

Embedded speech is only available with the C#, C++, and Java SDKs. The other Speech SDKs, the Speech CLI, and the REST APIs don't support embedded speech.

Embedded speech recognition only supports mono 16-bit, 16-kHz PCM-encoded WAV audio.

Embedded neural voices only support a 24-kHz sample rate.
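
If you stream audio to embedded speech recognition from a custom capture source rather than a WAV file, the stream format you declare must match the recognition constraint above. Here's a minimal C# sketch; the push-stream setup is illustrative, not specific to embedded speech:

```csharp
using Microsoft.CognitiveServices.Speech.Audio;

// Declare the only format that embedded speech recognition accepts:
// mono, 16-bit samples, 16-kHz sample rate, PCM encoding.
var format = AudioStreamFormat.GetWaveFormatPCM(
    samplesPerSecond: 16000, bitsPerSample: 16, channels: 1);
var pushStream = AudioInputStream.CreatePushStream(format);
var audioConfig = AudioConfig.FromStreamInput(pushStream);
```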

## Models and voices

For embedded speech, you need to download the speech recognition models for [speech-to-text](speech-to-text.md) and voices for [text-to-speech](text-to-speech.md). Instructions are provided upon successful completion of the [limited access review](https://aka.ms/csgate-embedded-speech) process.

## Embedded speech configuration

For cloud-connected applications, as shown in most Speech SDK samples, you use the `SpeechConfig` object with a Speech resource key and region. For embedded speech, you don't use a Speech resource. Instead of a cloud resource, you use the [models and voices](#models-and-voices) that you downloaded to your local device.

Use the `EmbeddedSpeechConfig` object to set the location of the models or voices. If your application is used for both speech-to-text and text-to-speech, you can use the same `EmbeddedSpeechConfig` object to set the locations of both the models and the voices.

::: zone pivot="programming-language-csharp"

```csharp
// Provide the location of the models and voices.
List<string> paths = new List<string>();
paths.Add("C:\\dev\\embedded-speech\\stt-models");
paths.Add("C:\\dev\\embedded-speech\\tts-voices");
var embeddedSpeechConfig = EmbeddedSpeechConfig.FromPaths(paths.ToArray());

// For speech-to-text
embeddedSpeechConfig.SetSpeechRecognitionModel(
    "Microsoft Speech Recognizer en-US FP Model V8",
    Environment.GetEnvironmentVariable("MODEL_KEY"));

// For text-to-speech
embeddedSpeechConfig.SetSpeechSynthesisVoice(
    "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
    Environment.GetEnvironmentVariable("VOICE_KEY"));
embeddedSpeechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);
```
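
Once configured, you can pass the embedded configuration to a recognizer or synthesizer much like you'd pass a `SpeechConfig`. The following is a rough sketch rather than the official sample; the WAV file path is hypothetical, and it assumes you're inside an async method:

```csharp
// Recognize speech from a mono 16-bit, 16-kHz WAV file.
using var audioConfig = AudioConfig.FromWavFileInput("C:\\dev\\audio\\input.wav");
using var recognizer = new SpeechRecognizer(embeddedSpeechConfig, audioConfig);
var result = await recognizer.RecognizeOnceAsync();
Console.WriteLine($"Recognized: {result.Text}");

// Synthesize speech with the embedded neural voice to the default speaker.
using var speakerConfig = AudioConfig.FromDefaultSpeakerOutput();
using var synthesizer = new SpeechSynthesizer(embeddedSpeechConfig, speakerConfig);
using var synthesisResult = await synthesizer.SpeakTextAsync("Hello from embedded speech.");
```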

You can find ready-to-use embedded speech samples at [GitHub](https://aka.ms/csspeech/samples).

- [C# (.NET 6.0)](https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/csharp/dotnetcore/embedded-speech)
- [C# for Unity](https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/csharp/unity/embedded-speech)

::: zone-end

::: zone pivot="programming-language-cpp"

> [!TIP]
> The `GetEnvironmentVariable` function is defined in the [speech-to-text quickstart](get-started-speech-to-text.md) and [text-to-speech quickstart](get-started-text-to-speech.md).

```cpp
// Provide the location of the models and voices.
vector<string> paths;
paths.push_back("C:\\dev\\embedded-speech\\stt-models");
paths.push_back("C:\\dev\\embedded-speech\\tts-voices");
auto embeddedSpeechConfig = EmbeddedSpeechConfig::FromPaths(paths);

// For speech-to-text
embeddedSpeechConfig->SetSpeechRecognitionModel(
    "Microsoft Speech Recognizer en-US FP Model V8",
    GetEnvironmentVariable("MODEL_KEY"));

// For text-to-speech
embeddedSpeechConfig->SetSpeechSynthesisVoice(
    "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
    GetEnvironmentVariable("VOICE_KEY"));
embeddedSpeechConfig->SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat::Riff24Khz16BitMonoPcm);
```

You can find ready-to-use embedded speech samples at [GitHub](https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/cpp/embedded-speech).

::: zone-end

::: zone pivot="programming-language-java"

```java
// Provide the location of the models and voices.
List<String> paths = new ArrayList<>();
paths.add("C:\\dev\\embedded-speech\\stt-models");
paths.add("C:\\dev\\embedded-speech\\tts-voices");
var embeddedSpeechConfig = EmbeddedSpeechConfig.fromPaths(paths);

// For speech-to-text
embeddedSpeechConfig.setSpeechRecognitionModel(
    "Microsoft Speech Recognizer en-US FP Model V8",
    System.getenv("MODEL_KEY"));

// For text-to-speech
embeddedSpeechConfig.setSpeechSynthesisVoice(
    "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
    System.getenv("VOICE_KEY"));
embeddedSpeechConfig.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);
```

You can find ready-to-use embedded speech samples at [GitHub](https://aka.ms/csspeech/samples).

- [Java (JRE)](https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/java/jre/embedded-speech)
- [Java for Android](https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/java/android/embedded-speech)

::: zone-end

## Hybrid speech

Hybrid speech with the `HybridSpeechConfig` object uses the cloud speech service by default, with embedded speech as a fallback in case cloud connectivity is limited or slow.

With hybrid speech configuration for [speech-to-text](speech-to-text.md) (recognition models), embedded speech is used when the connection to the cloud service fails after repeated attempts. If the connection is later restored, recognition can resume using the cloud service.

With hybrid speech configuration for [text-to-speech](text-to-speech.md) (voices), embedded and cloud synthesis run in parallel, and the result that responds faster is selected. This evaluation is made on each synthesis request.
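
As a minimal sketch (assuming the cloud key and region environment variables from the quickstarts, and the `embeddedSpeechConfig` object shown earlier), a hybrid configuration combines the two:

```csharp
// Combine a cloud configuration with an embedded fallback configuration.
var cloudSpeechConfig = SpeechConfig.FromSubscription(
    Environment.GetEnvironmentVariable("SPEECH_KEY"),
    Environment.GetEnvironmentVariable("SPEECH_REGION"));
var hybridSpeechConfig = HybridSpeechConfig.FromConfigs(cloudSpeechConfig, embeddedSpeechConfig);
```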

## Cloud speech

For cloud speech, you use the `SpeechConfig` object, as shown in the [speech-to-text quickstart](get-started-speech-to-text.md) and [text-to-speech quickstart](get-started-text-to-speech.md). To run the quickstarts for embedded speech, you can replace `SpeechConfig` with `EmbeddedSpeechConfig` or `HybridSpeechConfig`. Most of the other speech recognition and synthesis code is the same, whether you use a cloud, embedded, or hybrid configuration.
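
For example, only the configuration object changes; a recognizer is constructed the same way in each case. This sketch reuses the configuration and audio objects from the earlier sections:

```csharp
// Cloud, embedded, and hybrid recognizers differ only in the config passed in.
using var cloudRecognizer = new SpeechRecognizer(cloudSpeechConfig, audioConfig);
using var embeddedRecognizer = new SpeechRecognizer(embeddedSpeechConfig, audioConfig);
using var hybridRecognizer = new SpeechRecognizer(hybridSpeechConfig, audioConfig);
```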

## Next steps

- [Quickstart: Recognize and convert speech to text](get-started-speech-to-text.md)
- [Quickstart: Convert text to speech](get-started-text-to-speech.md)