Embedded Speech (preview)

Embedded Speech is designed for on-device speech-to-text and text-to-speech scenarios where cloud connectivity is intermittent or unavailable. For example, you can use embedded speech in industrial equipment, a voice-enabled air conditioning unit, or a car that might travel out of range. You can also develop hybrid cloud and offline solutions. For scenarios where your devices must be in a secure environment like a bank or government entity, you should first consider disconnected containers.

Important

Microsoft limits access to embedded speech. You can apply for access through the Azure Cognitive Services embedded speech limited access review. For more information, see Limited access for embedded speech.

Platform requirements

Embedded speech is included with the Speech SDK (version 1.24.1 and higher) for C#, C++, and Java. Refer to the general Speech SDK installation requirements for details specific to your programming language and target platform.

Choose your target environment

Embedded speech on Android requires Android 7.0 (API level 24) or higher on ARM64 (arm64-v8a) or ARM32 (armeabi-v7a) hardware.

Embedded text-to-speech with neural voices is only supported on ARM64.

Limitations

Embedded speech is only available with C#, C++, and Java SDKs. The other Speech SDKs, Speech CLI, and REST APIs don't support embedded speech.

Embedded speech recognition only supports mono, 16-bit, 16-kHz PCM-encoded WAV audio. A short configuration sketch follows these limitations.

Embedded neural voices only support a 24-kHz sample rate.
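
The Speech SDK's audio stream APIs can help make sure input audio matches the required format. The following C# snippet is a minimal sketch, assuming a placeholder WAV file path: it declares a 16-kHz, 16-bit, mono PCM stream format for a push stream, or reads a WAV file that's already in that format.

using Microsoft.CognitiveServices.Speech.Audio;

// Declare the only audio format that embedded speech recognition supports:
// 16-kHz sample rate, 16 bits per sample, 1 channel (mono), PCM.
var streamFormat = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1);

// Option 1: push raw PCM audio in that format from your own capture pipeline.
using var pushStream = AudioInputStream.CreatePushStream(streamFormat);
using var audioConfig = AudioConfig.FromStreamInput(pushStream);

// Option 2: read from a WAV file that is already mono, 16-bit, 16-kHz PCM.
// The path is a placeholder.
using var fileAudioConfig = AudioConfig.FromWavFileInput("C:\\dev\\audio\\sample-16khz-16bit-mono.wav");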

Models and voices

For embedded speech, you'll need to download the speech recognition models for speech-to-text and voices for text-to-speech. Instructions will be provided upon successful completion of the limited access review process.

The following speech-to-text models are available: de-DE, en-AU, en-CA, en-GB, en-IE, en-IN, en-NZ, en-US, es-ES, es-MX, fr-CA, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, nl-NL, pt-BR, ru-RU, sv-SE, tr-TR, zh-CN, zh-HK, and zh-TW.

The following text-to-speech locales and voices are available:

Locale (BCP-47)   Language                         Text-to-speech voices
de-DE             German (Germany)                 de-DE-KatjaNeural (Female), de-DE-ConradNeural (Male)
en-AU             English (Australia)              en-AU-AnnetteNeural (Female), en-AU-WilliamNeural (Male)
en-CA             English (Canada)                 en-CA-ClaraNeural (Female), en-CA-LiamNeural (Male)
en-GB             English (United Kingdom)         en-GB-LibbyNeural (Female), en-GB-RyanNeural (Male)
en-US             English (United States)          en-US-AriaNeural (Female), en-US-GuyNeural (Male), en-US-JennyNeural (Female)
es-ES             Spanish (Spain)                  es-ES-ElviraNeural (Female), es-ES-AlvaroNeural (Male)
es-MX             Spanish (Mexico)                 es-MX-DaliaNeural (Female), es-MX-JorgeNeural (Male)
fr-CA             French (Canada)                  fr-CA-SylvieNeural (Female), fr-CA-JeanNeural (Male)
fr-FR             French (France)                  fr-FR-DeniseNeural (Female), fr-FR-HenriNeural (Male)
it-IT             Italian (Italy)                  it-IT-IsabellaNeural (Female), it-IT-DiegoNeural (Male)
ja-JP             Japanese (Japan)                 ja-JP-NanamiNeural (Female), ja-JP-KeitaNeural (Male)
ko-KR             Korean (Korea)                   ko-KR-SunHiNeural (Female), ko-KR-InJoonNeural (Male)
pt-BR             Portuguese (Brazil)              pt-BR-FranciscaNeural (Female), pt-BR-AntonioNeural (Male)
zh-CN             Chinese (Mandarin, Simplified)   zh-CN-XiaoxiaoNeural (Female), zh-CN-YunxiNeural (Male)

Embedded speech configuration

For cloud-connected applications, as shown in most Speech SDK samples, you use the SpeechConfig object with a Speech resource key and region. For embedded speech, you don't use a Speech resource. Instead of a cloud resource, you use the models and voices that you downloaded to your local device.

Use the EmbeddedSpeechConfig object to set the location of the models or voices. If your application is used for both speech-to-text and text-to-speech, you can use the same EmbeddedSpeechConfig object to set the location of the models and voices.

// Provide the location of the models and voices.
List<string> paths = new List<string>();
paths.Add("C:\\dev\\embedded-speech\\stt-models");
paths.Add("C:\\dev\\embedded-speech\\tts-voices");
var embeddedSpeechConfig = EmbeddedSpeechConfig.FromPaths(paths.ToArray());

// For speech-to-text
embeddedSpeechConfig.SetSpeechRecognitionModel(
    "Microsoft Speech Recognizer en-US FP Model V8", 
    Environment.GetEnvironmentVariable("MODEL_KEY"));

// For text-to-speech
embeddedSpeechConfig.SetSpeechSynthesisVoice(
    "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
    Environment.GetEnvironmentVariable("VOICE_KEY"));
embeddedSpeechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);

You can find ready-to-use embedded speech samples on GitHub.

Tip

The GetEnvironmentVariable function is defined in the speech-to-text quickstart and text-to-speech quickstart.

// Provide the location of the models and voices.
vector<string> paths;
paths.push_back("C:\\dev\\embedded-speech\\stt-models");
paths.push_back("C:\\dev\\embedded-speech\\tts-voices");
auto embeddedSpeechConfig = EmbeddedSpeechConfig::FromPaths(paths);

// For speech-to-text
embeddedSpeechConfig->SetSpeechRecognitionModel(
    "Microsoft Speech Recognizer en-US FP Model V8", 
    GetEnvironmentVariable("MODEL_KEY"));

// For text-to-speech
embeddedSpeechConfig->SetSpeechSynthesisVoice(
    "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
    GetEnvironmentVariable("VOICE_KEY"));
embeddedSpeechConfig->SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat::Riff24Khz16BitMonoPcm);

You can find ready-to-use embedded speech samples on GitHub.

// Provide the location of the models and voices.
List<String> paths = new ArrayList<>();
paths.add("C:\\dev\\embedded-speech\\stt-models");
paths.add("C:\\dev\\embedded-speech\\tts-voices");
var embeddedSpeechConfig = EmbeddedSpeechConfig.fromPaths(paths);

// For speech-to-text
embeddedSpeechConfig.setSpeechRecognitionModel(
    "Microsoft Speech Recognizer en-US FP Model V8", 
    System.getenv("MODEL_KEY"));

// For text-to-speech
embeddedSpeechConfig.setSpeechSynthesisVoice(
    "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
    System.getenv("VOICE_KEY"));
embeddedSpeechConfig.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);

You can find ready-to-use embedded speech samples on GitHub.
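
After the models and voices are configured, you create the recognizer and synthesizer objects from the embedded configuration just as you would from a cloud configuration. The following C# snippet is a minimal sketch that reuses the embeddedSpeechConfig from the example above; the WAV file path is a placeholder, and the constructor overloads that accept an EmbeddedSpeechConfig are assumed from the embedded speech samples.

using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Speech-to-text with the embedded recognition model. The WAV file path is a placeholder.
using var audioConfig = AudioConfig.FromWavFileInput("C:\\dev\\audio\\sample-16khz-16bit-mono.wav");
using var recognizer = new SpeechRecognizer(embeddedSpeechConfig, audioConfig);
var recognitionResult = await recognizer.RecognizeOnceAsync();
Console.WriteLine($"RECOGNIZED: {recognitionResult.Text}");

// Text-to-speech with the embedded neural voice, rendered to the default speaker.
using var synthesizer = new SpeechSynthesizer(embeddedSpeechConfig);
var synthesisResult = await synthesizer.SpeakTextAsync("The weather today is sunny.");
Console.WriteLine($"SYNTHESIS: {synthesisResult.Reason}");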

Hybrid speech

Hybrid speech with the HybridSpeechConfig object uses the cloud speech service by default and embedded speech as a fallback in case cloud connectivity is limited or slow.

With hybrid speech configuration for speech-to-text (recognition models), embedded speech is used when the connection to the cloud service fails after repeated attempts. If the connection is later restored, recognition can resume using the cloud service.

With hybrid speech configuration for text-to-speech (voices), embedded and cloud synthesis are run in parallel and the result is selected based on which one gives a faster response. The best result is evaluated on each synthesis request.
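
For example, the following C# snippet is a minimal sketch of a hybrid configuration. It assumes the Speech resource key and region come from environment variables, reuses the embeddedSpeechConfig from the earlier example, and assumes the HybridSpeechConfig.FromConfigs factory method used in the embedded speech samples.

using System;
using Microsoft.CognitiveServices.Speech;

// Cloud configuration: Speech resource key and region from environment variables.
var cloudSpeechConfig = SpeechConfig.FromSubscription(
    Environment.GetEnvironmentVariable("SPEECH_KEY"),
    Environment.GetEnvironmentVariable("SPEECH_REGION"));

// Combine the cloud and embedded configurations. The cloud service is used by
// default, and embedded speech is the fallback.
var hybridSpeechConfig = HybridSpeechConfig.FromConfigs(cloudSpeechConfig, embeddedSpeechConfig);

// Pass hybridSpeechConfig to a SpeechRecognizer or SpeechSynthesizer in the same
// way you would pass a SpeechConfig or EmbeddedSpeechConfig.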

Cloud speech

For cloud speech, you use the SpeechConfig object, as shown in the speech-to-text quickstart and text-to-speech quickstart. To run the quickstarts for embedded speech, you can replace SpeechConfig with EmbeddedSpeechConfig or HybridSpeechConfig. Most of the other speech recognition and synthesis code is the same, whether you use a cloud, embedded, or hybrid configuration.

Next steps