Embedded Speech (preview)

Embedded Speech is designed for on-device speech-to-text and text-to-speech scenarios where cloud connectivity is intermittent or unavailable. For example, you can use embedded speech in industrial equipment, a voice-enabled air conditioning unit, or a car that might travel out of range. You can also develop hybrid cloud and offline solutions. For scenarios where your devices must be in a secure environment like a bank or government entity, you should first consider disconnected containers.

Important

Microsoft limits access to embedded speech. You can apply for access through the Azure Cognitive Services embedded speech limited access review. For more information, see Limited access for embedded speech.

Platform requirements

Embedded speech is included with the Speech SDK (version 1.24.1 and higher) for C#, C++, and Java. Refer to the general Speech SDK installation requirements for details specific to your programming language and target platform.

Choose your target environment

Embedded speech on Android requires Android 7.0 (API level 24) or higher on ARM64 (arm64-v8a) or ARM32 (armeabi-v7a) hardware.

Embedded text-to-speech with neural voices is only supported on ARM64.

Limitations

Embedded speech is only available with C#, C++, and Java SDKs. The other Speech SDKs, Speech CLI, and REST APIs don't support embedded speech.

Embedded speech recognition only supports mono, 16-bit, 16-kHz PCM-encoded WAV audio. A short configuration sketch follows these limitations.

Embedded neural voices only support a 24-kHz sample rate.
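
The Speech SDK's audio stream APIs can help make sure input audio matches the required format. The following C# snippet is a minimal sketch, assuming a placeholder WAV file path: it declares a 16-kHz, 16-bit, mono PCM stream format for a push stream, or reads a WAV file that's already in that format.

using Microsoft.CognitiveServices.Speech.Audio;

// Declare the only audio format that embedded speech recognition supports:
// 16-kHz sample rate, 16 bits per sample, 1 channel (mono), PCM.
var streamFormat = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1);

// Option 1: push raw PCM audio in that format from your own capture pipeline.
using var pushStream = AudioInputStream.CreatePushStream(streamFormat);
using var audioConfig = AudioConfig.FromStreamInput(pushStream);

// Option 2: read from a WAV file that is already mono, 16-bit, 16-kHz PCM.
// The path is a placeholder.
using var fileAudioConfig = AudioConfig.FromWavFileInput("C:\\dev\\audio\\sample-16khz-16bit-mono.wav");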

Models and voices

For embedded speech, you'll need to download the speech recognition models for speech-to-text and voices for text-to-speech. Instructions will be provided upon successful completion of the limited access review process.

The following speech-to-text models are available: de-DE, en-AU, en-CA, en-GB, en-IE, en-IN, en-NZ, en-US, es-ES, es-MX, fr-CA, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, nl-NL, pt-BR, ru-RU, sv-SE, tr-TR, zh-CN, zh-HK, and zh-TW.

The following text-to-speech locales and voices are available:

Locale (BCP-47)   Language                         Text-to-speech voices
de-DE             German (Germany)                 de-DE-KatjaNeural (Female), de-DE-ConradNeural (Male)
en-AU             English (Australia)              en-AU-AnnetteNeural (Female), en-AU-WilliamNeural (Male)
en-CA             English (Canada)                 en-CA-ClaraNeural (Female), en-CA-LiamNeural (Male)
en-GB             English (United Kingdom)         en-GB-LibbyNeural (Female), en-GB-RyanNeural (Male)
en-US             English (United States)          en-US-AriaNeural (Female), en-US-GuyNeural (Male), en-US-JennyNeural (Female)
es-ES             Spanish (Spain)                  es-ES-ElviraNeural (Female), es-ES-AlvaroNeural (Male)
es-MX             Spanish (Mexico)                 es-MX-DaliaNeural (Female), es-MX-JorgeNeural (Male)
fr-CA             French (Canada)                  fr-CA-SylvieNeural (Female), fr-CA-JeanNeural (Male)
fr-FR             French (France)                  fr-FR-DeniseNeural (Female), fr-FR-HenriNeural (Male)
it-IT             Italian (Italy)                  it-IT-IsabellaNeural (Female), it-IT-DiegoNeural (Male)
ja-JP             Japanese (Japan)                 ja-JP-NanamiNeural (Female), ja-JP-KeitaNeural (Male)
ko-KR             Korean (Korea)                   ko-KR-SunHiNeural (Female), ko-KR-InJoonNeural (Male)
pt-BR             Portuguese (Brazil)              pt-BR-FranciscaNeural (Female), pt-BR-AntonioNeural (Male)
zh-CN             Chinese (Mandarin, Simplified)   zh-CN-XiaoxiaoNeural (Female), zh-CN-YunxiNeural (Male)

Embedded speech configuration

For cloud-connected applications, as shown in most Speech SDK samples, you use the SpeechConfig object with a Speech resource key and region. For embedded speech, you don't use a Speech resource. Instead of a cloud resource, you use the models and voices that you downloaded to your local device.

Use the EmbeddedSpeechConfig object to set the location of the models or voices. If your application is used for both speech-to-text and text-to-speech, you can use the same EmbeddedSpeechConfig object to set the location of the models and voices.

// Provide the location of the models and voices.
List<string> paths = new List<string>();
paths.Add("C:\\dev\\embedded-speech\\stt-models");
paths.Add("C:\\dev\\embedded-speech\\tts-voices");
var embeddedSpeechConfig = EmbeddedSpeechConfig.FromPaths(paths.ToArray());

// For speech-to-text
embeddedSpeechConfig.SetSpeechRecognitionModel(
    "Microsoft Speech Recognizer en-US FP Model V8", 
    Environment.GetEnvironmentVariable("MODEL_KEY"));

// For text-to-speech
embeddedSpeechConfig.SetSpeechSynthesisVoice(
    "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
    Environment.GetEnvironmentVariable("VOICE_KEY"));
embeddedSpeechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);

You can find ready-to-use embedded speech samples on GitHub.

Tip

The GetEnvironmentVariable function is defined in the speech-to-text quickstart and text-to-speech quickstart.

// Provide the location of the models and voices.
vector<string> paths;
paths.push_back("C:\\dev\\embedded-speech\\stt-models");
paths.push_back("C:\\dev\\embedded-speech\\tts-voices");
auto embeddedSpeechConfig = EmbeddedSpeechConfig::FromPaths(paths);

// For speech-to-text
embeddedSpeechConfig->SetSpeechRecognitionModel(
    "Microsoft Speech Recognizer en-US FP Model V8", 
    GetEnvironmentVariable("MODEL_KEY"));

// For text-to-speech
embeddedSpeechConfig->SetSpeechSynthesisVoice(
    "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
    GetEnvironmentVariable("VOICE_KEY"));
embeddedSpeechConfig->SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat::Riff24Khz16BitMonoPcm);

You can find ready-to-use embedded speech samples on GitHub.

// Provide the location of the models and voices.
List<String> paths = new ArrayList<>();
paths.add("C:\\dev\\embedded-speech\\stt-models");
paths.add("C:\\dev\\embedded-speech\\tts-voices");
var embeddedSpeechConfig = EmbeddedSpeechConfig.fromPaths(paths);

// For speech-to-text
embeddedSpeechConfig.setSpeechRecognitionModel(
    "Microsoft Speech Recognizer en-US FP Model V8", 
    System.getenv("MODEL_KEY"));

// For text-to-speech
embeddedSpeechConfig.setSpeechSynthesisVoice(
    "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
    System.getenv("VOICE_KEY"));
embeddedSpeechConfig.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);

You can find ready-to-use embedded speech samples on GitHub.
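
After the models and voices are configured, you create the recognizer and synthesizer objects from the embedded configuration just as you would from a cloud configuration. The following C# snippet is a minimal sketch that reuses the embeddedSpeechConfig from the example above; the WAV file path is a placeholder, and the constructor overloads that accept an EmbeddedSpeechConfig are assumed from the embedded speech samples.

using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Speech-to-text with the embedded recognition model. The WAV file path is a placeholder.
using var audioConfig = AudioConfig.FromWavFileInput("C:\\dev\\audio\\sample-16khz-16bit-mono.wav");
using var recognizer = new SpeechRecognizer(embeddedSpeechConfig, audioConfig);
var recognitionResult = await recognizer.RecognizeOnceAsync();
Console.WriteLine($"RECOGNIZED: {recognitionResult.Text}");

// Text-to-speech with the embedded neural voice, rendered to the default speaker.
using var synthesizer = new SpeechSynthesizer(embeddedSpeechConfig);
var synthesisResult = await synthesizer.SpeakTextAsync("The weather today is sunny.");
Console.WriteLine($"SYNTHESIS: {synthesisResult.Reason}");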

Hybrid speech

Hybrid speech with the HybridSpeechConfig object uses the cloud speech service by default and embedded speech as a fallback in case cloud connectivity is limited or slow.

With hybrid speech configuration for speech-to-text (recognition models), embedded speech is used when the connection to the cloud service fails after repeated attempts. If the connection is later restored, recognition can resume using the cloud service.

With hybrid speech configuration for text-to-speech (voices), embedded and cloud synthesis are run in parallel and the result is selected based on which one gives a faster response. The best result is evaluated on each synthesis request.
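
For example, the following C# snippet is a minimal sketch of a hybrid configuration. It assumes the Speech resource key and region come from environment variables, reuses the embeddedSpeechConfig from the earlier example, and assumes the HybridSpeechConfig.FromConfigs factory method used in the embedded speech samples.

using System;
using Microsoft.CognitiveServices.Speech;

// Cloud configuration: Speech resource key and region from environment variables.
var cloudSpeechConfig = SpeechConfig.FromSubscription(
    Environment.GetEnvironmentVariable("SPEECH_KEY"),
    Environment.GetEnvironmentVariable("SPEECH_REGION"));

// Combine the cloud and embedded configurations. The cloud service is used by
// default, and embedded speech is the fallback.
var hybridSpeechConfig = HybridSpeechConfig.FromConfigs(cloudSpeechConfig, embeddedSpeechConfig);

// Pass hybridSpeechConfig to a SpeechRecognizer or SpeechSynthesizer in the same
// way you would pass a SpeechConfig or EmbeddedSpeechConfig.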

Cloud speech

For cloud speech, you use the SpeechConfig object, as shown in the speech-to-text quickstart and text-to-speech quickstart. To run the quickstarts for embedded speech, you can replace SpeechConfig with EmbeddedSpeechConfig or HybridSpeechConfig. Most of the other speech recognition and synthesis code is the same, whether you use a cloud, embedded, or hybrid configuration.

Next steps