Embedded Speech

Caution

This article references CentOS, a Linux distribution that is nearing End Of Life (EOL) status. Please consider your use and planning accordingly.

Embedded Speech is designed for on-device speech to text and text to speech scenarios where cloud connectivity is intermittent or unavailable. For example, you can use embedded speech in industrial equipment, a voice-enabled air conditioning unit, or a car that might travel out of range. You can also develop hybrid cloud and offline solutions. For scenarios where your devices must be in a secure environment like a bank or government entity, you should first consider disconnected containers.

Important

Microsoft limits access to embedded speech. You can apply for access through the Azure AI Speech embedded speech limited access review. For more information, see Limited access for embedded speech.

Platform requirements

Embedded speech is included with the Speech SDK (version 1.24.1 and higher) for C#, C++, and Java. Refer to the general Speech SDK installation requirements for details specific to your programming language and target platform.

Choose your target environment

For Android, embedded speech requires Android 7.0 (API level 24) or higher on Arm64 (arm64-v8a) or Arm32 (armeabi-v7a) hardware.

Embedded TTS with neural voices is only supported on Arm64.

Limitations

Embedded speech is only available with the C#, C++, and Java SDKs. The other Speech SDKs, Speech CLI, and REST APIs don't support embedded speech.

Embedded speech recognition only supports mono, 16-bit, 8-kHz or 16-kHz PCM-encoded WAV audio formats (see the sketch after these limitations).

Embedded neural voices support a 24-kHz sample rate (RIFF/RAW), with a RAM requirement of 100 MB.
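For example, when recognizing from a stream, you can declare an input format that matches these constraints up front. The following C# sketch assumes the Speech SDK audio stream APIs; the commented WAV file path is illustrative.

using Microsoft.CognitiveServices.Speech.Audio;

// Declare a stream format that satisfies the embedded recognition
// constraints: mono, 16-bit samples, 16-kHz sample rate.
var streamFormat = AudioStreamFormat.GetWaveFormatPCM(
    samplesPerSecond: 16000, bitsPerSample: 16, channels: 1);
using var pushStream = AudioInputStream.CreatePushStream(streamFormat);
using var audioConfig = AudioConfig.FromStreamInput(pushStream);

// Or read from a WAV file that already uses a supported format
// (path is hypothetical):
// using var audioConfig = AudioConfig.FromWavFileInput("input-16khz-mono.wav");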

Embedded speech SDK packages

For C# embedded applications, install the following Speech SDK for C# packages:

| Package | Description |
| ------- | ----------- |
| Microsoft.CognitiveServices.Speech | Required to use the Speech SDK |
| Microsoft.CognitiveServices.Speech.Extension.Embedded.SR | Required for embedded speech recognition |
| Microsoft.CognitiveServices.Speech.Extension.Embedded.TTS | Required for embedded speech synthesis |
| Microsoft.CognitiveServices.Speech.Extension.ONNX.Runtime | Required for embedded speech recognition and synthesis |
| Microsoft.CognitiveServices.Speech.Extension.Telemetry | Required for embedded speech recognition and synthesis |

For C++ embedded applications, install the following Speech SDK for C++ packages:

| Package | Description |
| ------- | ----------- |
| Microsoft.CognitiveServices.Speech | Required to use the Speech SDK |
| Microsoft.CognitiveServices.Speech.Extension.Embedded.SR | Required for embedded speech recognition |
| Microsoft.CognitiveServices.Speech.Extension.Embedded.TTS | Required for embedded speech synthesis |
| Microsoft.CognitiveServices.Speech.Extension.ONNX.Runtime | Required for embedded speech recognition and synthesis |
| Microsoft.CognitiveServices.Speech.Extension.Telemetry | Required for embedded speech recognition and synthesis |


For Java embedded applications, add client-sdk-embedded (.jar) as a dependency. This package supports cloud, embedded, and hybrid speech.

Important

Don't add client-sdk in the same project, since it supports only cloud speech services.

Follow these steps to install the Speech SDK for Java using Apache Maven:

  1. Install Apache Maven.
  2. Open a command prompt where you want the new project, and create a new pom.xml file.
  3. Copy the following XML content into pom.xml:
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>com.microsoft.cognitiveservices.speech.samples</groupId>
        <artifactId>quickstart-eclipse</artifactId>
        <version>1.0.0-SNAPSHOT</version>
        <build>
            <sourceDirectory>src</sourceDirectory>
            <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.7.0</version>
                <configuration>
                <source>1.8</source>
                <target>1.8</target>
                </configuration>
            </plugin>
            </plugins>
        </build>
        <dependencies>
            <dependency>
            <groupId>com.microsoft.cognitiveservices.speech</groupId>
            <artifactId>client-sdk-embedded</artifactId>
            <version>1.35.0</version>
            </dependency>
        </dependencies>
    </project>
    
  4. Run the following Maven command to install the Speech SDK and dependencies.
    mvn clean dependency:copy-dependencies
    

Models and voices

For embedded speech, you need to download the speech recognition models for speech to text and voices for text to speech. Instructions are provided upon successful completion of the limited access review process.

The following speech to text models are available: da-DK, de-DE, en-AU, en-CA, en-GB, en-IE, en-IN, en-NZ, en-US, es-ES, es-MX, fr-CA, fr-FR, it-IT, ja-JP, ko-KR, pt-BR, pt-PT, zh-CN, zh-HK, and zh-TW.

All text to speech locales listed here (except fa-IR, Persian (Iran)) are available out of the box with one selected female voice, one selected male voice, or both. We welcome your input to help us gauge demand for more languages and voices.

Embedded speech configuration

For cloud-connected applications, as shown in most Speech SDK samples, you use the SpeechConfig object with a Speech resource key and region. For embedded speech, you don't use a Speech resource. Instead of a cloud resource, you use the models and voices that you download to your local device.

Use the EmbeddedSpeechConfig object to set the location of the models or voices. If your application is used for both speech to text and text to speech, you can use the same EmbeddedSpeechConfig object to set the location of the models and voices.

// Provide the location of the models and voices.
List<string> paths = new List<string>();
paths.Add("C:\\dev\\embedded-speech\\stt-models");
paths.Add("C:\\dev\\embedded-speech\\tts-voices");
var embeddedSpeechConfig = EmbeddedSpeechConfig.FromPaths(paths.ToArray());

// For speech to text
embeddedSpeechConfig.SetSpeechRecognitionModel(
    "Microsoft Speech Recognizer en-US FP Model V8", 
    Environment.GetEnvironmentVariable("MODEL_KEY"));

// For text to speech
embeddedSpeechConfig.SetSpeechSynthesisVoice(
    "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
    Environment.GetEnvironmentVariable("VOICE_KEY"));
embeddedSpeechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);

Tip

The GetEnvironmentVariable function is defined in the speech to text quickstart and text to speech quickstart.

// Provide the location of the models and voices.
vector<string> paths;
paths.push_back("C:\\dev\\embedded-speech\\stt-models");
paths.push_back("C:\\dev\\embedded-speech\\tts-voices");
auto embeddedSpeechConfig = EmbeddedSpeechConfig::FromPaths(paths);

// For speech to text
embeddedSpeechConfig->SetSpeechRecognitionModel(
    "Microsoft Speech Recognizer en-US FP Model V8", 
    GetEnvironmentVariable("MODEL_KEY"));

// For text to speech
embeddedSpeechConfig->SetSpeechSynthesisVoice(
    "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
    GetEnvironmentVariable("VOICE_KEY"));
embeddedSpeechConfig->SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat::Riff24Khz16BitMonoPcm);

// Provide the location of the models and voices.
List<String> paths = new ArrayList<>();
paths.add("C:\\dev\\embedded-speech\\stt-models");
paths.add("C:\\dev\\embedded-speech\\tts-voices");
EmbeddedSpeechConfig embeddedSpeechConfig = EmbeddedSpeechConfig.fromPaths(paths);

// For speech to text
embeddedSpeechConfig.setSpeechRecognitionModel(
    "Microsoft Speech Recognizer en-US FP Model V8", 
    System.getenv("MODEL_KEY"));

// For text to speech
embeddedSpeechConfig.setSpeechSynthesisVoice(
    "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
    System.getenv("VOICE_KEY"));
embeddedSpeechConfig.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);
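
After configuration, the embedded config object is consumed much like a cloud SpeechConfig. Here's a minimal C# sketch, assuming the embeddedSpeechConfig from the example above, an async context, and the SpeechRecognizer and SpeechSynthesizer overloads that accept an embedded configuration:

// Speech to text with the embedded configuration.
using var micInput = AudioConfig.FromDefaultMicrophoneInput();
using var recognizer = new SpeechRecognizer(embeddedSpeechConfig, micInput);
var recognitionResult = await recognizer.RecognizeOnceAsync();
Console.WriteLine($"Recognized: {recognitionResult.Text}");

// Text to speech with the same embedded configuration.
using var speakerOutput = AudioConfig.FromDefaultSpeakerOutput();
using var synthesizer = new SpeechSynthesizer(embeddedSpeechConfig, speakerOutput);
using var synthesisResult = await synthesizer.SpeakTextAsync("Hello from embedded speech.");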

Embedded speech code samples

You can find ready-to-use embedded speech samples on GitHub. For remarks on building projects from scratch, see the sample-specific documentation.

Hybrid speech

Hybrid speech with the HybridSpeechConfig object uses the cloud speech service by default, with embedded speech as a fallback when cloud connectivity is limited or slow.

With hybrid speech configuration for speech to text (recognition models), embedded speech is used when connection to the cloud service fails after repeated attempts. Recognition might continue using the cloud service again if the connection is later resumed.

With hybrid speech configuration for text to speech (voices), embedded and cloud synthesis are run in parallel and the final result is selected based on response speed. The best result is evaluated again on each new synthesis request.
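
Here's a minimal C# sketch of a hybrid configuration, assuming a Speech resource key and region for the cloud side and the embedded configuration shown earlier (the environment variable names are placeholders):

// Combine a cloud configuration with an embedded configuration.
var cloudSpeechConfig = SpeechConfig.FromSubscription(
    Environment.GetEnvironmentVariable("SPEECH_KEY"),
    Environment.GetEnvironmentVariable("SPEECH_REGION"));
var hybridSpeechConfig = HybridSpeechConfig.FromConfigs(cloudSpeechConfig, embeddedSpeechConfig);

// The hybrid configuration is then used like any other config object.
using var recognizer = new SpeechRecognizer(hybridSpeechConfig, AudioConfig.FromDefaultMicrophoneInput());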

Cloud speech

For cloud speech, you use the SpeechConfig object, as shown in the speech to text quickstart and text to speech quickstart. To run the quickstarts for embedded speech, you can replace SpeechConfig with EmbeddedSpeechConfig or HybridSpeechConfig. Most of the other speech recognition and synthesis code is the same, whether using cloud, embedded, or hybrid configuration.
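
For example, in the speech to text quickstart, the cloud configuration line could be swapped for an embedded one. The following C# sketch reuses the path, model name, and key from the earlier configuration example:

// Cloud configuration, as in the quickstart:
// var speechConfig = SpeechConfig.FromSubscription(speechKey, speechRegion);

// Embedded replacement:
var speechConfig = EmbeddedSpeechConfig.FromPath("C:\\dev\\embedded-speech\\stt-models");
speechConfig.SetSpeechRecognitionModel(
    "Microsoft Speech Recognizer en-US FP Model V8",
    Environment.GetEnvironmentVariable("MODEL_KEY"));
// The rest of the recognition code is unchanged.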

Embedded voices capabilities

For embedded voices, note that certain SSML tags aren't currently supported because of differences in the model structure. For details on which SSML tags are supported, refer to the following table.

| Level 1 | Level 2 | Sub values | Support in embedded NTTS |
| ------- | ------- | ---------- | ------------------------ |
| audio | src | | No |
| bookmark | | | Yes |
| break | strength | | No |
| | time | | No |
| silence | type | Leading, Tailing, Comma-exact, etc. | No |
| | value | | No |
| emphasis | level | | No |
| lang | | | No |
| lexicon | uri | | Yes |
| math | | | No |
| msttsaudioduration | value | | No |
| msttsbackgroundaudio | src | | No |
| | volume | | No |
| | fadein | | No |
| | fadeout | | No |
| msttsexpress-as | style | | No |
| | styledegree | | No |
| | role | | No |
| msttssilence | | | No |
| msttsviseme | type | redlips_front, FacialExpression | No |
| p | | | Yes |
| phoneme | alphabet | ipa, sapi, ups, etc. | Yes |
| | ph | | Yes |
| prosody | contour | Sentence-level support; word level only in en-US and zh-CN | Yes |
| | pitch | | Yes |
| | range | | Yes |
| | rate | | Yes |
| | volume | | Yes |
| s | | | Yes |
| say-as | interpret-as | characters, spell-out, number_digit, date, etc. | Yes |
| | format | | Yes |
| | detail | | Yes |
| sub | alias | | Yes |
| speak | | | Yes |
| voice | | | No |
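
For example, the following C# sketch uses only elements that the table marks as supported (speak, p, s, prosody, and say-as), assuming a synthesizer created from the embedded configuration shown earlier:

// Build SSML from supported elements only. Because the voice element
// isn't supported, the output voice comes from SetSpeechSynthesisVoice
// in the embedded configuration.
string ssml =
    "<speak version='1.0' xml:lang='en-US' xmlns='http://www.w3.org/2001/10/synthesis'>" +
    "<p><s><prosody rate='-10%'>Your appointment is on " +
    "<say-as interpret-as='date' format='mdy'>10/19/2025</say-as>.</prosody></s></p>" +
    "</speak>";
using var ssmlResult = await synthesizer.SpeakSsmlAsync(ssml);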

Next steps