How to recognize and translate speech

Reference documentation | Package (NuGet) | Additional samples on GitHub

In this how-to guide, you learn how to recognize human speech and translate it to another language.

See the speech translation overview for more information about:

  • Translating speech to text
  • Translating speech to multiple target languages
  • Performing direct speech to speech translation

Sensitive data and environment variables

The example source code in this article depends on environment variables for storing sensitive data, such as the Speech resource's key and region. The Program class contains two static readonly string values that are assigned from the host machine's environment variables: SPEECH__SUBSCRIPTION__KEY and SPEECH__SERVICE__REGION. Both of these fields are at the class scope, so they're accessible within method bodies of the class:

public class Program
{
    static readonly string SPEECH__SUBSCRIPTION__KEY =
        Environment.GetEnvironmentVariable(nameof(SPEECH__SUBSCRIPTION__KEY));
    
    static readonly string SPEECH__SERVICE__REGION =
        Environment.GetEnvironmentVariable(nameof(SPEECH__SERVICE__REGION));

    static Task Main() => Task.CompletedTask;
}

For more information on environment variables, see Environment variables and application configuration.
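
Because Environment.GetEnvironmentVariable returns null when a variable isn't set, it can help to fail fast with a clear error message. The following optional sketch shows one way to do that; the RequireEnvironmentVariable helper is illustrative and isn't part of the Speech SDK:

static string RequireEnvironmentVariable(string name) =>
    Environment.GetEnvironmentVariable(name)
    ?? throw new InvalidOperationException($"Environment variable '{name}' isn't set.");

// The field initializers shown earlier could then call the helper instead:
static readonly string SPEECH__SUBSCRIPTION__KEY =
    RequireEnvironmentVariable(nameof(SPEECH__SUBSCRIPTION__KEY));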

Important

If you use an API key, store it securely somewhere else, such as in Azure Key Vault. Don't include the API key directly in your code, and never post it publicly.

For more information about AI services security, see Authenticate requests to Azure AI services.

Create a speech translation configuration

To call the Speech service by using the Speech SDK, you need to create a SpeechTranslationConfig instance. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Tip

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

You can initialize SpeechTranslationConfig in a few ways:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

Let's look at how you create a SpeechTranslationConfig instance by using a key and region. Get the Speech resource key and region in the Azure portal.

public class Program
{
    static readonly string SPEECH__SUBSCRIPTION__KEY =
        Environment.GetEnvironmentVariable(nameof(SPEECH__SUBSCRIPTION__KEY));
    
    static readonly string SPEECH__SERVICE__REGION =
        Environment.GetEnvironmentVariable(nameof(SPEECH__SERVICE__REGION));

    static Task Main() => TranslateSpeechAsync();

    static async Task TranslateSpeechAsync()
    {
        var speechTranslationConfig =
            SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    }
}
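
The other initialization options from the earlier list follow the same pattern. Here's a hedged sketch: the endpoint URI mirrors the v2 endpoint format shown later in this article, the host URI is a placeholder, and "YourAuthorizationToken" stands in for a token that you obtain separately.

static void ShowOtherInitializationOptions()
{
    // With an endpoint: a key or authorization token is optional.
    var fromEndpoint = SpeechTranslationConfig.FromEndpoint(
        new Uri($"wss://{SPEECH__SERVICE__REGION}.stt.speech.microsoft.com/speech/universal/v2"),
        SPEECH__SUBSCRIPTION__KEY);

    // With a host address: a key or authorization token is optional.
    var fromHost = SpeechTranslationConfig.FromHost(
        new Uri($"wss://{SPEECH__SERVICE__REGION}.stt.speech.microsoft.com"),
        SPEECH__SUBSCRIPTION__KEY);

    // With an authorization token and the associated region.
    var fromToken = SpeechTranslationConfig.FromAuthorizationToken(
        "YourAuthorizationToken", SPEECH__SERVICE__REGION);
}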

Change the source language

One common task of speech translation is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, interact with the SpeechTranslationConfig instance by assigning it to the SpeechRecognitionLanguage property:

static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    // Source (input) language
    speechTranslationConfig.SpeechRecognitionLanguage = "it-IT";
}

The SpeechRecognitionLanguage property expects a language-locale format string. Refer to the list of supported speech translation locales.

Add a translation language

Another common task of speech translation is to specify target translation languages. At least one is required, but multiples are supported. The following code snippet sets both French and German as translation language targets:

static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    speechTranslationConfig.SpeechRecognitionLanguage = "it-IT";
    
    speechTranslationConfig.AddTargetLanguage("fr");
    speechTranslationConfig.AddTargetLanguage("de");
}

With every call to AddTargetLanguage, a new target translation language is specified. In other words, when speech is recognized from the source language, each target translation is available as part of the resulting translation operation.

Initialize a translation recognizer

After you've created a SpeechTranslationConfig instance, the next step is to initialize TranslationRecognizer. When you initialize TranslationRecognizer, you need to pass it your speechTranslationConfig instance. The configuration object provides the credentials that the Speech service requires to validate your request.

If you're recognizing speech by using your device's default microphone, here's what the TranslationRecognizer instance should look like:

static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    var fromLanguage = "en-US";
    var toLanguages = new List<string> { "it", "fr", "de" };
    speechTranslationConfig.SpeechRecognitionLanguage = fromLanguage;
    toLanguages.ForEach(speechTranslationConfig.AddTargetLanguage);

    using var translationRecognizer = new TranslationRecognizer(speechTranslationConfig);
}

If you want to specify the audio input device, then you need to create an AudioConfig class instance and provide the audioConfig parameter when initializing TranslationRecognizer.

First, reference the AudioConfig object as follows:

static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    
    var fromLanguage = "en-US";
    var toLanguages = new List<string> { "it", "fr", "de" };
    speechTranslationConfig.SpeechRecognitionLanguage = fromLanguage;
    toLanguages.ForEach(speechTranslationConfig.AddTargetLanguage);

    using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
    using var translationRecognizer = new TranslationRecognizer(speechTranslationConfig, audioConfig);
}

If you want to provide an audio file instead of using a microphone, you still need to provide an audioConfig parameter. However, when you create an AudioConfig class instance, instead of calling FromDefaultMicrophoneInput, you call FromWavFileInput and pass the filename parameter:

static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    
    var fromLanguage = "en-US";
    var toLanguages = new List<string> { "it", "fr", "de" };
    speechTranslationConfig.SpeechRecognitionLanguage = fromLanguage;
    toLanguages.ForEach(speechTranslationConfig.AddTargetLanguage);

    using var audioConfig = AudioConfig.FromWavFileInput("YourAudioFile.wav");
    using var translationRecognizer = new TranslationRecognizer(speechTranslationConfig, audioConfig);
}

Translate speech

To translate speech, the Speech SDK relies on a microphone or an audio file input. Speech recognition occurs before speech translation. After all objects are initialized, call the recognize-once function and get the result:

static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    
    var fromLanguage = "en-US";
    var toLanguages = new List<string> { "it", "fr", "de" };
    speechTranslationConfig.SpeechRecognitionLanguage = fromLanguage;
    toLanguages.ForEach(speechTranslationConfig.AddTargetLanguage);

    using var translationRecognizer = new TranslationRecognizer(speechTranslationConfig);

    Console.Write($"Say something in '{fromLanguage}' and ");
    Console.WriteLine($"we'll translate into '{string.Join("', '", toLanguages)}'.\n");
    
    var result = await translationRecognizer.RecognizeOnceAsync();
    if (result.Reason == ResultReason.TranslatedSpeech)
    {
        Console.WriteLine($"Recognized: \"{result.Text}\":");
        foreach (var element in result.Translations)
        {
            Console.WriteLine($"    TRANSLATED into '{element.Key}': {element.Value}");
        }
    }
}

For more information about speech to text, see the basics of speech recognition.

Event-based translation

The TranslationRecognizer object exposes a Recognizing event. The event fires several times and provides a mechanism to retrieve the intermediate translation results.

Note

Intermediate translation results aren't available when you use multi-lingual speech translation.

The following example prints the intermediate translation results to the console:

// This example assumes that `config` (a SpeechTranslationConfig), `fromLanguage`, and
// `stopTranslation` (a TaskCompletionSource used to signal when to stop) are defined earlier in your program.
using (var audioInput = AudioConfig.FromWavFileInput(@"whatstheweatherlike.wav"))
{
    using (var translationRecognizer = new TranslationRecognizer(config, audioInput))
    {
        // Subscribes to events.
        translationRecognizer.Recognizing += (s, e) =>
        {
            Console.WriteLine($"RECOGNIZING in '{fromLanguage}': Text={e.Result.Text}");
            foreach (var element in e.Result.Translations)
            {
                Console.WriteLine($"    TRANSLATING into '{element.Key}': {element.Value}");
            }
        };

        translationRecognizer.Recognized += (s, e) => {
            if (e.Result.Reason == ResultReason.TranslatedSpeech)
            {
                Console.WriteLine($"RECOGNIZED in '{fromLanguage}': Text={e.Result.Text}");
                foreach (var element in e.Result.Translations)
                {
                    Console.WriteLine($"    TRANSLATED into '{element.Key}': {element.Value}");
                }
            }
            else if (e.Result.Reason == ResultReason.RecognizedSpeech)
            {
                Console.WriteLine($"RECOGNIZED: Text={e.Result.Text}");
                Console.WriteLine($"    Speech not translated.");
            }
            else if (e.Result.Reason == ResultReason.NoMatch)
            {
                Console.WriteLine($"NOMATCH: Speech could not be recognized.");
            }
        };

        // Starts continuous recognition. Use StopContinuousRecognitionAsync() to stop recognition later.
        Console.WriteLine("Start translation...");
        await translationRecognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);

        // Waits for completion.
        // Use Task.WaitAny to keep the task rooted.
        Task.WaitAny(new[] { stopTranslation.Task });

        // Stops translation.
        await translationRecognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);
    }
}

Synthesize translations

After a successful speech recognition and translation, the result contains all the translations in a dictionary. The Translations dictionary key is the target translation language, and the value is the translated text. Recognized speech can be translated and then synthesized in a different language (speech-to-speech).

Event-based synthesis

The TranslationRecognizer object exposes a Synthesizing event. The event fires several times and provides a mechanism to retrieve the synthesized audio from the translation recognition result. If you're translating to multiple languages, see Manual synthesis.

Specify the synthesis voice by setting the VoiceName property, and provide an event handler for the Synthesizing event to get the audio. The following example saves the translated audio as a .wav file.

Important

The event-based synthesis works only with a single translation. Do not add multiple target translation languages. Additionally, the VoiceName value should be the same language as the target translation language. For example, "de" could map to "de-DE-Hedda".

static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    
    var fromLanguage = "en-US";
    var toLanguage = "de";
    speechTranslationConfig.SpeechRecognitionLanguage = fromLanguage;
    speechTranslationConfig.AddTargetLanguage(toLanguage);

    speechTranslationConfig.VoiceName = "de-DE-Hedda";

    using var translationRecognizer = new TranslationRecognizer(speechTranslationConfig);

    translationRecognizer.Synthesizing += (_, e) =>
    {
        var audio = e.Result.GetAudio();
        Console.WriteLine($"Audio synthesized: {audio.Length:#,0} byte(s) {(audio.Length == 0 ? "(Complete)" : "")}");

        if (audio.Length > 0)
        {
            File.WriteAllBytes("YourAudioFile.wav", audio);
        }
    };

    Console.Write($"Say something in '{fromLanguage}' and ");
    Console.WriteLine($"we'll translate into '{toLanguage}'.\n");

    var result = await translationRecognizer.RecognizeOnceAsync();
    if (result.Reason == ResultReason.TranslatedSpeech)
    {
        Console.WriteLine($"Recognized: \"{result.Text}\"");
        Console.WriteLine($"Translated into '{toLanguage}': {result.Translations[toLanguage]}");
    }
}

Manual synthesis

You can use the Translations dictionary to synthesize audio from the translation text. Iterate through each translation and synthesize it. When you're creating a SpeechSynthesizer instance, the SpeechConfig object needs to have its SpeechSynthesisVoiceName property set to the desired voice.

The following example translates to five languages. Each translation is then synthesized to an audio file in the corresponding neural language.

static async Task TranslateSpeechAsync()
{
    var speechTranslationConfig =
        SpeechTranslationConfig.FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    var fromLanguage = "en-US";
    var toLanguages = new List<string> { "de", "en", "it", "pt", "zh-Hans" };
    speechTranslationConfig.SpeechRecognitionLanguage = fromLanguage;
    toLanguages.ForEach(speechTranslationConfig.AddTargetLanguage);

    using var translationRecognizer = new TranslationRecognizer(speechTranslationConfig);

    Console.Write($"Say something in '{fromLanguage}' and ");
    Console.WriteLine($"we'll translate into '{string.Join("', '", toLanguages)}'.\n");

    var result = await translationRecognizer.RecognizeOnceAsync();
    if (result.Reason == ResultReason.TranslatedSpeech)
    {
        var languageToVoiceMap = new Dictionary<string, string>
        {
            ["de"] = "de-DE-KatjaNeural",
            ["en"] = "en-US-AriaNeural",
            ["it"] = "it-IT-ElsaNeural",
            ["pt"] = "pt-BR-FranciscaNeural",
            ["zh-Hans"] = "zh-CN-XiaoxiaoNeural"
        };

        Console.WriteLine($"Recognized: \"{result.Text}\"");

        foreach (var (language, translation) in result.Translations)
        {
            Console.WriteLine($"Translated into '{language}': {translation}");

            var speechConfig =
                SpeechConfig.FromSubscription(
                    SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
            speechConfig.SpeechSynthesisVoiceName = languageToVoiceMap[language];

            using var audioConfig = AudioConfig.FromWavFileOutput($"{language}-translation.wav");
            using var speechSynthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
            
            await speechSynthesizer.SpeakTextAsync(translation);
        }
    }
}

For more information about speech synthesis, see the basics of speech synthesis.

Multi-lingual translation with language identification

In many scenarios, you might not know which input languages to specify. By using language identification, you can detect up to 10 possible input languages and automatically translate to your target languages.

The following example anticipates that en-US or zh-CN should be detected because they're defined in AutoDetectSourceLanguageConfig. Then, the speech is translated to de and fr as specified in the calls to AddTargetLanguage().

speechTranslationConfig.AddTargetLanguage("de");
speechTranslationConfig.AddTargetLanguage("fr");
var autoDetectSourceLanguageConfig = AutoDetectSourceLanguageConfig.FromLanguages(new string[] { "en-US", "zh-CN" });
var translationRecognizer = new TranslationRecognizer(speechTranslationConfig, autoDetectSourceLanguageConfig, audioConfig);

For a complete code sample, see language identification.
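
For reference, here's a minimal sketch of how the preceding fragment could fit into the TranslateSpeechAsync pattern used earlier in this article. It assumes that the v2 endpoint requirement described in the next section also applies when you provide candidate languages, and it reuses the environment variables defined at the top of the article:

static async Task TranslateSpeechWithLanguageIdentificationAsync()
{
    // Language identification with speech translation uses the v2 endpoint.
    var v2Endpoint = new Uri(
        $"wss://{SPEECH__SERVICE__REGION}.stt.speech.microsoft.com/speech/universal/v2");
    var speechTranslationConfig =
        SpeechTranslationConfig.FromEndpoint(v2Endpoint, SPEECH__SUBSCRIPTION__KEY);

    speechTranslationConfig.AddTargetLanguage("de");
    speechTranslationConfig.AddTargetLanguage("fr");

    // Candidate source languages for the service to detect.
    var autoDetectSourceLanguageConfig =
        AutoDetectSourceLanguageConfig.FromLanguages(new[] { "en-US", "zh-CN" });

    using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
    using var translationRecognizer = new TranslationRecognizer(
        speechTranslationConfig, autoDetectSourceLanguageConfig, audioConfig);

    var result = await translationRecognizer.RecognizeOnceAsync();
    if (result.Reason == ResultReason.TranslatedSpeech)
    {
        Console.WriteLine($"Recognized: \"{result.Text}\"");
        foreach (var (language, translation) in result.Translations)
        {
            Console.WriteLine($"Translated into '{language}': {translation}");
        }
    }
}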

Multi-lingual speech translation without source language candidates

Multi-lingual speech translation introduces a new level of speech translation technology: you don't have to specify an input language, and the service handles language switches within the same session. You can build these capabilities directly into your products.

Currently when you use Language ID with speech translation, you must create the SpeechTranslationConfig object from the v2 endpoint. Replace the string "YourServiceRegion" with your Speech resource region (such as "westus"). Replace "YourSubscriptionKey" with your Speech resource key.

var v2EndpointInString = String.Format("wss://{0}.stt.speech.microsoft.com/speech/universal/v2", "YourServiceRegion");
var v2EndpointUrl = new Uri(v2EndpointInString);
var speechTranslationConfig = SpeechTranslationConfig.FromEndpoint(v2EndpointUrl, "YourSubscriptionKey");

Specify the translation target languages, replacing them with languages of your choice. You can add more lines as needed.

config.AddTargetLanguage("de");
config.AddTargetLanguage("fr");

A key differentiator of multi-lingual speech translation is that you don't need to specify the source language, because the service detects it automatically. Create the AutoDetectSourceLanguageConfig object with the FromOpenRange method to let the service know that you want to use multi-lingual speech translation with no specified source language.

AutoDetectSourceLanguageConfig autoDetectSourceLanguageConfig = AutoDetectSourceLanguageConfig.FromOpenRange();
// audioConfig is an AudioConfig instance, such as AudioConfig.FromDefaultMicrophoneInput().
var translationRecognizer = new TranslationRecognizer(speechTranslationConfig, autoDetectSourceLanguageConfig, audioConfig);

For a complete code sample with the Speech SDK, see speech translation samples on GitHub.
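
Putting these pieces together, here's a hedged sketch of the open-range flow. It reuses the continuous recognition pattern shown earlier in this article, because continuous recognition is what lets the service follow language switches within a session. The method name FromOpenRange follows .NET naming conventions and might differ in older SDK versions:

static async Task TranslateSpeechOpenRangeAsync()
{
    var v2EndpointUrl = new Uri(
        $"wss://{SPEECH__SERVICE__REGION}.stt.speech.microsoft.com/speech/universal/v2");
    var speechTranslationConfig =
        SpeechTranslationConfig.FromEndpoint(v2EndpointUrl, SPEECH__SUBSCRIPTION__KEY);

    speechTranslationConfig.AddTargetLanguage("de");
    speechTranslationConfig.AddTargetLanguage("fr");

    // No candidate source languages: the service detects (and follows) the spoken language.
    var autoDetectSourceLanguageConfig = AutoDetectSourceLanguageConfig.FromOpenRange();

    using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
    using var translationRecognizer = new TranslationRecognizer(
        speechTranslationConfig, autoDetectSourceLanguageConfig, audioConfig);

    var stopTranslation = new TaskCompletionSource<int>();

    translationRecognizer.Recognized += (s, e) =>
    {
        if (e.Result.Reason == ResultReason.TranslatedSpeech)
        {
            Console.WriteLine($"RECOGNIZED: Text={e.Result.Text}");
            foreach (var element in e.Result.Translations)
            {
                Console.WriteLine($"    TRANSLATED into '{element.Key}': {element.Value}");
            }
        }
    };

    translationRecognizer.SessionStopped += (s, e) => stopTranslation.TrySetResult(0);
    translationRecognizer.Canceled += (s, e) => stopTranslation.TrySetResult(0);

    await translationRecognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
    await stopTranslation.Task.ConfigureAwait(false);
    await translationRecognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);
}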

Using custom translation in speech translation

The custom translation feature in speech translation seamlessly integrates with the Azure Custom Translation service, allowing you to achieve more accurate and tailored translations. As the integration directly harnesses the capabilities of the Azure custom translation service, you need to use a multi-service resource to ensure the correct functioning of the complete set of features. For detailed instructions, please consult the guide on Create a multi-service resource for Azure AI services.

Additionally, for offline training of a custom translator and obtaining a "Category ID," please refer to the step-by-step script provided in the Quickstart: Build, deploy, and use a custom model - Custom Translator.

// Creates an instance of a translation recognizer by using the speech translation configuration.
// Use the same subscription key that you used to generate the custom model.
// The V2 endpoint is required for the custom translation feature.
// Example: "wss://westcentralus.stt.speech.microsoft.com/speech/universal/v2"
try (SpeechTranslationConfig config = SpeechTranslationConfig.fromEndpoint(URI.create(endpointUrl), speechSubscriptionKey)) {

    // Sets the source and target language(s).
    // ...

    // Sets the category ID.
    config.setCustomModelCategoryId("yourCategoryId");

    // ...
}

Reference documentation | Package (NuGet) | Additional samples on GitHub

In this how-to guide, you learn how to recognize human speech and translate it to another language.

See the speech translation overview for more information about:

  • Translating speech to text
  • Translating speech to multiple target languages
  • Performing direct speech to speech translation

Sensitive data and environment variables

The example source code in this article depends on environment variables for storing sensitive data, such as the Speech resource's key and region. The C++ code file contains two string values that are assigned from the host machine's environment variables: SPEECH__SUBSCRIPTION__KEY and SPEECH__SERVICE__REGION. Both of these variables are declared at file scope, so they're accessible within the function bodies of the code file:

auto SPEECH__SUBSCRIPTION__KEY = getenv("SPEECH__SUBSCRIPTION__KEY");
auto SPEECH__SERVICE__REGION = getenv("SPEECH__SERVICE__REGION");

For more information on environment variables, see Environment variables and application configuration.

Important

If you use an API key, store it securely somewhere else, such as in Azure Key Vault. Don't include the API key directly in your code, and never post it publicly.

For more information about AI services security, see Authenticate requests to Azure AI services.

Create a speech translation configuration

To call the Speech service by using the Speech SDK, you need to create a SpeechTranslationConfig instance. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Tip

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

You can initialize SpeechTranslationConfig in a few ways:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

Let's look at how you create a SpeechTranslationConfig instance by using a key and region. Get the Speech resource key and region in the Azure portal.

auto SPEECH__SUBSCRIPTION__KEY = getenv("SPEECH__SUBSCRIPTION__KEY");
auto SPEECH__SERVICE__REGION = getenv("SPEECH__SERVICE__REGION");

void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
}

int main(int argc, char** argv) {
    setlocale(LC_ALL, "");
    translateSpeech();
    return 0;
}

Change the source language

One common task of speech translation is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, interact with the SpeechTranslationConfig instance by calling the SetSpeechRecognitionLanguage method.

void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    // Source (input) language
    speechTranslationConfig->SetSpeechRecognitionLanguage("it-IT");
}

The SetSpeechRecognitionLanguage function expects a language-locale format string. Refer to the list of supported speech translation locales.

Add a translation language

Another common task of speech translation is to specify target translation languages. At least one is required, but multiples are supported. The following code snippet sets both French and German as translation language targets:

void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    speechTranslationConfig->SetSpeechRecognitionLanguage("it-IT");

    speechTranslationConfig->AddTargetLanguage("fr");
    speechTranslationConfig->AddTargetLanguage("de");
}

With every call to AddTargetLanguage, a new target translation language is specified. In other words, when speech is recognized from the source language, each target translation is available as part of the resulting translation operation.

Initialize a translation recognizer

After you've created a SpeechTranslationConfig instance, the next step is to initialize TranslationRecognizer. When you initialize TranslationRecognizer, you need to pass it your speechTranslationConfig instance. The configuration object provides the credentials that the Speech service requires to validate your request.

If you're recognizing speech by using your device's default microphone, here's what TranslationRecognizer should look like:

void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    auto fromLanguage = "en-US";
    auto toLanguages = { "it", "fr", "de" };
    speechTranslationConfig->SetSpeechRecognitionLanguage(fromLanguage);
    for (auto language : toLanguages) {
        speechTranslationConfig->AddTargetLanguage(language);
    }

    auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig);
}

If you want to specify the audio input device, then you need to create an AudioConfig class instance and provide the audioConfig parameter when initializing TranslationRecognizer.

First, reference the AudioConfig object as follows:

void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    auto fromLanguage = "en-US";
    auto toLanguages = { "it", "fr", "de" };
    speechTranslationConfig->SetSpeechRecognitionLanguage(fromLanguage);
    for (auto language : toLanguages) {
        speechTranslationConfig->AddTargetLanguage(language);
    }

    auto audioConfig = AudioConfig::FromDefaultMicrophoneInput();
    auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig, audioConfig);
}

If you want to provide an audio file instead of using a microphone, you still need to provide an audioConfig parameter. However, when you create an AudioConfig class instance, instead of calling FromDefaultMicrophoneInput, you call FromWavFileInput and pass the filename parameter:

void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    auto fromLanguage = "en-US";
    auto toLanguages = { "it", "fr", "de" };
    speechTranslationConfig->SetSpeechRecognitionLanguage(fromLanguage);
    for (auto language : toLanguages) {
        speechTranslationConfig->AddTargetLanguage(language);
    }

    auto audioConfig = AudioConfig::FromWavFileInput("YourAudioFile.wav");
    auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig, audioConfig);
}

Translate speech

To translate speech, the Speech SDK relies on a microphone or an audio file input. Speech recognition occurs before speech translation. After all objects are initialized, call the recognize-once function and get the result:

void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    string fromLanguage = "en-US";
    string toLanguages[3] = { "it", "fr", "de" };
    speechTranslationConfig->SetSpeechRecognitionLanguage(fromLanguage);
    for (auto language : toLanguages) {
        speechTranslationConfig->AddTargetLanguage(language);
    }

    auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig);
    cout << "Say something in '" << fromLanguage << "' and we'll translate...\n";

    auto result = translationRecognizer->RecognizeOnceAsync().get();
    if (result->Reason == ResultReason::TranslatedSpeech)
    {
        cout << "Recognized: \"" << result->Text << "\"" << std::endl;
        for (auto pair : result->Translations)
        {
            auto language = pair.first;
            auto translation = pair.second;
            cout << "Translated into '" << language << "': " << translation << std::endl;
        }
    }
}

For more information about speech to text, see the basics of speech recognition.

Synthesize translations

After a successful speech recognition and translation, the result contains all the translations in a dictionary. The Translations dictionary key is the target translation language, and the value is the translated text. Recognized speech can be translated and then synthesized in a different language (speech-to-speech).

Event-based synthesis

The TranslationRecognizer object exposes a Synthesizing event. The event fires several times and provides a mechanism to retrieve the synthesized audio from the translation recognition result. If you're translating to multiple languages, see Manual synthesis.

Specify the synthesis voice by calling the SetVoiceName method, and provide an event handler for the Synthesizing event to get the audio. The following example saves the translated audio as a .wav file.

Important

The event-based synthesis works only with a single translation. Do not add multiple target translation languages. Additionally, the SetVoiceName value should be the same language as the target translation language. For example, "de" could map to "de-DE-Hedda".

void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    auto fromLanguage = "en-US";
    auto toLanguage = "de";
    speechTranslationConfig->SetSpeechRecognitionLanguage(fromLanguage);
    speechTranslationConfig->AddTargetLanguage(toLanguage);

    speechTranslationConfig->SetVoiceName("de-DE-Hedda");

    auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig);
    translationRecognizer->Synthesizing.Connect([](const TranslationSynthesisEventArgs& e)
        {
            auto audio = e.Result->Audio;
            auto size = audio.size();
            cout << "Audio synthesized: " << size << " byte(s)" << (size == 0 ? "(COMPLETE)" : "") << std::endl;

            if (size > 0) {
                ofstream file("translation.wav", ios::out | ios::binary);
                auto audioData = audio.data();
                file.write((const char*)audioData, sizeof(audio[0]) * size);
                file.close();
            }
        });

    cout << "Say something in '" << fromLanguage << "' and we'll translate...\n";

    auto result = translationRecognizer->RecognizeOnceAsync().get();
    if (result->Reason == ResultReason::TranslatedSpeech)
    {
        cout << "Recognized: \"" << result->Text << "\"" << std::endl;
        for (auto pair : result->Translations)
        {
            auto language = pair.first;
            auto translation = pair.second;
            cout << "Translated into '" << language << "': " << translation << std::endl;
        }
    }
}

Manual synthesis

You can use the Translations dictionary to synthesize audio from the translation text. Iterate through each translation and synthesize it. When you're creating a SpeechSynthesizer instance, the SpeechConfig object needs to have the desired voice set by calling SetSpeechSynthesisVoiceName.

The following example translates to five languages. Each translation is then synthesized to an audio file in the corresponding neural language.

void translateSpeech() {
    auto speechTranslationConfig =
        SpeechTranslationConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    auto fromLanguage = "en-US";
    auto toLanguages = { "de", "en", "it", "pt", "zh-Hans" };
    speechTranslationConfig->SetSpeechRecognitionLanguage(fromLanguage);
    for (auto language : toLanguages) {
        speechTranslationConfig->AddTargetLanguage(language);
    }

    auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig);

    cout << "Say something in '" << fromLanguage << "' and we'll translate...\n";

    auto result = translationRecognizer->RecognizeOnceAsync().get();
    if (result->Reason == ResultReason::TranslatedSpeech)
    {
        map<string, string> languageToVoiceMap;
        languageToVoiceMap["de"] = "de-DE-KatjaNeural";
        languageToVoiceMap["en"] = "en-US-AriaNeural";
        languageToVoiceMap["it"] = "it-IT-ElsaNeural";
        languageToVoiceMap["pt"] = "pt-BR-FranciscaNeural";
        languageToVoiceMap["zh-Hans"] = "zh-CN-XiaoxiaoNeural";

        cout << "Recognized: \"" << result->Text << "\"" << std::endl;
        for (auto pair : result->Translations)
        {
            auto language = pair.first;
            auto translation = pair.second;
            cout << "Translated into '" << language << "': " << translation << std::endl;

            auto speechConfig =
                SpeechConfig::FromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
            speechConfig->SetSpeechSynthesisVoiceName(languageToVoiceMap[language]);

            auto audioConfig = AudioConfig::FromWavFileOutput(language + "-translation.wav");
            auto speechSynthesizer = SpeechSynthesizer::FromConfig(speechConfig, audioConfig);

            speechSynthesizer->SpeakTextAsync(translation).get();
        }
    }
}

For more information about speech synthesis, see the basics of speech synthesis.

Multilingual translation with language identification

In many scenarios, you might not know which input languages to specify. By using language identification, you can detect up to 10 possible input languages and automatically translate to your target languages.

The following example anticipates that en-US or zh-CN should be detected because they're defined in AutoDetectSourceLanguageConfig. Then, the speech will be translated to de and fr as specified in the calls to AddTargetLanguage().

speechTranslationConfig->AddTargetLanguage("de");
speechTranslationConfig->AddTargetLanguage("fr");
auto autoDetectSourceLanguageConfig = AutoDetectSourceLanguageConfig::FromLanguages({ "en-US", "zh-CN" });
auto translationRecognizer = TranslationRecognizer::FromConfig(speechTranslationConfig, autoDetectSourceLanguageConfig, audioConfig);

For a complete code sample, see language identification.

Reference documentation | Package (Go) | Additional samples on GitHub

The Speech SDK for Go does not support speech translation. Please select another programming language or the Go reference and samples linked from the beginning of this article.

Reference documentation | Additional samples on GitHub

In this how-to guide, you learn how to recognize human speech and translate it to another language.

See the speech translation overview for more information about:

  • Translating speech to text
  • Translating speech to multiple target languages
  • Performing direct speech to speech translation

Sensitive data and environment variables

The example source code in this article depends on environment variables for storing sensitive data, such as the Speech resource's key and region. The Java code file contains two static final String values that are assigned from the host machine's environment variables: SPEECH__SUBSCRIPTION__KEY and SPEECH__SERVICE__REGION. Both of these fields are at the class scope, so they're accessible within method bodies of the class:

public class App {

    static final String SPEECH__SUBSCRIPTION__KEY = System.getenv("SPEECH__SUBSCRIPTION__KEY");
    static final String SPEECH__SERVICE__REGION = System.getenv("SPEECH__SERVICE__REGION");

    public static void main(String[] args) { }
}

For more information on environment variables, see Environment variables and application configuration.

Important

If you use an API key, store it securely somewhere else, such as in Azure Key Vault. Don't include the API key directly in your code, and never post it publicly.

For more information about AI services security, see Authenticate requests to Azure AI services.

Create a speech translation configuration

To call the Speech service by using the Speech SDK, you need to create a SpeechTranslationConfig instance. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Tip

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

You can initialize a SpeechTranslationConfig instance in a few ways:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

Let's look at how you create a SpeechTranslationConfig instance by using a key and region. Get the Speech resource key and region in the Azure portal.

public class App {

    static final String SPEECH__SUBSCRIPTION__KEY = System.getenv("SPEECH__SERVICE__KEY");
    static final String SPEECH__SERVICE__REGION = System.getenv("SPEECH__SERVICE__REGION");

    public static void main(String[] args) {
        try {
            translateSpeech();
            System.exit(0);
        } catch (Exception ex) {
            System.out.println(ex);
            System.exit(1);
        }
    }

    static void translateSpeech() {
        SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
            SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    }
}

Change the source language

One common task of speech translation is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, interact with the SpeechTranslationConfig instance by calling the setSpeechRecognitionLanguage method:

static void translateSpeech() {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    
    // Source (input) language
    speechTranslationConfig.setSpeechRecognitionLanguage("it-IT");
}

The setSpeechRecognitionLanguage function expects a language-locale format string. Refer to the list of supported speech translation locales.

Add a translation language

Another common task of speech translation is to specify target translation languages. At least one is required, but multiples are supported. The following code snippet sets both French and German as translation language targets:

static void translateSpeech() {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    
    speechTranslationConfig.setSpeechRecognitionLanguage("it-IT");

    // Translate to languages. See https://aka.ms/speech/sttt-languages
    speechTranslationConfig.addTargetLanguage("fr");
    speechTranslationConfig.addTargetLanguage("de");
}

With every call to addTargetLanguage, a new target translation language is specified. In other words, when speech is recognized from the source language, each target translation is available as part of the resulting translation operation.

Initialize a translation recognizer

After you've created a SpeechTranslationConfig instance, the next step is to initialize TranslationRecognizer. When you initialize TranslationRecognizer, you need to pass it your speechTranslationConfig instance. The configuration object provides the credentials that the Speech service requires to validate your request.

If you're recognizing speech by using your device's default microphone, here's what TranslationRecognizer should look like:

static void translateSpeech() {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    
    String fromLanguage = "en-US";
    String[] toLanguages = { "it", "fr", "de" };
    speechTranslationConfig.setSpeechRecognitionLanguage(fromLanguage);
    for (String language : toLanguages) {
        speechTranslationConfig.addTargetLanguage(language);
    }

    try (TranslationRecognizer translationRecognizer = new TranslationRecognizer(speechTranslationConfig)) {
    }
}

If you want to specify the audio input device, then you need to create an AudioConfig class instance and provide the audioConfig parameter when initializing TranslationRecognizer.

First, reference the AudioConfig object as follows:

static void translateSpeech() {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    
    String fromLanguage = "en-US";
    String[] toLanguages = { "it", "fr", "de" };
    speechTranslationConfig.setSpeechRecognitionLanguage(fromLanguage);
    for (String language : toLanguages) {
        speechTranslationConfig.addTargetLanguage(language);
    }

    AudioConfig audioConfig = AudioConfig.fromDefaultMicrophoneInput();
    try (TranslationRecognizer translationRecognizer = new TranslationRecognizer(speechTranslationConfig, audioConfig)) {
        
    }
}

If you want to provide an audio file instead of using a microphone, you still need to provide an audioConfig parameter. However, when you create an AudioConfig class instance, instead of calling fromDefaultMicrophoneInput, you call fromWavFileInput and pass the filename parameter:

static void translateSpeech() {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    
    String fromLanguage = "en-US";
    String[] toLanguages = { "it", "fr", "de" };
    speechTranslationConfig.setSpeechRecognitionLanguage(fromLanguage);
    for (String language : toLanguages) {
        speechTranslationConfig.addTargetLanguage(language);
    }

    AudioConfig audioConfig = AudioConfig.fromWavFileInput("YourAudioFile.wav");
    try (TranslationRecognizer translationRecognizer = new TranslationRecognizer(speechTranslationConfig, audioConfig)) {
        
    }
}

Translate speech

To translate speech, the Speech SDK relies on a microphone or an audio file input. Speech recognition occurs before speech translation. After all objects are initialized, call the recognize-once function and get the result:

static void translateSpeech() throws ExecutionException, InterruptedException {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    
    String fromLanguage = "en-US";
    String[] toLanguages = { "it", "fr", "de" };
    speechTranslationConfig.setSpeechRecognitionLanguage(fromLanguage);
    for (String language : toLanguages) {
        speechTranslationConfig.addTargetLanguage(language);
    }

    try (TranslationRecognizer translationRecognizer = new TranslationRecognizer(speechTranslationConfig)) {
        System.out.printf("Say something in '%s' and we'll translate...", fromLanguage);

        TranslationRecognitionResult translationRecognitionResult = translationRecognizer.recognizeOnceAsync().get();
        if (translationRecognitionResult.getReason() == ResultReason.TranslatedSpeech) {
            System.out.printf("Recognized: \"%s\"\n", translationRecognitionResult.getText());
            for (Map.Entry<String, String> pair : translationRecognitionResult.getTranslations().entrySet()) {
                System.out.printf("Translated into '%s': %s\n", pair.getKey(), pair.getValue());
            }
        }
    }
}

For more information about speech to text, see the basics of speech recognition.

Synthesize translations

After a successful speech recognition and translation, the result contains all the translations in a dictionary. The getTranslations function returns a dictionary with the key as the target translation language and the value as the translated text. Recognized speech can be translated and then synthesized in a different language (speech-to-speech).

Event-based synthesis

The TranslationRecognizer object exposes a synthesizing event. The event fires several times and provides a mechanism to retrieve the synthesized audio from the translation recognition result. If you're translating to multiple languages, see Manual synthesis.

Specify the synthesis voice by calling the setVoiceName method, and provide an event handler for the synthesizing event to get the audio. The following example saves the translated audio as a .wav file.

Important

The event-based synthesis works only with a single translation. Do not add multiple target translation languages. Additionally, the setVoiceName value should be the same language as the target translation language. For example, "de" could map to "de-DE-Hedda".

static void translateSpeech() throws ExecutionException, FileNotFoundException, InterruptedException, IOException {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);

    String fromLanguage = "en-US";
    String toLanguage = "de";
    speechTranslationConfig.setSpeechRecognitionLanguage(fromLanguage);
    speechTranslationConfig.addTargetLanguage(toLanguage);

    // See: https://aka.ms/speech/sdkregion#standard-and-neural-voices
    speechTranslationConfig.setVoiceName("de-DE-Hedda");

    try (TranslationRecognizer translationRecognizer = new TranslationRecognizer(speechTranslationConfig)) {
        translationRecognizer.synthesizing.addEventListener((s, e) -> {
            byte[] audio = e.getResult().getAudio();
            int size = audio.length;
            System.out.println("Audio synthesized: " + size + " byte(s)" + (size == 0 ? "(COMPLETE)" : ""));

            if (size > 0) {
                try (FileOutputStream file = new FileOutputStream("translation.wav")) {
                    file.write(audio);
                } catch (IOException ex) {
                    ex.printStackTrace();
                }
            }
        });

        System.out.printf("Say something in '%s' and we'll translate...", fromLanguage);

        TranslationRecognitionResult translationRecognitionResult = translationRecognizer.recognizeOnceAsync().get();
        if (translationRecognitionResult.getReason() == ResultReason.TranslatedSpeech) {
            System.out.printf("Recognized: \"%s\"\n", translationRecognitionResult.getText());
            for (Map.Entry<String, String> pair : translationRecognitionResult.getTranslations().entrySet()) {
                String language = pair.getKey();
                String translation = pair.getValue();
                System.out.printf("Translated into '%s': %s\n", language, translation);
            }
        }
    }
}

Manual synthesis

The getTranslations function returns a dictionary that you can use to synthesize audio from the translation text. Iterate through each translation and synthesize it. When you're creating a SpeechSynthesizer instance, the SpeechConfig object needs to have the desired voice set by calling setSpeechSynthesisVoiceName.

The following example translates to five languages. Each translation is then synthesized to an audio file in the corresponding neural language.

static void translateSpeech() throws ExecutionException, InterruptedException {
    SpeechTranslationConfig speechTranslationConfig = SpeechTranslationConfig.fromSubscription(
        SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
    
    String fromLanguage = "en-US";
    String[] toLanguages = { "de", "en", "it", "pt", "zh-Hans" };
    speechTranslationConfig.setSpeechRecognitionLanguage(fromLanguage);
    for (String language : toLanguages) {
        speechTranslationConfig.addTargetLanguage(language);
    }

    try (TranslationRecognizer translationRecognizer = new TranslationRecognizer(speechTranslationConfig)) {
        System.out.printf("Say something in '%s' and we'll translate...", fromLanguage);

        TranslationRecognitionResult translationRecognitionResult = translationRecognizer.recognizeOnceAsync().get();
        if (translationRecognitionResult.getReason() == ResultReason.TranslatedSpeech) {
            // See: https://aka.ms/speech/sdkregion#standard-and-neural-voices
            Map<String, String> languageToVoiceMap = new HashMap<String, String>();
            languageToVoiceMap.put("de", "de-DE-KatjaNeural");
            languageToVoiceMap.put("en", "en-US-AriaNeural");
            languageToVoiceMap.put("it", "it-IT-ElsaNeural");
            languageToVoiceMap.put("pt", "pt-BR-FranciscaNeural");
            languageToVoiceMap.put("zh-Hans", "zh-CN-XiaoxiaoNeural");

            System.out.printf("Recognized: \"%s\"\n", translationRecognitionResult.getText());
            for (Map.Entry<String, String> pair : translationRecognitionResult.getTranslations().entrySet()) {
                String language = pair.getKey();
                String translation = pair.getValue();
                System.out.printf("Translated into '%s': %s\n", language, translation);

                SpeechConfig speechConfig =
                    SpeechConfig.fromSubscription(SPEECH__SUBSCRIPTION__KEY, SPEECH__SERVICE__REGION);
                speechConfig.setSpeechSynthesisVoiceName(languageToVoiceMap.get(language));

                AudioConfig audioConfig = AudioConfig.fromWavFileOutput(language + "-translation.wav");
                try (SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(speechConfig, audioConfig)) {
                    speechSynthesizer.SpeakTextAsync(translation).get();
                }
            }
        }
    }
}

For more information about speech synthesis, see the basics of speech synthesis.

Reference documentation | Package (npm) | Additional samples on GitHub | Library source code

In this how-to guide, you learn how to recognize human speech and translate it to another language.

See the speech translation overview for more information about:

  • Translating speech to text
  • Translating speech to multiple target languages
  • Performing direct speech to speech translation

Create a translation configuration

To call the translation service by using the Speech SDK, you need to create a SpeechTranslationConfig instance. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

You can initialize SpeechTranslationConfig in a few ways:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

Let's look at how you create a SpeechTranslationConfig instance by using a key and region. Get the Speech resource key and region in the Azure portal.

const speechTranslationConfig = SpeechTranslationConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");

Initialize a translator

After you've created a SpeechTranslationConfig instance, the next step is to initialize TranslationRecognizer. When you initialize TranslationRecognizer, you need to pass it your speechTranslationConfig instance. The configuration object provides the credentials that the translation service requires to validate your request.

If you're translating speech provided through your device's default microphone, here's what TranslationRecognizer should look like:

const translationRecognizer = new TranslationRecognizer(speechTranslationConfig);

If you want to specify the audio input device, then you need to create an AudioConfig class instance and provide the audioConfig parameter when initializing TranslationRecognizer.

Reference the AudioConfig object as follows:

const audioConfig = AudioConfig.fromDefaultMicrophoneInput();
const translationRecognizer = new TranslationRecognizer(speechTranslationConfig, audioConfig);

If you want to provide an audio file instead of using a microphone, you still need to provide an audioConfig parameter. However, you can do this only when you're targeting Node.js. When you create an AudioConfig class instance, instead of calling fromDefaultMicrophoneInput, you call fromWavFileInput and pass the filename parameter:

const audioConfig = AudioConfig.fromWavFileInput("YourAudioFile.wav");
const translationRecognizer = new TranslationRecognizer(speechTranslationConfig, audioConfig);

Translate speech

The TranslationRecognizer class for the Speech SDK for JavaScript exposes methods that you can use for speech translation:

  • Single-shot translation (async): Performs translation in a nonblocking (asynchronous) mode. It translates a single utterance. It determines the end of a single utterance by listening for silence at the end or until a maximum of 15 seconds of audio is processed.
  • Continuous translation (async): Asynchronously initiates a continuous translation operation. The user registers to events and handles various application states. To stop asynchronous continuous translation, call stopContinuousRecognitionAsync.

To learn more about how to choose a speech recognition mode, see Get started with speech to text.

Specify a target language

To translate, you must specify both a source language and at least one target language.

You can choose a source language by using a locale listed in the Speech translation table. Find your options for translated language at the same link.

Your options for target languages differ when you want to view text or you want to hear synthesized translated speech. To translate from English to German, modify the translation configuration object:

speechTranslationConfig.speechRecognitionLanguage = "en-US";
speechTranslationConfig.addTargetLanguage("de");

Single-shot recognition

Here's an example of asynchronous single-shot translation via recognizeOnceAsync:

translationRecognizer.recognizeOnceAsync(result => {
    // Interact with result
});

You need to write some code to handle the result. This sample gets the translation to German from the result:

translationRecognizer.recognizeOnceAsync(
  function (result) {
    let translation = result.translations.get("de");
    window.console.log(translation);
    translationRecognizer.close();
  },
  function (err) {
    window.console.log(err);
    translationRecognizer.close();
});

Your code can also handle updates provided while the translation is processing. You can use these updates to provide visual feedback about the translation progress. This JavaScript Node.js example shows these kinds of updates. The following code also displays details produced during the translation process:

translationRecognizer.recognizing = function (s, e) {
    var str = ("(recognizing) Reason: " + SpeechSDK.ResultReason[e.result.reason] +
            " Text: " +  e.result.text +
            " Translation:");
    str += e.result.translations.get("de");
    console.log(str);
};
translationRecognizer.recognized = function (s, e) {
    var str = "\r\n(recognized)  Reason: " + SpeechSDK.ResultReason[e.result.reason] +
            " Text: " + e.result.text +
            " Translation:";
    str += e.result.translations.get("de");
    str += "\r\n";
    console.log(str);
};

Continuous translation

Continuous translation is a bit more involved than single-shot recognition. It requires you to subscribe to the recognizing, recognized, and canceled events to get the recognition results. To stop translation, you must call stopContinuousRecognitionAsync.

Here's an example of how continuous translation is performed on an audio input file. Let's start by defining the input and initializing TranslationRecognizer:

const audioConfig = AudioConfig.fromWavFileInput("YourAudioFile.wav");
const translationRecognizer = new TranslationRecognizer(speechTranslationConfig, audioConfig);

In the following code, you subscribe to the events sent from TranslationRecognizer:

  • recognizing: Signal for events that contain intermediate translation results.
  • recognized: Signal for events that contain final translation results. These results indicate a successful translation attempt.
  • sessionStopped: Signal for events that indicate the end of a translation session (operation).
  • canceled: Signal for events that contain canceled translation results. These events indicate a translation attempt that was canceled as a result of a direct cancellation. Alternatively, they indicate a transport or protocol failure.

translationRecognizer.recognizing = (s, e) => {
    console.log(`TRANSLATING: Text=${e.result.text}`);
};
translationRecognizer.recognized = (s, e) => {
    if (e.result.reason == ResultReason.TranslatedSpeech) {
        console.log(`TRANSLATED: Text=${e.result.text}`);
        console.log(`TRANSLATION: ${e.result.translations.get("de")}`);
    }
    else if (e.result.reason == ResultReason.NoMatch) {
        console.log("NOMATCH: Speech could not be translated.");
    }
};
translationRecognizer.canceled = (s, e) => {
    console.log(`CANCELED: Reason=${e.reason}`);
    if (e.reason == CancellationReason.Error) {
        console.log(`CANCELED: ErrorCode=${e.errorCode}`);
        console.log(`CANCELED: ErrorDetails=${e.errorDetails}`);
        console.log("CANCELED: Did you set the speech resource key and region values?");
    }
    translationRecognizer.stopContinuousRecognitionAsync();
};
translationRecognizer.sessionStopped = (s, e) => {
    console.log("\n    Session stopped event.");
    translationRecognizer.stopContinuousRecognitionAsync();
};

With everything set up, you can call startContinuousRecognitionAsync:

// Starts continuous recognition. Use stopContinuousRecognitionAsync() to stop recognition.
translationRecognizer.startContinuousRecognitionAsync();
// Make this call at some later point to stop recognition:
// translationRecognizer.stopContinuousRecognitionAsync();

Choose a source language

A common task for speech translation is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, find your SpeechTranslationConfig instance and add the following line directly below it:

speechTranslationConfig.speechRecognitionLanguage = "it-IT";

The speechRecognitionLanguage property expects a language-locale format string. Refer to the list of supported speech translation locales.

Choose one or more target languages

The Speech SDK can translate to multiple target languages in parallel. The available target languages are somewhat different from the source language list. You specify target languages by using a language code, rather than a locale.

For a list of language codes for text targets, see the speech translation table on the language support page. You can also find details there about translation into synthesized speech.

The following code adds German as a target language:

speechTranslationConfig.addTargetLanguage("de");

Because multiple target language translations are possible, your code must specify the target language when examining the result. The following code gets translation results for German:

translationRecognizer.recognized = function (s, e) {
    var str = "\r\n(recognized)  Reason: " +
            sdk.ResultReason[e.result.reason] +
            " Text: " + e.result.text + " Translations:";
    var language = "de";
    str += " [" + language + "] " + e.result.translations.get(language);
    str += "\r\n";
    console.log(str);
};

Reference documentation | Package (download) | Additional samples on GitHub

The Speech SDK for Objective-C does support speech translation, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts, or see the Objective-C reference and samples linked from the beginning of this article.

Reference documentation | Package (download) | Additional samples on GitHub

The Speech SDK for Swift does support speech translation, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts, or see the Swift reference and samples linked from the beginning of this article.

Reference documentation | Package (PyPi) | Additional samples on GitHub

In this how-to guide, you learn how to recognize human speech and translate it to another language.

See the speech translation overview for more information about:

  • Translating speech to text
  • Translating speech to multiple target languages
  • Performing direct speech to speech translation

Sensitive data and environment variables

The example source code in this article depends on environment variables for storing sensitive data, such as the Speech resource's subscription key and region. The Python code file contains two values that are assigned from the host machine's environment variables: SPEECH__SUBSCRIPTION__KEY and SPEECH__SERVICE__REGION. Both of these variables are at the global scope, so they're accessible within the function definition of the code file:

speech_key, service_region = os.environ['SPEECH__SUBSCRIPTION__KEY'], os.environ['SPEECH__SERVICE__REGION']

For more information on environment variables, see Environment variables and application configuration.

Important

If you use an API key, store it securely somewhere else, such as in Azure Key Vault. Don't include the API key directly in your code, and never post it publicly.

For more information about AI services security, see Authenticate requests to Azure AI services.

Create a speech translation configuration

To call the Speech service by using the Speech SDK, you need to create a SpeechTranslationConfig instance. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Tip

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

You can initialize SpeechTranslationConfig in a few ways:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.
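
The rest of this article uses the key and region form shown next. If you want to try the endpoint or authorization token options listed above, here's a minimal sketch; the keyword arguments (endpoint, auth_token, and so on) are assumptions carried over from the SDK's SpeechConfig constructor, so confirm them in the reference documentation before relying on them:

import azure.cognitiveservices.speech as speechsdk

# Sketch only: alternative ways to initialize SpeechTranslationConfig.
# The endpoint and auth_token keyword arguments are assumptions based on the
# SpeechConfig constructor; replace the placeholder values with real ones.
endpoint_translation_config = speechsdk.translation.SpeechTranslationConfig(
        endpoint="YourServiceEndpoint", subscription=speech_key)

token_translation_config = speechsdk.translation.SpeechTranslationConfig(
        auth_token="YourAuthorizationToken", region=service_region)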

Let's look at how you can create a SpeechTranslationConfig instance by using a key and region. Get the Speech resource key and region in the Azure portal.

from_language, to_language = 'en-US', 'de'

def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
            subscription=speech_key, region=service_region)

Change the source language

One common task of speech translation is specifying the input (or source) language. The following example shows how you would change the input language to Italian. In your code, set the speech_recognition_language property on the SpeechTranslationConfig instance:

def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
            subscription=speech_key, region=service_region)

    # Source (input) language
    from_language = "it-IT"
    translation_config.speech_recognition_language = from_language

The speech_recognition_language property expects a language-locale format string. Refer to the list of supported speech translation locales.

Add a translation language

Another common task of speech translation is to specify target translation languages. At least one is required, but multiple target languages are supported. The following code snippet sets both French and German as translation language targets:

def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
            subscription=speech_key, region=service_region)

    translation_config.speech_recognition_language = "it-IT"

    # Translate to languages. See, https://aka.ms/speech/sttt-languages
    translation_config.add_target_language("fr")
    translation_config.add_target_language("de")

With every call to add_target_language, a new target translation language is specified. In other words, when speech is recognized from the source language, each target translation is available as part of the resulting translation operation.
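
For example, with French and German configured as shown above, each added target appears as a key in the result's translations dictionary. Here's a minimal sketch of reading those values; it assumes a translation_recognizer like the one created in the next section and a result returned by recognize_once:

# Sketch only: every language passed to add_target_language becomes a key in the
# translations dictionary of a successful result.
translation_recognition_result = translation_recognizer.recognize_once()

if translation_recognition_result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print(f'Recognized: "{translation_recognition_result.text}"')
    print(f'French: {translation_recognition_result.translations["fr"]}')
    print(f'German: {translation_recognition_result.translations["de"]}')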

Initialize a translation recognizer

After you've created a SpeechTranslationConfig instance, the next step is to initialize TranslationRecognizer. When you initialize TranslationRecognizer, you need to pass it your translation_config instance. The configuration object provides the credentials that the Speech service requires to validate your request.

If you're recognizing speech by using your device's default microphone, here's what TranslationRecognizer should look like:

def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
            subscription=speech_key, region=service_region)

    translation_config.speech_recognition_language = from_language
    translation_config.add_target_language(to_language)

    translation_recognizer = speechsdk.translation.TranslationRecognizer(
            translation_config=translation_config)

If you want to specify the audio input device, then you need to create an AudioConfig class instance and provide the audio_config parameter when initializing TranslationRecognizer.

First, create the AudioConfig object as follows:

def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
            subscription=speech_key, region=service_region)

    translation_config.speech_recognition_language = from_language
    for lang in to_languages:
        translation_config.add_target_language(lang)

    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    translation_recognizer = speechsdk.translation.TranslationRecognizer(
            translation_config=translation_config, audio_config=audio_config)

If you want to provide an audio file instead of using a microphone, you still need to provide an audio_config parameter. However, when you create the AudioConfig instance, instead of passing use_default_microphone=True, you pass the filename parameter with filename="path-to-file.wav":

def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
            subscription=speech_key, region=service_region)

    translation_config.speech_recognition_language = from_language
    for lang in to_languages:
        translation_config.add_target_language(lang)

    audio_config = speechsdk.audio.AudioConfig(filename="path-to-file.wav")
    translation_recognizer = speechsdk.translation.TranslationRecognizer(
            translation_config=translation_config, audio_config=audio_config)

Translate speech

To translate speech, the Speech SDK relies on a microphone or an audio file input. Speech recognition occurs before speech translation. After all objects are initialized, call recognize_once and get the result:

import os
import azure.cognitiveservices.speech as speechsdk

speech_key, service_region = os.environ['SPEECH__SUBSCRIPTION__KEY'], os.environ['SPEECH__SERVICE__REGION']
from_language, to_language = 'en-US', 'de'

def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
            subscription=speech_key, region=service_region)

    translation_config.speech_recognition_language = from_language
    translation_config.add_target_language(to_language)

    translation_recognizer = speechsdk.translation.TranslationRecognizer(
            translation_config=translation_config)
    
    print('Say something...')
    translation_recognition_result = translation_recognizer.recognize_once()
    print(get_result_text(reason=translation_recognition_result.reason, translation_recognition_result=translation_recognition_result))

def get_result_text(reason, translation_recognition_result):
    reason_format = {
        speechsdk.ResultReason.TranslatedSpeech:
            f'RECOGNIZED "{from_language}": {translation_recognition_result.text}\n' +
            f'TRANSLATED into "{to_language}": {translation_recognition_result.translations[to_language]}',
        speechsdk.ResultReason.RecognizedSpeech: f'Recognized: "{translation_recognition_result.text}"',
        speechsdk.ResultReason.NoMatch: f'No speech could be recognized: {translation_recognition_result.no_match_details}',
        speechsdk.ResultReason.Canceled: f'Speech Recognition canceled: {translation_recognition_result.cancellation_details}'
    }
    return reason_format.get(reason, 'Unable to recognize speech')

translate_speech_to_text()

For more information about speech to text, see the basics of speech recognition.

Synthesize translations

After a successful speech recognition and translation, the result contains all the translations in a dictionary. The translations dictionary key is the target translation language, and the value is the translated text. Recognized speech can be translated and then synthesized in a different language (speech-to-speech).

Event-based synthesis

The TranslationRecognizer object exposes a Synthesizing event. The event fires several times and provides a mechanism to retrieve the synthesized audio from the translation recognition result. If you're translating to multiple languages, see Manual synthesis.

Specify the synthesis voice by setting the voice_name property, and provide an event handler for the synthesizing event to get the audio. The following example saves the translated audio as a .wav file.

Important

The event-based synthesis works only with a single translation. Do not add multiple target translation languages. Additionally, the voice_name value should be the same language as the target translation language. For example, "de" could map to "de-DE-Hedda".

import os
import azure.cognitiveservices.speech as speechsdk

speech_key, service_region = os.environ['SPEECH__SUBSCRIPTION__KEY'], os.environ['SPEECH__SERVICE__REGION']
from_language, to_language = 'en-US', 'de'

def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
            subscription=speech_key, region=service_region)

    translation_config.speech_recognition_language = from_language
    translation_config.add_target_language(to_language)

    # See: https://aka.ms/speech/sdkregion#standard-and-neural-voices
    translation_config.voice_name = "de-DE-Hedda"

    translation_recognizer = speechsdk.translation.TranslationRecognizer(
            translation_config=translation_config)

    def synthesis_callback(evt):
        size = len(evt.result.audio)
        print(f'Audio synthesized: {size} byte(s) {"(COMPLETED)" if size == 0 else ""}')

        if size > 0:
            file = open('translation.wav', 'wb+')
            file.write(evt.result.audio)
            file.close()

    translation_recognizer.synthesizing.connect(synthesis_callback)

    print(f'Say something in "{from_language}" and we\'ll translate into "{to_language}".')

    translation_recognition_result = translation_recognizer.recognize_once()
    print(get_result_text(reason=translation_recognition_result.reason, translation_recognition_result=translation_recognition_result))

def get_result_text(reason, translation_recognition_result):
    reason_format = {
        speechsdk.ResultReason.TranslatedSpeech:
            f'Recognized "{from_language}": {translation_recognition_result.text}\n' +
            f'Translated into "{to_language}": {translation_recognition_result.translations[to_language]}',
        speechsdk.ResultReason.RecognizedSpeech: f'Recognized: "{translation_recognition_result.text}"',
        speechsdk.ResultReason.NoMatch: f'No speech could be recognized: {translation_recognition_result.no_match_details}',
        speechsdk.ResultReason.Canceled: f'Speech Recognition canceled: {translation_recognition_result.cancellation_details}'
    }
    return reason_format.get(reason, 'Unable to recognize speech')

translate_speech_to_text()

Manual synthesis

You can use the translations dictionary to synthesize audio from the translation text. Iterate through each translation and synthesize it. When you're creating a SpeechSynthesizer instance, the SpeechConfig object needs to have its speech_synthesis_voice_name property set to the desired voice.

The following example translates to five languages. Each translation is then synthesized to an audio file by using the corresponding neural voice.

import os
import azure.cognitiveservices.speech as speechsdk

speech_key, service_region = os.environ['SPEECH__SUBSCRIPTION__KEY'], os.environ['SPEECH__SERVICE__REGION']
from_language, to_languages = 'en-US', [ 'de', 'en', 'it', 'pt', 'zh-Hans' ]

def translate_speech_to_text():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
            subscription=speech_key, region=service_region)

    translation_config.speech_recognition_language = from_language
    for lang in to_languages:
        translation_config.add_target_language(lang)

    translation_recognizer = speechsdk.translation.TranslationRecognizer(
            translation_config=translation_config)
    
    print('Say something...')
    translation_recognition_result = translation_recognizer.recognize_once()
    synthesize_translations(translation_recognition_result=translation_recognition_result)

def synthesize_translations(translation_recognition_result):
    language_to_voice_map = {
        "de": "de-DE-KatjaNeural",
        "en": "en-US-AriaNeural",
        "it": "it-IT-ElsaNeural",
        "pt": "pt-BR-FranciscaNeural",
        "zh-Hans": "zh-CN-XiaoxiaoNeural"
    }
    print(f'Recognized: "{translation_recognition_result.text}"')

    for language in translation_recognition_result.translations:
        translation = translation_recognition_result.translations[language]
        print(f'Translated into "{language}": {translation}')

        speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
        speech_config.speech_synthesis_voice_name = language_to_voice_map.get(language)
        
        audio_config = speechsdk.audio.AudioOutputConfig(filename=f'{language}-translation.wav')
        speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
        speech_synthesizer.speak_text_async(translation).get()

translate_speech_to_text()

For more information about speech synthesis, see the basics of speech synthesis.

Multi-lingual translation with language identification

In many scenarios, you might not know which input languages to specify. By using language identification, you can detect up to 10 possible input languages and automatically translate to your target languages.
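
Here's a minimal sketch of what that can look like in Python. The AutoDetectSourceLanguageConfig class and the auto_detect_source_language_config parameter are assumptions based on the SDK's language identification support, and a full solution may require additional configuration (such as a specific endpoint), so treat the sample linked below as the authoritative version:

import os
import azure.cognitiveservices.speech as speechsdk

speech_key, service_region = os.environ['SPEECH__SUBSCRIPTION__KEY'], os.environ['SPEECH__SERVICE__REGION']

def translate_with_language_id():
    translation_config = speechsdk.translation.SpeechTranslationConfig(
            subscription=speech_key, region=service_region)
    for lang in ['de', 'fr']:
        translation_config.add_target_language(lang)

    # Candidate input languages (the article notes that up to 10 are possible).
    auto_detect_source_language_config = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
            languages=["en-US", "it-IT", "es-ES"])

    # Assumption: TranslationRecognizer accepts auto_detect_source_language_config.
    translation_recognizer = speechsdk.translation.TranslationRecognizer(
            translation_config=translation_config,
            auto_detect_source_language_config=auto_detect_source_language_config)

    translation_recognition_result = translation_recognizer.recognize_once()
    if translation_recognition_result.reason == speechsdk.ResultReason.TranslatedSpeech:
        for language in translation_recognition_result.translations:
            print(f'Translated into "{language}": {translation_recognition_result.translations[language]}')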

For a complete code sample, see language identification.

Speech to text REST API reference | Speech to text REST API for short audio reference | Additional samples on GitHub

You can use the REST API for speech translation, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts.

In this how-to guide, you learn how to recognize human speech and translate it to another language.

See the speech translation overview for more information about:

  • Translating speech to text
  • Translating speech to multiple target languages
  • Performing direct speech to speech translation

Prerequisites

  • An Azure subscription. You can create one for free.
  • Create a Speech resource in the Azure portal.
  • Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.

Download and install

Follow these steps and see the Speech CLI quickstart for other requirements for your platform.

  1. Run the following .NET CLI command to install the Speech CLI:

    dotnet tool install --global Microsoft.CognitiveServices.Speech.CLI
    
  2. Run the following commands to configure your Speech resource key and region. Replace SUBSCRIPTION-KEY with your Speech resource key and replace REGION with your Speech resource region.

    spx config @key --set SUBSCRIPTION-KEY
    spx config @region --set REGION
    

Set source and target languages

This command calls the Speech CLI to translate speech from the microphone from Italian to French:

spx translate --microphone --source it-IT --target fr

Next steps