Azure OpenAI speech to speech chat

Reference documentation | Package (NuGet) | Additional Samples on GitHub

Important

To complete the steps in this guide, your Azure subscription must be granted access to Azure OpenAI Service. Currently, access to this service is granted only by application. You can apply for access to Azure OpenAI by completing the form at https://aka.ms/oai/access.

In this how-to guide, you can use Azure AI Speech to converse with Azure OpenAI Service. The text recognized by the Speech service is sent to Azure OpenAI. The text response from Azure OpenAI is then synthesized by the Speech service.

Speak into the microphone to start a conversation with Azure OpenAI.

  • The Speech service recognizes your speech and converts it into text (speech to text).
  • Your request as text is sent to Azure OpenAI.
  • The Speech service text to speech (TTS) feature synthesizes the response from Azure OpenAI to the default speaker.

Although the experience of this example is a back-and-forth exchange, Azure OpenAI doesn't remember the context of your conversation.

Prerequisites

Set up the environment

The Speech SDK is available as a NuGet package and implements .NET Standard 2.0. You install the Speech SDK later in this guide, but first check the SDK installation guide for any other requirements.

Set environment variables

This example requires environment variables named OPEN_AI_KEY, OPEN_AI_ENDPOINT, SPEECH_KEY, and SPEECH_REGION.

Your application must be authenticated to access Azure AI services resources. For production, use a secure way of storing and accessing your credentials. For example, after you get a key for your Speech resource, write it to a new environment variable on the local machine running the application.

Tip

Don't include the key directly in your code, and never post it publicly. See the Azure AI services security article for more authentication options like Azure Key Vault.

To set the environment variables, open a console window, and follow the instructions for your operating system and development environment.

  • To set the OPEN_AI_KEY environment variable, replace your-openai-key with one of the keys for your resource.
  • To set the OPEN_AI_ENDPOINT environment variable, replace your-openai-endpoint with the endpoint for your resource.
  • To set the SPEECH_KEY environment variable, replace your-speech-key with one of the keys for your resource.
  • To set the SPEECH_REGION environment variable, replace your-speech-region with one of the regions for your resource.
setx OPEN_AI_KEY your-openai-key
setx OPEN_AI_ENDPOINT your-openai-endpoint
setx SPEECH_KEY your-speech-key
setx SPEECH_REGION your-speech-region
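The setx commands above apply to Windows. On Linux and macOS with a bash-like shell (an assumption about your environment), the equivalent is export; add the lines to your shell profile (for example, ~/.bashrc) to persist them across sessions:

```shell
# Placeholder values: replace with the keys, endpoint, and region for your resources.
export OPEN_AI_KEY=your-openai-key
export OPEN_AI_ENDPOINT=your-openai-endpoint
export SPEECH_KEY=your-speech-key
export SPEECH_REGION=your-speech-region
```

Run `source ~/.bashrc` or open a new terminal for the profile change to take effect.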

Note

If you only need to access the environment variable in the current running console, you can set the environment variable with set instead of setx.

After you add the environment variables, you might need to restart any running programs that need to read them, including the console window. For example, if you're using Visual Studio as your editor, restart Visual Studio before running the example.

Recognize speech from a microphone

Follow these steps to create a new console application.

  1. Open a command prompt where you want the new project, and create a console application with the .NET CLI. The Program.cs file should be created in the project directory.

    dotnet new console
    
  2. Install the Speech SDK in your new project with the .NET CLI.

    dotnet add package Microsoft.CognitiveServices.Speech
    
  3. Install the Azure OpenAI SDK (prerelease) in your new project with the .NET CLI.

    dotnet add package Azure.AI.OpenAI --prerelease 
    
  4. Replace the contents of Program.cs with the following code.

    using System;
    using System.IO;
    using System.Threading.Tasks;
    using Microsoft.CognitiveServices.Speech;
    using Microsoft.CognitiveServices.Speech.Audio;
    using Azure;
    using Azure.AI.OpenAI;
    using static System.Environment;
    
    class Program 
    {
        // This example requires environment variables named "OPEN_AI_KEY" and "OPEN_AI_ENDPOINT"
        // Your endpoint should look like the following https://YOUR_OPEN_AI_RESOURCE_NAME.openai.azure.com/
        static string openAIKey = Environment.GetEnvironmentVariable("OPEN_AI_KEY");
        static string openAIEndpoint = Environment.GetEnvironmentVariable("OPEN_AI_ENDPOINT");
    
        // Enter the deployment name you chose when you deployed the model.
        static string engine = "text-davinci-003";
    
        // This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
        static string speechKey = Environment.GetEnvironmentVariable("SPEECH_KEY");
        static string speechRegion = Environment.GetEnvironmentVariable("SPEECH_REGION");
    
        // Prompts Azure OpenAI with a request and synthesizes the response.
        async static Task AskOpenAI(string prompt)
        {
            // Ask Azure OpenAI
            OpenAIClient client = new(new Uri(openAIEndpoint), new AzureKeyCredential(openAIKey));
            var completionsOptions = new CompletionsOptions()
            {
                Prompts = { prompt },
                MaxTokens = 100,
            };
            Response<Completions> completionsResponse = client.GetCompletions(engine, completionsOptions);
            string text = completionsResponse.Value.Choices[0].Text.Trim();
            Console.WriteLine($"Azure OpenAI response: {text}");
    
            var speechConfig = SpeechConfig.FromSubscription(speechKey, speechRegion);
            // The language of the voice that speaks.
            speechConfig.SpeechSynthesisVoiceName = "en-US-JennyMultilingualNeural"; 
            var audioOutputConfig = AudioConfig.FromDefaultSpeakerOutput();
    
            using (var speechSynthesizer = new SpeechSynthesizer(speechConfig, audioOutputConfig))
            {
                var speechSynthesisResult = await speechSynthesizer.SpeakTextAsync(text).ConfigureAwait(true);
    
                if (speechSynthesisResult.Reason == ResultReason.SynthesizingAudioCompleted)
                {
                    Console.WriteLine($"Speech synthesized to speaker for text: [{text}]");
                }
                else if (speechSynthesisResult.Reason == ResultReason.Canceled)
                {
                    var cancellationDetails = SpeechSynthesisCancellationDetails.FromResult(speechSynthesisResult);
                    Console.WriteLine($"Speech synthesis canceled: {cancellationDetails.Reason}");
    
                    if (cancellationDetails.Reason == CancellationReason.Error)
                    {
                        Console.WriteLine($"Error details: {cancellationDetails.ErrorDetails}");
                    }
                }
            }
        }
    
        // Continuously listens for speech input to recognize and send as text to Azure OpenAI
        async static Task ChatWithOpenAI()
        {
            // Should be the locale for the speaker's language.
            var speechConfig = SpeechConfig.FromSubscription(speechKey, speechRegion);        
            speechConfig.SpeechRecognitionLanguage = "en-US";
    
            using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
            using var speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig);
            var conversationEnded = false;
    
            while(!conversationEnded)
            {
                Console.WriteLine("Azure OpenAI is listening. Say 'Stop' or press Ctrl-Z to end the conversation.");
    
                // Get audio from the microphone and then send it to the TTS service.
                var speechRecognitionResult = await speechRecognizer.RecognizeOnceAsync();           
    
                switch (speechRecognitionResult.Reason)
                {
                    case ResultReason.RecognizedSpeech:
                        if (speechRecognitionResult.Text == "Stop.")
                        {
                            Console.WriteLine("Conversation ended.");
                            conversationEnded = true;
                        }
                        else
                        {
                            Console.WriteLine($"Recognized speech: {speechRecognitionResult.Text}");
                            await AskOpenAI(speechRecognitionResult.Text).ConfigureAwait(true);
                        }
                        break;
                    case ResultReason.NoMatch:
                        Console.WriteLine("No speech could be recognized.");
                        break;
                    case ResultReason.Canceled:
                        var cancellationDetails = CancellationDetails.FromResult(speechRecognitionResult);
                        Console.WriteLine($"Speech Recognition canceled: {cancellationDetails.Reason}");
                        if (cancellationDetails.Reason == CancellationReason.Error)
                        {
                            Console.WriteLine($"Error details={cancellationDetails.ErrorDetails}");
                        }
                        break;
                }
            }
        }
    
        async static Task Main(string[] args)
        {
            try
            {
                await ChatWithOpenAI().ConfigureAwait(true);
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
        }
    }
    
  5. To increase or decrease the number of tokens returned by Azure OpenAI, change the MaxTokens property in the CompletionsOptions class instance. For more information about tokens and cost implications, see Azure OpenAI tokens and Azure OpenAI pricing.

Run your new console application to start speech recognition from a microphone:

dotnet run

Important

Make sure that you set the OPEN_AI_KEY, OPEN_AI_ENDPOINT, SPEECH_KEY, and SPEECH_REGION environment variables as described previously. If you don't set these variables, the sample fails with an error message.

Speak into your microphone when prompted. The console output includes the prompt for you to begin speaking, then your request as text, and then the response from Azure OpenAI as text. The response from Azure OpenAI should be converted from text to speech and then output to the default speaker.

PS C:\dev\openai\csharp> dotnet run
Azure OpenAI is listening. Say 'Stop' or press Ctrl-Z to end the conversation.
Recognized speech: Make a comma separated list of all continents.
Azure OpenAI response: Africa, Antarctica, Asia, Australia, Europe, North America, South America
Speech synthesized to speaker for text: [Africa, Antarctica, Asia, Australia, Europe, North America, South America]
Azure OpenAI is listening. Say 'Stop' or press Ctrl-Z to end the conversation.
Recognized speech: Make a comma separated list of 1 Astronomical observatory for each continent. A list should include each continent name in parentheses.
Azure OpenAI response: Mauna Kea Observatories (North America), La Silla Observatory (South America), Tenerife Observatory (Europe), Siding Spring Observatory (Australia), Beijing Xinglong Observatory (Asia), Naukluft Plateau Observatory (Africa), Rutherford Appleton Laboratory (Antarctica)
Speech synthesized to speaker for text: [Mauna Kea Observatories (North America), La Silla Observatory (South America), Tenerife Observatory (Europe), Siding Spring Observatory (Australia), Beijing Xinglong Observatory (Asia), Naukluft Plateau Observatory (Africa), Rutherford Appleton Laboratory (Antarctica)]
Azure OpenAI is listening. Say 'Stop' or press Ctrl-Z to end the conversation.
Conversation ended.
PS C:\dev\openai\csharp>

Remarks

Now that you've completed the quickstart, here are some more considerations:

  • To change the speech recognition language, replace en-US with another supported language. For example, es-ES for Spanish (Spain). The default language is en-US if you don't specify a language. For details about how to identify one of multiple languages that might be spoken, see language identification.
  • To change the voice that you hear, replace en-US-JennyMultilingualNeural with another supported voice. If the voice doesn't speak the language of the text returned from Azure OpenAI, the Speech service doesn't output synthesized audio.
  • To use a different model, replace text-davinci-003 with the ID of another deployment. Keep in mind that the deployment ID isn't necessarily the same as the model name. You named your deployment when you created it in Azure OpenAI Studio.
  • Azure OpenAI also performs content moderation on the prompt inputs and generated outputs. The prompts or responses may be filtered if harmful content is detected. For more information, see the content filtering article.

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Reference documentation | Package (PyPi) | Additional Samples on GitHub

Important

To complete the steps in this guide, your Azure subscription must be granted access to Azure OpenAI Service. Currently, access to this service is granted only by application. You can apply for access to Azure OpenAI by completing the form at https://aka.ms/oai/access.

In this how-to guide, you can use Azure AI Speech to converse with Azure OpenAI Service. The text recognized by the Speech service is sent to Azure OpenAI. The text response from Azure OpenAI is then synthesized by the Speech service.

Speak into the microphone to start a conversation with Azure OpenAI.

  • The Speech service recognizes your speech and converts it into text (speech to text).
  • Your request as text is sent to Azure OpenAI.
  • The Speech service text to speech (TTS) feature synthesizes the response from Azure OpenAI to the default speaker.

Although the experience of this example is a back-and-forth exchange, Azure OpenAI doesn't remember the context of your conversation.

Prerequisites

Set up the environment

The Speech SDK for Python is available as a Python Package Index (PyPI) module. The Speech SDK for Python is compatible with Windows, Linux, and macOS.

Install Python 3.7 or later. First check the SDK installation guide for any other requirements.

This example also uses the os and json modules, which are included in the Python standard library, and the requests library, which you can install with pip if it isn't already present.

Set environment variables

This example requires environment variables named OPEN_AI_KEY, OPEN_AI_ENDPOINT, SPEECH_KEY, and SPEECH_REGION.

Your application must be authenticated to access Azure AI services resources. For production, use a secure way of storing and accessing your credentials. For example, after you get a key for your Speech resource, write it to a new environment variable on the local machine running the application.

Tip

Don't include the key directly in your code, and never post it publicly. See the Azure AI services security article for more authentication options like Azure Key Vault.

To set the environment variables, open a console window, and follow the instructions for your operating system and development environment.

  • To set the OPEN_AI_KEY environment variable, replace your-openai-key with one of the keys for your resource.
  • To set the OPEN_AI_ENDPOINT environment variable, replace your-openai-endpoint with the endpoint for your resource.
  • To set the SPEECH_KEY environment variable, replace your-speech-key with one of the keys for your resource.
  • To set the SPEECH_REGION environment variable, replace your-speech-region with one of the regions for your resource.
setx OPEN_AI_KEY your-openai-key
setx OPEN_AI_ENDPOINT your-openai-endpoint
setx SPEECH_KEY your-speech-key
setx SPEECH_REGION your-speech-region
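The setx commands above apply to Windows. On Linux and macOS with a bash-like shell (an assumption about your environment), the equivalent is export; add the lines to your shell profile (for example, ~/.bashrc) to persist them across sessions:

```shell
# Placeholder values: replace with the keys, endpoint, and region for your resources.
export OPEN_AI_KEY=your-openai-key
export OPEN_AI_ENDPOINT=your-openai-endpoint
export SPEECH_KEY=your-speech-key
export SPEECH_REGION=your-speech-region
```

Run `source ~/.bashrc` or open a new terminal for the profile change to take effect.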

Note

If you only need to access the environment variable in the current running console, you can set the environment variable with set instead of setx.

After you add the environment variables, you might need to restart any running programs that need to read them, including the console window. For example, if you're using Visual Studio as your editor, restart Visual Studio before running the example.

Recognize speech from a microphone

Follow these steps to create a new console application.

  1. Open a command prompt where you want the new project, and create a new file named openai-speech.py.

  2. Run this command to install the Speech SDK:

    pip install azure-cognitiveservices-speech
    
  3. Run this command to install the OpenAI SDK:

    pip install openai
    

    Note

    This library is maintained by OpenAI (not Microsoft Azure). Refer to the release history or the version.py commit history to track the latest updates to the library.

  4. Copy the following code into openai-speech.py:

    import os
    import azure.cognitiveservices.speech as speechsdk
    import openai
    
    # This example requires environment variables named "OPEN_AI_KEY" and "OPEN_AI_ENDPOINT"
    # Your endpoint should look like the following https://YOUR_OPEN_AI_RESOURCE_NAME.openai.azure.com/
    openai.api_key = os.environ.get('OPEN_AI_KEY')
    openai.api_base =  os.environ.get('OPEN_AI_ENDPOINT')
    openai.api_type = 'azure'
    openai.api_version = '2022-12-01'
    
    # This will correspond to the custom name you chose for your deployment when you deployed a model.
    deployment_id='text-davinci-003' 
    
    # This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
    speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))
    audio_output_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    
    # Should be the locale for the speaker's language.
    speech_config.speech_recognition_language="en-US"
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    
    # The language of the voice that responds on behalf of Azure OpenAI.
    speech_config.speech_synthesis_voice_name='en-US-JennyMultilingualNeural'
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_output_config)
    
    # Prompts Azure OpenAI with a request and synthesizes the response.
    def ask_openai(prompt):
    
        # Ask Azure OpenAI
        response = openai.Completion.create(engine=deployment_id, prompt=prompt, max_tokens=100)
        text = response['choices'][0]['text'].replace('\n', ' ').replace(' .', '.').strip()
        print('Azure OpenAI response:' + text)
    
        # Azure text to speech output
        speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()
    
        # Check result
        if speech_synthesis_result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            print("Speech synthesized to speaker for text [{}]".format(text))
        elif speech_synthesis_result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = speech_synthesis_result.cancellation_details
            print("Speech synthesis canceled: {}".format(cancellation_details.reason))
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                print("Error details: {}".format(cancellation_details.error_details))
    
    # Continuously listens for speech input to recognize and send as text to Azure OpenAI
    def chat_with_open_ai():
        while True:
            print("Azure OpenAI is listening. Say 'Stop' or press Ctrl-Z to end the conversation.")
            try:
                # Get audio from the microphone and then send it to the TTS service.
                speech_recognition_result = speech_recognizer.recognize_once_async().get()
    
                # If speech is recognized, send it to Azure OpenAI and listen for the response.
                if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
                    if speech_recognition_result.text == "Stop.": 
                        print("Conversation ended.")
                        break
                    print("Recognized speech: {}".format(speech_recognition_result.text))
                    ask_openai(speech_recognition_result.text)
                elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
                    print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
                    break
                elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
                    cancellation_details = speech_recognition_result.cancellation_details
                    print("Speech Recognition canceled: {}".format(cancellation_details.reason))
                    if cancellation_details.reason == speechsdk.CancellationReason.Error:
                        print("Error details: {}".format(cancellation_details.error_details))
            except EOFError:
                break
    
    # Main
    
    try:
        chat_with_open_ai()
    except Exception as err:
        print("Encountered exception. {}".format(err))
    
  5. To increase or decrease the number of tokens returned by Azure OpenAI, change the max_tokens parameter. For more information about tokens and cost implications, see Azure OpenAI tokens and Azure OpenAI pricing.
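To get a feel for how a max_tokens budget maps to text length, the sketch below estimates a token count with a crude word-based heuristic. The 0.75 words-per-token ratio is an assumption that holds only roughly for English prose; the service's actual tokenizer is authoritative, especially for code or heavy punctuation.

```python
def rough_token_count(text: str) -> int:
    # Crude heuristic: English prose averages roughly 0.75 words per token.
    # This is only for ballpark sizing of the max_tokens budget; use the
    # service's real tokenizer for exact counts.
    return max(1, round(len(text.split()) / 0.75))

# With max_tokens=100, roughly 75 English words of response fit in the budget.
print(rough_token_count("Make a comma separated list of all continents."))
```

If a synthesized response sounds cut off mid-sentence, the completion likely hit the max_tokens limit, so increasing the budget is a reasonable first step.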

Run your new console application to start speech recognition from a microphone:

python openai-speech.py

Important

Make sure that you set the OPEN_AI_KEY, OPEN_AI_ENDPOINT, SPEECH_KEY, and SPEECH_REGION environment variables as described previously. If you don't set these variables, the sample fails with an error message.

Speak into your microphone when prompted. The console output includes the prompt for you to begin speaking, then your request as text, and then the response from Azure OpenAI as text. The response from Azure OpenAI should be converted from text to speech and then output to the default speaker.

PS C:\dev\openai\python> python.exe .\openai-speech.py
Azure OpenAI is listening. Say 'Stop' or press Ctrl-Z to end the conversation.
Recognized speech: Make a comma separated list of all continents.
Azure OpenAI response:Africa, Antarctica, Asia, Australia, Europe, North America, South America
Speech synthesized to speaker for text [Africa, Antarctica, Asia, Australia, Europe, North America, South America]
Azure OpenAI is listening. Say 'Stop' or press Ctrl-Z to end the conversation.
Recognized speech: Make a comma separated list of 1 Astronomical observatory for each continent. A list should include each continent name in parentheses.
Azure OpenAI response:Mauna Kea Observatories (North America), La Silla Observatory (South America), Tenerife Observatory (Europe), Siding Spring Observatory (Australia), Beijing Xinglong Observatory (Asia), Naukluft Plateau Observatory (Africa), Rutherford Appleton Laboratory (Antarctica)
Speech synthesized to speaker for text [Mauna Kea Observatories (North America), La Silla Observatory (South America), Tenerife Observatory (Europe), Siding Spring Observatory (Australia), Beijing Xinglong Observatory (Asia), Naukluft Plateau Observatory (Africa), Rutherford Appleton Laboratory (Antarctica)]
Azure OpenAI is listening. Say 'Stop' or press Ctrl-Z to end the conversation.
Conversation ended.
PS C:\dev\openai\python> 

Remarks

Now that you've completed the quickstart, here are some more considerations:

  • To change the speech recognition language, replace en-US with another supported language. For example, es-ES for Spanish (Spain). The default language is en-US if you don't specify a language. For details about how to identify one of multiple languages that might be spoken, see language identification.
  • To change the voice that you hear, replace en-US-JennyMultilingualNeural with another supported voice. If the voice doesn't speak the language of the text returned from Azure OpenAI, the Speech service doesn't output synthesized audio.
  • To use a different model, replace text-davinci-003 with the ID of another deployment. Keep in mind that the deployment ID isn't necessarily the same as the model name. You named your deployment when you created it in Azure OpenAI Studio.
  • Azure OpenAI also performs content moderation on the prompt inputs and generated outputs. The prompts or responses may be filtered if harmful content is detected. For more information, see the content filtering article.

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Next steps