Does Azure Speech allow low-latency input streaming? In other words, can we listen to LLMs in real time as the text is being generated?

Nitish Kumar 50 Reputation points
2023-10-07T13:16:11.5666667+00:00

Hello,

Azure OpenAI can generate chunks of text as a stream, without waiting for the full response.

Recently, the ElevenLabs API added support for low-latency input streaming.


Accepted answer
  1. dupammi 8,615 Reputation points Microsoft External Staff
    2023-10-11T11:07:30.5433333+00:00

    Hi @Nitish Kumar ,

    Following up to see whether my "comment" answer above, in the comments section of this thread, helped. Do let us know if you have any queries.

    To reiterate the resolution here, let me jot down the gist of my comment answer above.

    Yes, Azure OpenAI's GPT models can generate text in chunks without waiting for the full response, allowing for a more interactive, real-time conversation.
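    For reference, the chunked generation works by passing stream=True to the openai Python package's ChatCompletion call. Below is a minimal sketch, assuming the pre-1.0 openai package; the deployment name "test_Chatgpt" and the configuration values are placeholders:

    import openai

    # Assumes openai.api_type / api_base / api_version / api_key are already
    # configured for Azure OpenAI, as in the comment answer referenced above.
    response = openai.ChatCompletion.create(
        engine="test_Chatgpt",  # placeholder deployment name
        messages=[{"role": "user", "content": "Say hello"}],
        stream=True,  # yield tokens as they are generated
    )

    for chunk in response:
        if chunk["choices"]:  # some service chunks carry no choices
            delta = chunk["choices"][0]["delta"]
            print(delta.get("content", ""), end="", flush=True)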

    Please have a look at the sample implementation done in the above "comment".

    Please 'Accept as answer' and ‘Upvote’ if it helped so that it can help others in the community looking for help on similar topics. Thank you!


1 additional answer

  1. dupammi 8,615 Reputation points Microsoft External Staff
    2023-10-09T10:44:54.71+00:00

    Hi @Nitish Kumar ,

    Thank you for your question about using the Azure Speech SDK for low-latency input streaming, so that you can listen to LLMs in real time as the text is being generated. I will be happy to assist you with this.

    Regarding your query: yes, Azure OpenAI's GPT models can generate text in chunks without waiting for the full response, allowing for a more interactive, real-time conversation.

    To achieve real-time speech synthesis with continuous chunks of text using the Azure Text-to-Speech API, you would need to implement streaming yourself.

    Here's a general Python approach you can follow using Azure OpenAI and the Azure Speech SDK:

    import os
    import openai
    import azure.cognitiveservices.speech as speechsdk
    
    # Set up the Azure OpenAI configuration (replace the placeholders).
    openai.api_type = "azure"
    openai.api_base = "OPENAI_API_BASE"
    openai.api_version = "2023-07-01-preview"
    openai.api_key = "OPENAI_API_KEY"  # or os.getenv("OPENAI_API_KEY")
    
    def generate_chat_completion(prompt):
        # Generate a chat completion with Azure OpenAI.
        response = openai.ChatCompletion.create(
            engine="test_Chatgpt",  # your deployment name
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=800,
            top_p=0.95,
            frequency_penalty=0,
            presence_penalty=0,
            stop=None
        )
        return response.choices[0].message["content"]
    
    def synthesize_and_stream(text_chunks):
        speech_config = speechsdk.SpeechConfig(subscription="YOUR_AZURE_SUBSCRIPTION_KEY", region="YOUR_AZURE_SUBSCRIPTION_REGION")
        speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    
        # Synthesize each chunk in turn; audio plays on the default speaker.
        for chunk in text_chunks:
            result = speech_synthesizer.speak_text_async(chunk).get()
    
            if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
                print("Speech synthesized for text [{}]".format(chunk))
                # Handle the audio output here (e.g., play or save to a file).
            elif result.reason == speechsdk.ResultReason.Canceled:
                cancellation_details = result.cancellation_details
                print("Speech synthesis canceled: {}".format(cancellation_details.reason))
                if cancellation_details.reason == speechsdk.CancellationReason.Error:
                    if cancellation_details.error_details:
                        print("Error details: {}".format(cancellation_details.error_details))
                print("Did you update the subscription info?")
    
    # Example usage
    user_input = "A long relaxing speech that needs to be synthesized in chunks"
    chunk_size = 200  # Define the size of each text chunk
    text_chunks = [user_input[i:i + chunk_size] for i in range(0, len(user_input), chunk_size)]
    
    # Generate a chat completion for each input chunk and speak the response.
    for chunk in text_chunks:
        chat_response = generate_chat_completion(chunk)
    
        # Synthesize speech for the chat response
        synthesize_and_stream([chat_response])
    

    Please replace the placeholders in the above code with your actual Azure Text-to-Speech subscription key and region. This code breaks the input text into chunks and synthesizes each chunk individually, allowing you to manage the streaming of continuous text effectively.

    Keep in mind that this is a simplified example that I tried at my end based on the documentation. You may need to fine-tune it for your specific requirements and for integration with your application.

    Here is a step-by-step explanation of the above Python implementation:

    1. Break your text into smaller chunks: Divide the text you want to synthesize into smaller, manageable chunks.
    2. Send each chunk for synthesis: Send each chunk of text to the Text-to-Speech API for synthesis. You would typically use the speak_text_async method for each chunk.
    3. Handle the audio output: As each chunk is synthesized, you will receive audio output. You can then play this audio output or save it to a file, depending on your application's requirements.
    4. Manage latency: To achieve low latency, make sure you start processing the next chunk of text while the previous one is being synthesized. This way you can get closer to a real-time experience (see the sketch after this list).
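
    To make step 4 concrete, here is one possible sketch, not a tested production implementation: it streams tokens from Azure OpenAI with stream=True, buffers them into sentences, and queues each sentence on the synthesizer while later tokens are still arriving. The configuration values and the deployment name are placeholders, and it assumes the pre-1.0 openai package:

    import openai
    import azure.cognitiveservices.speech as speechsdk

    # Placeholder configuration -- replace with your own resource details.
    openai.api_type = "azure"
    openai.api_base = "OPENAI_API_BASE"
    openai.api_version = "2023-07-01-preview"
    openai.api_key = "OPENAI_API_KEY"

    speech_config = speechsdk.SpeechConfig(subscription="YOUR_AZURE_SUBSCRIPTION_KEY", region="YOUR_AZURE_SUBSCRIPTION_REGION")
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

    def stream_chat_to_speech(prompt):
        # stream=True returns tokens as they are generated, instead of one
        # complete response at the end.
        response = openai.ChatCompletion.create(
            engine="test_Chatgpt",  # placeholder deployment name
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )

        buffer = ""
        pending = []  # futures for in-flight synthesis requests
        for chunk in response:
            if not chunk["choices"]:
                continue
            buffer += chunk["choices"][0]["delta"].get("content", "")
            # Flush at sentence boundaries so the synthesizer receives
            # natural phrases rather than single tokens.
            if buffer.endswith((".", "!", "?", "\n")):
                # speak_text_async queues the request; we keep reading
                # tokens while earlier audio is being produced.
                pending.append(synthesizer.speak_text_async(buffer))
                buffer = ""
        if buffer:
            pending.append(synthesizer.speak_text_async(buffer))
        for future in pending:
            future.get()  # wait for all queued synthesis to finish

    stream_chat_to_speech("Tell me a short story about the ocean.")

    Buffering to sentence boundaries trades a little latency for better prosody; flushing on every token would start audio sooner but sound choppy.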

    Below is the output I got at my end by following the steps in the documentation linked below.

    Note: The output corresponds to the text stream passed as input. The voice also sounded fine at my end for the text stream below. For some reason, I couldn't attach the audio file here; I hope you understand.

    Output: [screenshot of console output]

    Please have a look at the documentation below for more details:

    How to lower speech synthesis latency using Speech SDK - Azure AI services | Microsoft Learn
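
    On the Speech SDK side, that article covers ways to receive audio before synthesis completes. As an illustrative sketch (my own example, not taken verbatim from the article), the synthesizer's synthesizing event fires with partial audio chunks as they arrive:

    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription="YOUR_AZURE_SUBSCRIPTION_KEY", region="YOUR_AZURE_SUBSCRIPTION_REGION")
    # audio_config=None stops the SDK from playing to the default speaker,
    # so we can consume the audio bytes ourselves as they stream in.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

    def on_synthesizing(evt):
        # Fires repeatedly with partial audio, before the full result is ready.
        print("received {} bytes of audio".format(len(evt.result.audio_data)))

    synthesizer.synthesizing.connect(on_synthesizing)
    synthesizer.speak_text_async("First audio bytes arrive before synthesis finishes.").get()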

    I hope this information helps! Give it a try and let me know how it works.

    1 person found this answer helpful.
