Hi @Nitish Kumar ,
Thank you for your question about using the Azure Speech SDK with low-latency input streaming, so that you can listen to an LLM's output in real time as the text is being generated. I'll be happy to assist you with this.
Regarding your query: Azure OpenAI's GPT models can generate text in chunks (via streaming) without waiting for the full response, which allows a more interactive, real-time conversation.
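For example, when streaming is enabled (`stream=True` on the chat-completion call), the response arrives as small text deltas rather than one complete message. One way to bridge those deltas to a TTS call is to buffer them into complete sentences first. The sketch below simulates the deltas with a plain list; `sentences_from_stream` is a hypothetical helper, not part of any SDK:

```python
import re

def sentences_from_stream(deltas):
    """Accumulate streamed text deltas and yield complete sentences.

    `deltas` stands in for the incremental text pieces a streaming
    chat-completion call (stream=True) would produce.
    """
    buffer = ""
    for delta in deltas:
        buffer += delta
        # Split off any complete sentences; keep the remainder buffered.
        while True:
            match = re.search(r"[.!?]\s+", buffer)
            if not match:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()

# Simulated stream of deltas, as a streaming completion would emit them:
fake_deltas = ["Hel", "lo there. ", "How are", " you today? ", "Fine"]
print(list(sentences_from_stream(fake_deltas)))
# → ['Hello there.', 'How are you today?', 'Fine']
```

Each yielded sentence can then be handed to the speech synthesizer as soon as it is complete, instead of waiting for the whole response.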
To achieve real-time speech synthesis with continuous chunks of text using the Azure Text-to-Speech API, you would need to implement streaming yourself.
Here's a general Python approach you can follow using Azure OpenAI and the Azure Speech SDK:

```python
import os

import openai
import azure.cognitiveservices.speech as speechsdk

# Set up the Azure OpenAI configuration (as you've already done)
openai.api_type = "azure"
openai.api_base = "OPENAI_API_BASE"
openai.api_version = "2023-07-01-preview"
openai.api_key = "OPENAI_API_KEY"  # or os.getenv("OPENAI_API_KEY")

def generate_chat_completion(prompt):
    # Generate text using the Azure OpenAI deployment
    response = openai.ChatCompletion.create(
        engine="test_Chatgpt",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=800,
        top_p=0.95,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
    )
    return response.choices[0].message["content"]

def synthesize_and_stream(text_chunks):
    speech_config = speechsdk.SpeechConfig(
        subscription="YOUR_AZURE_SUBSCRIPTION_KEY",
        region="YOUR_AZURE_SUBSCRIPTION_REGION",
    )
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    for chunk in text_chunks:
        result = speech_synthesizer.speak_text_async(chunk).get()
        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            print("Speech synthesized for text [{}]".format(chunk))
            # Handle the audio output here (e.g., play it or save it to a file)
        elif result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = result.cancellation_details
            print("Speech synthesis canceled: {}".format(cancellation_details.reason))
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                if cancellation_details.error_details:
                    print("Error details: {}".format(cancellation_details.error_details))
                    print("Did you update the subscription info?")

# Example usage
user_input = "A long, relaxing text that needs to be synthesized in chunks"
chunk_size = 200  # Size of each text chunk, in characters
text_chunks = [user_input[i:i + chunk_size] for i in range(0, len(user_input), chunk_size)]

# Generate a chat completion for each chunk and synthesize the response
for chunk in text_chunks:
    chat_response = generate_chat_completion(chunk)
    synthesize_and_stream([chat_response])
```
Please replace the placeholders in the above code with your actual Azure OpenAI endpoint, API key, and deployment name, and your Speech resource key and region. This code breaks the input text into chunks and synthesizes each chunk individually, allowing you to manage the streaming of continuous text effectively.
Keep in mind that this is a simplified example that I tried at my end based on the documentation. You may need to fine-tune it for your specific requirements and for integration with your application.
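As a side note, the fixed-size slicing above (`chunk_size = 200`) can cut words and sentences in half, which hurts the synthesized voice. A sentence-aware splitter is a small improvement; `chunk_text` below is a hypothetical helper, not part of any SDK:

```python
import re

def chunk_text(text, max_len=200):
    """Split text into chunks of at most max_len characters,
    preferring sentence boundaries over mid-word cuts.
    The 200-character default mirrors chunk_size in the example above."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would overflow.
        if current and len(current) + 1 + len(sentence) > max_len:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

text = "First sentence here. Second one follows. A third, slightly longer sentence ends it."
for c in chunk_text(text, max_len=45):
    print(c)
```

Each chunk then stays a natural-sounding unit when passed to `speak_text_async`.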
Here is a step-by-step explanation of the Python implementation above:
- Break your text into smaller chunks: Divide the text you want to synthesize into smaller, manageable chunks.
- Send each chunk for synthesis: Send each chunk of text to the Text-to-Speech API for synthesis. You would typically use the speak_text_async method for each chunk.
- Handle the audio output: As each chunk is synthesized, you will receive audio output. You can then play this audio output or save it to a file, depending on your application's requirements.
- Manage latency: To achieve low latency, you will need to ensure that you start processing the next chunk of text while the previous one is being synthesized. This way, you can achieve a more real-time experience.
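To illustrate the last point, the steps above can be overlapped with a simple producer/consumer pipeline. This is only a sketch: the stand-in worker records each chunk instead of calling the Speech SDK, so it runs anywhere; in a real integration the worker body would call `speech_synthesizer.speak_text_async(chunk)` while the producer keeps generating the next chunk.

```python
import queue
import threading

def synthesize_worker(q, results):
    """Consume text chunks from the queue and 'synthesize' them.
    Stand-in for the Speech SDK call, so earlier chunks are processed
    while later ones are still being produced."""
    while True:
        chunk = q.get()
        if chunk is None:  # sentinel: no more chunks
            break
        results.append("synthesized: " + chunk)
        q.task_done()

q = queue.Queue()
results = []
worker = threading.Thread(target=synthesize_worker, args=(q, results))
worker.start()

# The producer (e.g., the streaming chat completion) keeps feeding
# chunks while the worker handles earlier ones in parallel.
for chunk in ["chunk one", "chunk two", "chunk three"]:
    q.put(chunk)
q.put(None)
worker.join()
print(results)
# → ['synthesized: chunk one', 'synthesized: chunk two', 'synthesized: chunk three']
```

A single worker keeps the chunks in order, which matters for speech: audio must play in the same sequence the text was generated.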
Below is the output I got at my end by following the steps mentioned in the documentation link below.
Note: The output matches the text stream passed as input, and the voice sounded fine at my end for this text. Unfortunately, I couldn't attach the audio file here; I hope you understand.
Output
Please have a look into the below documentation for more details:
How to lower speech synthesis latency using Speech SDK - Azure AI services | Microsoft Learn
I hope this information helps! Please try it out and let me know how it works.