Hi @Abas Oladosu , Thanks for using the Microsoft Q&A Platform.
I don't think we can determine the duration of an audio clip without synthesizing it, because the duration is determined by the length of the synthesized speech, which in turn depends on the text being converted and the speaking rate.
The synthesis_word_boundary event is raised at the beginning of each new spoken word, punctuation, and sentence. The event reports the current word's time offset (in ticks) from the beginning of the output audio, as well as the character position in the input text immediately before the word that's about to be spoken. https://learn.microsoft.com/en-us/azure/cognitive-services/Speech-Service/how-to-speech-synthesis?tabs=browserjs%2Cterminal&pivots=programming-language-csharp#subscribe-to-synthesizer-events
Here is the sample code for word_boundary_event: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/python/console/speech_synthesis_sample.py
import azure.cognitiveservices.speech as speechsdk

# Replace with your own subscription key and service region.
speech_key = "YourSubscriptionKey"
service_region = "YourServiceRegion"

def speech_synthesis_word_boundary_event():
    """Performs speech synthesis and shows the word boundary event."""
    # Creates an instance of a speech config with the specified subscription key and service region.
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    # Creates a speech synthesizer with a null output stream.
    # This means the audio output data will not be written to any output channel.
    # You can just get the audio from the result.
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
    # Subscribes to the word boundary event.
    # The unit of evt.audio_offset is the tick (1 tick = 100 nanoseconds);
    # divide it by 10,000 to convert to milliseconds.
    speech_synthesizer.synthesis_word_boundary.connect(lambda evt: print(
        "Word boundary event received: {}, audio offset in ms: {}ms".format(evt, evt.audio_offset / 10000)))
    # Receives text from console input and synthesizes it to a result.
    while True:
        print("Enter some text that you want to synthesize, Ctrl-Z to exit")
        try:
            text = input()
        except EOFError:
            break
        result = speech_synthesizer.speak_text_async(text).get()
        # Check the result.
        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            print("Speech synthesized for text [{}]".format(text))
            audio_data = result.audio_data
            print("{} bytes of audio data received.".format(len(audio_data)))
        elif result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = result.cancellation_details
            print("Speech synthesis canceled: {}".format(cancellation_details.reason))
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                print("Error details: {}".format(cancellation_details.error_details))
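Once you have the synthesized bytes, you can also derive the actual duration from the byte count. A minimal sketch, assuming the default output format of 16 kHz, 16-bit, mono PCM (adjust the constants if you configured a different output format on the SpeechConfig):

```python
# Sketch: derive audio duration from a raw PCM byte count.
# Assumes 16 kHz sample rate, 2 bytes per sample, mono channel --
# these are assumptions, not values read from the service.
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2
CHANNELS = 1

def pcm_duration_seconds(num_bytes: int) -> float:
    """Return the playback duration of a PCM buffer in seconds."""
    bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS
    return num_bytes / bytes_per_second

# 32000 bytes of 16 kHz 16-bit mono audio is exactly 1 second.
print(pcm_duration_seconds(32000))
```

You could apply this to `len(result.audio_data)` after synthesis completes, but note that this still requires the synthesis to run first.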
Generally, this is the process for working with the word boundary event.
If you still want an estimate of the duration without synthesizing, my suggestion is to divide the word count of the text by an average speaking rate (assuming 150 WPM). This is just a rough average speaking rate for English (taken from https://virtualspeech.com/blog/average-speaking-rate-words-per-minute), and the actual duration of the synthesized speech may vary depending on the specific text and the voice used for synthesis. Unfortunately, I can't give you an exact value.
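That estimate can be sketched in a few lines (the 150 WPM figure is an assumed average, not a property of the Speech service, and the real synthesized duration will vary by voice and text):

```python
# Sketch: estimate speech duration from word count at an assumed
# average speaking rate of 150 words per minute.
def estimate_duration_seconds(text: str, words_per_minute: float = 150.0) -> float:
    """Estimate how long the synthesized speech might take, in seconds."""
    word_count = len(text.split())
    return word_count * 60.0 / words_per_minute

# 9 words at 150 WPM -> 3.6 seconds.
print(estimate_duration_seconds("The quick brown fox jumps over the lazy dog"))
```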
Try these and let us know.
I hope this helps.
Regards,
Vasavi
-Please kindly accept the answer and vote 'Yes' if you found it helpful, to support the community. Thanks.