Azure Text to Speech Synthesizer.WordBoundary method not working to get word audio duration

Question

Samir 21

I am using Azure Cognitive Services to generate audio files, and I would like to get speech marks for the generated audio. According to the following links, the WordBoundary event should give me the data I am looking for.

https://learn.microsoft.com/en-us/dotnet/api/microsoft.cognitiveservices.speech.speechsynthesizer.wordboundary?view=azure-dotnet
https://learn.microsoft.com/en-us/dotnet/api/microsoft.cognitiveservices.speech.speechsynthesiswordboundaryeventargs?view=azure-dotnet

Additionally, I have found a sample that shows how to subscribe to the WordBoundary event and get speech marks from it.
https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/csharp/sharedcontent/console/speech_synthesis_samples.cs

Method name: SynthesisWordBoundaryEventAsync

Based on this, I have created an Azure Function to return the required data; however, the WordBoundary event is not firing. The only difference I see is that the sample code was written for a console app, and I am trying to use it in an Azure Function.

Any feedback would be helpful.

Here is my function, based on the sample provided on GitHub.
text = the text, in the given language, to synthesize into an audio file
config = the Azure service config, with the given language and voice settings

using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

private static async Task SynthesisWordBoundaryEventAsync(string text, SpeechConfig config)
{
    // Creates a speech synthesizer with a null audio config.
    // This means the audio output data will not be written to any stream or device.
    // You can just get the audio from the result.
    using (var synthesizer = new SpeechSynthesizer(config, null as AudioConfig))
    {
        // Subscribes to the word boundary event before synthesis starts.
        synthesizer.WordBoundary += (s, e) =>
        {
            // The unit of e.AudioOffset is the tick (1 tick = 100 nanoseconds).
            // Dividing by 10,000 converts to milliseconds; adding 5,000 first rounds to the nearest ms.
            Console.WriteLine($"Word boundary event received. Audio offset: " +
                $"{(e.AudioOffset + 5000) / 10000}ms, text offset: {e.TextOffset}, word length: {e.WordLength}.");
        };

        using (var result = await synthesizer.SpeakTextAsync(text))
        {
            if (result.Reason == ResultReason.SynthesizingAudioCompleted)
            {
                Console.WriteLine("Speech synthesized.");
                var audioData = result.AudioData;
                Console.WriteLine($"{audioData.Length} bytes of audio data received for text [{text}]");
            }
            else if (result.Reason == ResultReason.Canceled)
            {
                var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
                Console.WriteLine($"CANCELED: Reason={cancellation.Reason}");

                if (cancellation.Reason == CancellationReason.Error)
                {
                    Console.WriteLine($"CANCELED: ErrorCode={cancellation.ErrorCode}");
                    Console.WriteLine($"CANCELED: ErrorDetails=[{cancellation.ErrorDetails}]");
                    Console.WriteLine("CANCELED: Did you update the subscription info?");
                }
            }
        }
    }
}
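
Since the goal is a per-word audio duration, one possible way to derive it from the event data is sketched below. It assumes each WordBoundary event has been recorded as its AudioOffset (in ticks), TextOffset and WordLength. The `WordMark` record and the `SpeechMarks` helper are hypothetical names for illustration, not part of the Speech SDK. A word's duration is approximated as the gap to the next word's start (so it includes any trailing silence), offsets are assumed to be non-decreasing, and the total audio length must be supplied by the caller for the last word.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for the values read from one WordBoundary event.
public record WordMark(ulong AudioOffsetTicks, uint TextOffset, uint WordLength);

public static class SpeechMarks
{
    // 1 tick = 100 ns, so 10,000 ticks = 1 ms; adding 5,000 rounds to the nearest ms.
    public static ulong TicksToMs(ulong ticks) => (ticks + 5000) / 10000;

    // Returns (start ms, duration ms) for each word. The duration is the gap to the
    // next word's start; the last word is assumed to end at totalAudioTicks.
    // Assumes marks are ordered by non-decreasing audio offset.
    public static List<(ulong StartMs, ulong DurationMs)> WithDurations(
        IReadOnlyList<WordMark> marks, ulong totalAudioTicks)
    {
        var result = new List<(ulong StartMs, ulong DurationMs)>();
        for (int i = 0; i < marks.Count; i++)
        {
            ulong start = marks[i].AudioOffsetTicks;
            ulong end = i + 1 < marks.Count ? marks[i + 1].AudioOffsetTicks : totalAudioTicks;
            result.Add((TicksToMs(start), TicksToMs(end) - TicksToMs(start)));
        }
        return result;
    }
}
```

Inside the handler, the values could be collected with something like `marks.Add(new WordMark(e.AudioOffset, e.TextOffset, e.WordLength))` in place of the Console.WriteLine call. If the SDK version in use exposes a duration on the event args directly, that value is likely the better choice; check the SpeechSynthesisWordBoundaryEventArgs reference linked above.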

Thanks,
Samir

romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator

2022-05-16T14:02:35.093+00:00

@Samir Did you try the sample with a console app to check whether the word boundaries are returned? I couldn't find a sample that would help set this up with Azure Functions. I am not an expert on Azure Functions, but I am checking internally whether there is a sample that could help run TTS with Azure Functions.

Samir 21 Reputation points

2022-05-17T01:37:17.867+00:00

@romungi-MSFT

No, I haven't tried the sample with a console app. Let me try it tomorrow, and I will share my findings with you.

Thanks,
Samir

Samir 21 Reputation points

2022-05-17T11:21:08.887+00:00

@romungi-MSFT Just tried it, and it did not work from a console app either.

Answer 1

Yulin Li 6 Microsoft Employee

Hi @Samir, I checked your code and it looks good.

Could you share your SDK log with us? (https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-use-logging)

If you cannot upload a file here, you can open an issue in our GitHub sample repo.
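
For reference, the logging guide linked above describes turning on file logging through a property on the config object. A minimal sketch, assuming the same `SpeechConfig` that is passed to the function in the question; the `SpeechLogging` helper name and the log path are placeholders, and in an Azure Function the path would need to point to a writable location:

```csharp
using Microsoft.CognitiveServices.Speech;

public static class SpeechLogging
{
    // Hypothetical helper: call this before creating the SpeechSynthesizer
    // so that the SDK writes its diagnostic log to logPath.
    public static void EnableFileLogging(SpeechConfig config, string logPath)
    {
        config.SetProperty(PropertyId.Speech_LogFilename, logPath);
    }
}
```

The resulting file is what can be attached here or to a GitHub issue.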

Samir 21 Reputation points

2023-01-15T21:57:27.6533333+00:00

@Yulin Li sorry for the late reply.

I tried a workaround: combining the AWS Polly JSON speech mark output with a percentage time adjustment for the audio from Microsoft. That workaround does not look like it will work for my future use case.

As you requested, I have attached the log. TextToSpeechLog.txt

Would you please help me resolve this issue? I am planning to use Text to Speech for multiple languages with the Microsoft engine, and I will need accurate speech marks without spending time adjusting them manually.

Let me know if you need any additional details from me.

@romungi-MSFT If you have any other suggestions, let me know.

Thanks,

Samir

Sid Sadel 65 Reputation points

2023-08-30T20:15:00.5933333+00:00

@Samir did you ever solve this?
