Azure Text to Speech: SpeechSynthesizer.WordBoundary event not firing when trying to get word audio duration

Samir 41 Reputation points
2022-05-14T01:45:03.343+00:00

I am using Azure Cognitive Services to generate audio files, and I would like to get speech marks for the generated audio. According to the following links, the WordBoundary event should give me the data I am looking for.

https://learn.microsoft.com/en-us/dotnet/api/microsoft.cognitiveservices.speech.speechsynthesizer.wordboundary?view=azure-dotnet
https://learn.microsoft.com/en-us/dotnet/api/microsoft.cognitiveservices.speech.speechsynthesiswordboundaryeventargs?view=azure-dotnet

Additionally, I have found a sample that shows how to subscribe to the WordBoundary event and get speech marks from it.
https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/csharp/sharedcontent/console/speech_synthesis_samples.cs

Method name: SynthesisWordBoundaryEventAsync

Based on this, I have created an Azure Function to return the required data; however, the WordBoundary event is not firing. The only difference I see is that the sample code was written for a console app, whereas I am running it in an Azure Function.

Any feedback would be helpful.

Here is my updated function, based on the sample provided on GitHub.
text = the text, in the given language, from which the audio file is generated
config = the Azure service config with the given language and voice type settings

private static async Task SynthesisWordBoundaryEventAsync(string text, SpeechConfig config)
{
    // Creates a speech synthesizer with a null audio output config.
    // This means the audio output data will not be written to any stream.
    // You can just get the audio from the result.
    using (var synthesizer = new SpeechSynthesizer(config, null as AudioConfig))
    {
        // Subscribes to the word boundary event.
        synthesizer.WordBoundary += (s, e) =>
        {
            // e.AudioOffset is in ticks (1 tick = 100 nanoseconds). Dividing by 10,000 converts
            // to milliseconds; adding 5,000 first rounds to the nearest millisecond.
            Console.WriteLine("Word boundary event received. Audio offset: " +
                $"{(e.AudioOffset + 5000) / 10000}ms, text offset: {e.TextOffset}, word length: {e.WordLength}.");
        };

        using (var result = await synthesizer.SpeakTextAsync(text))
        {
            if (result.Reason == ResultReason.SynthesizingAudioCompleted)
            {
                Console.WriteLine("Speech synthesized.");
                var audioData = result.AudioData;
                Console.WriteLine($"{audioData.Length} bytes of audio data received for text [{text}]");
            }
            else if (result.Reason == ResultReason.Canceled)
            {
                var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
                Console.WriteLine($"CANCELED: Reason={cancellation.Reason}");

                if (cancellation.Reason == CancellationReason.Error)
                {
                    Console.WriteLine($"CANCELED: ErrorCode={cancellation.ErrorCode}");
                    Console.WriteLine($"CANCELED: ErrorDetails=[{cancellation.ErrorDetails}]");
                    Console.WriteLine("CANCELED: Did you update the subscription info?");
                }
            }
        }
    }
}

Thanks,
Samir
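
For readers after the same "word audio duration" data, here is a minimal sketch of how the values reported by the event (AudioOffset in ticks, TextOffset, WordLength) might be turned into per-word timings once the handler does fire. The WordMark record, the SpeechMarks class, and the totalAudioMs parameter are hypothetical names introduced for illustration and are not part of the Speech SDK. The sketch assumes a word lasts until the next word starts, which folds any pause into the preceding word. If your SDK version exposes a duration directly on the event args, that would likely be more accurate. Collecting the events into a list, rather than only writing them to the console, may also make it easier to confirm inside an Azure Function whether the handler ran, since console output may not appear in the function's logs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the three values read from the WordBoundary event args.
public record WordMark(ulong AudioOffsetTicks, uint TextOffset, uint WordLength);

public static class SpeechMarks
{
    private const ulong TicksPerMillisecond = 10_000; // 1 tick = 100 nanoseconds

    // Converts ticks to milliseconds, rounding to the nearest millisecond.
    public static ulong TicksToMilliseconds(ulong ticks) =>
        (ticks + TicksPerMillisecond / 2) / TicksPerMillisecond;

    // Assumption: each word lasts until the next word starts;
    // the last word lasts until the end of the audio (totalAudioMs).
    public static List<(string Word, ulong StartMs, ulong DurationMs)> ToTimings(
        string text, IReadOnlyList<WordMark> marks, ulong totalAudioMs)
    {
        var timings = new List<(string Word, ulong StartMs, ulong DurationMs)>();
        for (int i = 0; i < marks.Count; i++)
        {
            ulong start = TicksToMilliseconds(marks[i].AudioOffsetTicks);
            ulong end = i + 1 < marks.Count
                ? TicksToMilliseconds(marks[i + 1].AudioOffsetTicks)
                : totalAudioMs;
            string word = text.Substring((int)marks[i].TextOffset, (int)marks[i].WordLength);
            // Guard against out-of-order or overlapping offsets.
            timings.Add((word, start, end > start ? end - start : 0));
        }
        return timings;
    }
}
```

For example, with the text "Hello world", marks at 500,000 ticks (text offset 0, length 5) and 5,000,000 ticks (text offset 6, length 5), and a total audio length of 1000 ms, ToTimings yields ("Hello", 50, 450) and ("world", 500, 500).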

Azure AI Speech

An Azure service that integrates speech processing into apps and services.


1 answer

  1. Yulin Li 6 Reputation points Microsoft Employee
    2022-05-19T10:09:31.507+00:00

    Hi @Samir, I checked your code and it looks good.

    Could you share your SDK log with us? (https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-use-logging)

    If you cannot upload a file here, you can open an issue in our GitHub sample repo.
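
    As a rough sketch of what capturing that log could look like in C#: the Speech SDK writes a log file when the Speech_LogFilename property is set on the config before the synthesizer is created. The helper name EnableSdkLog, the file name, and the choice of the temp directory are assumptions made for illustration; in an Azure Function you would need to pick a location that is writable in your hosting plan.

    ```csharp
    using System.IO;
    using Microsoft.CognitiveServices.Speech;

    public static class SpeechLogging
    {
        // Hypothetical helper: turns on Speech SDK file logging for the given config.
        // Call it before constructing the SpeechSynthesizer from this config.
        public static string EnableSdkLog(SpeechConfig config)
        {
            // Assumption: the temp directory is writable in the hosting environment.
            string logPath = Path.Combine(Path.GetTempPath(), "speech-sdk.log");
            config.SetProperty(PropertyId.Speech_LogFilename, logPath);
            return logPath;
        }
    }
    ```

    The returned path can then be written to the function's own log so the file is easy to locate afterwards.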
