How to reliably get all word boundary events from synthesized text with Azure Speech SDK

Gvidas Šniolis 0

Hello,

I am building a project that provides an interface for text-to-speech operations. The project consists of a front-end website and an ASP.NET Core Web API. The main goal is to return the synthesized audio and the word boundaries to the front end in one go. That part works: my current solution synthesizes the text in one method and gathers the events in a separate method, which synthesizes the text a second time while listening to the word boundary events and collecting them all. I am using

SpeechSynthesizer.SpeakTextAsync(string text)

to synthesize audio from the input text.

The issue I am facing is this: as I understand it, .SpeakTextAsync(string text) returns before the text has finished synthesizing, so after I await the call I can still keep receiving events, which is not ideal. What I would like is a guarantee that once the code has finished awaiting .SpeakTextAsync(string text), all the word boundary events have been emitted.

Now my questions are:

Is my understanding of how the SDK works in this case correct? If not, what am I getting wrong?
What can I do to reliably get all of the word boundary events before calling the .SaveSpeechMarksToFile(...) method?
Maybe there is a more elegant solution to this with the Azure Speech SDK and I am simply not seeing it?

The code currently looks like this

...

private ConcurrentDictionary<string, List<SpeechMarkDTO>> _wordBoundaries = new();

...
...

private async Task<SpeechMarksResultDTO?> GetSpeechMarks(RequestToSpeechDTO request, VoiceDTO voice)
{
    var speechConfig = _configuration.GetSpeechSDKConfiguration();
    
    // Synthesis configuration
    speechConfig.SpeechSynthesisVoiceName = voice.UniqueName;
    speechConfig.OutputFormat = Microsoft.CognitiveServices.Speech.OutputFormat.Detailed;

    // Passing null as AudioConfig prevents the SDK from playing the audio as it synthesizes.
    using (var speechSynthesizer = new SpeechSynthesizer(speechConfig, null as AudioConfig))
    {
        // Handlers for events emitted by the SDK.
        var handler = new EventHandler<SpeechSynthesisWordBoundaryEventArgs>((s, args) => WordBoundaryReceived(s, args, request.ItemId, nameof(GetSpeechMarks)));
        var completionHandler = new EventHandler<SpeechSynthesisEventArgs>((s, args) => { _logger.LogInformation("SYNTHESIS COMPLETED"); });
        try
        {
            speechSynthesizer.SynthesisCompleted += completionHandler;
            speechSynthesizer.WordBoundary += handler;

            _logger.LogInformation("Will await .SpeakTextAsync()");
            var synthesisResult = await speechSynthesizer.SpeakTextAsync(request.Text);
            _logger.LogInformation("Finished awaiting .SpeakTextAsync()");

            // When the code reaches the following line, my understanding is that we are not guaranteed to have all of the word boundary events.
            // TryGetValue avoids a KeyNotFoundException when no event has been recorded for this item yet.
            var boundaries = _wordBoundaries.TryGetValue(request.ItemId, out var recorded) ? recorded : new List<SpeechMarkDTO>();
            var speechMarksUri = await _storageService.SaveSpeechMarksToFile((int)request.UserId, request.ItemId, boundaries);

            return new SpeechMarksResultDTO(boundaries, speechMarksUri);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error happened while getting speech marks.");
            return null;
        }
        finally
        {
            // Unsubscribe on both the success and the failure path.
            speechSynthesizer.SynthesisCompleted -= completionHandler;
            speechSynthesizer.WordBoundary -= handler;
        }
    }
}
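
A minimal sketch of the kind of guarantee asked for above: wait on a TaskCompletionSource that a completion handler sets, with a timeout as a safety net. FakeSynthesizer, its Completed event, and SpeakAsync are hypothetical stand-ins so the example runs without the Speech SDK; it only illustrates the waiting pattern. Whether the real SDK always raises SynthesisCompleted after the last WordBoundary event is an assumption that would need to be verified.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for the synthesizer (not part of the Speech SDK).
var synth = new FakeSynthesizer();
var words = new List<string>();

// Completed by the completion handler once the final event has been raised.
var allEventsSeen = new TaskCompletionSource<bool>(
    TaskCreationOptions.RunContinuationsAsynchronously);

EventHandler<string> onWord = (_, w) => { lock (words) words.Add(w); };
EventHandler onCompleted = (_, _) => allEventsSeen.TrySetResult(true);

synth.WordBoundary += onWord;
synth.Completed += onCompleted;
try
{
    // Returns immediately, while events are still being raised in the background.
    await synth.SpeakAsync("one two three");

    // Wait for the completion signal, but never hang forever.
    var finished = await Task.WhenAny(
        allEventsSeen.Task, Task.Delay(TimeSpan.FromSeconds(5)));
    if (finished != allEventsSeen.Task)
        throw new TimeoutException("Completion event was not raised in time.");

    lock (words) Console.WriteLine(string.Join(",", words));
}
finally
{
    // Unsubscribe on both the success and the failure path.
    synth.WordBoundary -= onWord;
    synth.Completed -= onCompleted;
}

// Raises one WordBoundary event per word, then Completed, on a background task.
sealed class FakeSynthesizer
{
    public event EventHandler<string>? WordBoundary;
    public event EventHandler? Completed;

    public Task SpeakAsync(string text)
    {
        _ = Task.Run(async () =>
        {
            foreach (var word in text.Split(' '))
            {
                await Task.Delay(20);
                WordBoundary?.Invoke(this, word);
            }
            Completed?.Invoke(this, EventArgs.Empty);
        });
        return Task.CompletedTask; // completes before any event is raised
    }
}
```

With the real SpeechSynthesizer, the same TaskCompletionSource could be set from the SynthesisCompleted handler already present in the code above, provided that ordering assumption holds.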

I am also attaching a screenshot showing that a word boundary event was received after the call to .SpeakTextAsync(string text) had been awaited.

Grateful for all the help in advance!

Gvidas Šniolis 0 Reputation points

2025-04-22T13:27:19.1433333+00:00

EDIT: Adding a screenshot of logs, illustrating the issue.
