How to reliably get all word boundary events from synthesized text with Azure Speech SDK

Gvidas Šniolis 0 Reputation points
2025-04-22T13:25:07.1333333+00:00

Hello,

I am building a project that provides an interface for text-to-speech operations. The project consists of a front-end website and an ASP.NET Core Web API. The main goal is to return the synthesized audio and the word boundaries to the front end in one go. That part works: my current solution synthesizes the text with one method and gathers the events with a separate method, which synthesizes the text a second time, listens for the synthesizer's WordBoundary events, and collects them all. I am using

SpeechSynthesizer.SpeakTextAsync(string text)

to synthesize audio from the input text.

The issue I am facing is this: as I understand it, .SpeakTextAsync(string text) returns before the text has finished synthesizing, which means that after I await the call I can still keep receiving events. That is not ideal. What I would like is a guarantee that once the code has finished awaiting .SpeakTextAsync(string text), all the word boundary events have been emitted.

My questions are:

  1. Is my understanding of how the SDK works in this case correct? If not, what am I getting wrong?
  2. What can I do to reliably get all of the word boundary events before the call to the .SaveSpeechMarksToFile(...) method?
  3. Is there a more elegant solution to this in the Azure Speech SDK that I am simply not seeing?

The code currently looks like this:

...

private ConcurrentDictionary<string, List<SpeechMarkDTO>> _wordBoundaries = new();

...
...

private async Task<SpeechMarksResultDTO?> GetSpeechMarks(RequestToSpeechDTO request, VoiceDTO voice)
{
    var speechConfig = _configuration.GetSpeechSDKConfiguration();
    
    // Synthesis configuration
    speechConfig.SpeechSynthesisVoiceName = voice.UniqueName;
    speechConfig.OutputFormat = Microsoft.CognitiveServices.Speech.OutputFormat.Detailed;
    
    // Passing null as the AudioConfig prevents the SDK from autoplaying audio as it synthesizes.
    using (var speechSynthesizer = new SpeechSynthesizer(speechConfig, null as AudioConfig))
    {
        // Handlers for events emitted by the SDK.
        var handler = new EventHandler<SpeechSynthesisWordBoundaryEventArgs>((s, args) => WordBoundaryReceived(s, args, request.ItemId, nameof(GetSpeechMarks)));
        var completionHandler = new EventHandler<SpeechSynthesisEventArgs>((s, args) => { _logger.LogInformation("SYNTHESIS COMPLETED"); });
        try
        {
            speechSynthesizer.SynthesisCompleted += completionHandler;
            speechSynthesizer.WordBoundary += handler;

            _logger.LogInformation("Will await .SpeakTextAsync()");
            var synthesisResult = await speechSynthesizer.SpeakTextAsync(request.Text);
            _logger.LogInformation("Finished awaiting .SpeakTextAsync()");
            
            // When the code reaches the following line, my understanding is that we are not guaranteed to have all of the word boundary events.
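            // TryGetValue avoids a KeyNotFoundException when no event has been stored for this item yet.
            var boundaries = _wordBoundaries.TryGetValue(request.ItemId, out var collected) ? collected : new List<SpeechMarkDTO>();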
            var speechMarksUri = await _storageService.SaveSpeechMarksToFile((int)request.UserId, request.ItemId, boundaries);
            
            speechSynthesizer.WordBoundary -= handler;
            speechSynthesizer.SynthesisCompleted -= completionHandler;
            
            return new SpeechMarksResultDTO(boundaries, speechMarksUri);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error happened while getting speech marks.");

            speechSynthesizer.SynthesisCompleted -= completionHandler;
            speechSynthesizer.WordBoundary -= handler;

            return null;
        }
    }
}


I am also attaching a screenshot showing that a word boundary event was received after the call to .SpeakTextAsync(string text) had been awaited.
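
For reference, here is a minimal sketch of one direction that might address questions 2 and 3: synthesize once, collect the word boundaries in a per-call list, and wait for the SynthesisCompleted (or SynthesisCanceled) event before reading that list. The names BoundaryCollector, WordMark, SynthesizeWithBoundariesAsync and the timeout parameter are hypothetical and not part of the SDK. The sketch also assumes that the SDK raises all WordBoundary events before SynthesisCompleted, which I have not been able to confirm, so the timeout is only a safety net.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Hypothetical DTO: a plain copy of the fields of one word boundary event.
public record WordMark(string Text, ulong AudioOffsetTicks, uint TextOffset, uint WordLength);

public static class BoundaryCollector
{
    // Hypothetical helper: synthesizes the text once and returns both the audio
    // bytes and the word boundaries collected during that same synthesis.
    public static async Task<(byte[] Audio, List<WordMark> Marks)> SynthesizeWithBoundariesAsync(
        SpeechConfig speechConfig, string text, TimeSpan timeout)
    {
        var marks = new List<WordMark>();
        var gate = new object(); // events may arrive on SDK threads, so guard the list
        var finished = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        // null as AudioConfig keeps the SDK from playing the audio.
        using var synthesizer = new SpeechSynthesizer(speechConfig, null as AudioConfig);

        synthesizer.WordBoundary += (s, e) =>
        {
            lock (gate)
            {
                marks.Add(new WordMark(e.Text, e.AudioOffset, e.TextOffset, e.WordLength));
            }
        };
        // Either event ends the wait; the result's Reason tells the two cases apart.
        synthesizer.SynthesisCompleted += (s, e) => finished.TrySetResult(true);
        synthesizer.SynthesisCanceled += (s, e) => finished.TrySetResult(false);

        using var result = await synthesizer.SpeakTextAsync(text);
        if (result.Reason != ResultReason.SynthesizingAudioCompleted)
        {
            throw new InvalidOperationException($"Synthesis did not complete: {result.Reason}");
        }

        // Assumption: once SynthesisCompleted has fired, no further WordBoundary
        // events are pending. The timeout avoids waiting forever if it never fires.
        await Task.WhenAny(finished.Task, Task.Delay(timeout));

        lock (gate)
        {
            // Return a snapshot so late events cannot mutate the caller's list.
            return (result.AudioData, new List<WordMark>(marks));
        }
    }
}
```

With something like this, GetSpeechMarks could call SynthesizeWithBoundariesAsync(speechConfig, request.Text, TimeSpan.FromSeconds(5)) and pass the returned marks to SaveSpeechMarksToFile, which would also remove the need for the shared _wordBoundaries dictionary and for the second synthesis pass.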

Grateful for all the help in advance!

Developer technologies | C#