How to reliably get all word boundary events from synthesized text with Azure Speech SDK
Hello,
I am building a project that provides an interface for text-to-speech operations. The project consists of a front-end website and an ASP.NET Core Web API. The main goal is to return the synthesized audio and the word boundaries to the front end in one go. Overall, that is not a problem: my current solution synthesizes the text with one method and gathers the events with a separate method, which synthesizes the text a second time, listens for the Speech Synthesis Word Boundary events, and collects them all. I am using SpeechSynthesizer.SpeakTextAsync(string text) to synthesize audio from the input text.
The issue I am facing is this: according to my understanding, .SpeakTextAsync(string text) returns before the text has finished synthesizing, meaning that after I await the call I can still keep receiving events, which is not ideal. What I would like is a guarantee that once the code has finished awaiting .SpeakTextAsync(string text), all the word boundary events have been emitted.
Now my questions are:
- Is my understanding of how the SDK works in this case correct? If not, what am I getting wrong?
- What can I do to reliably get all of the word boundary events before the call to the .SaveSpeechMarksToFile(...) method?
- Is there perhaps a more elegant solution to this with the Azure Speech SDK that I am simply not seeing?
The code currently looks like this:
...
private ConcurrentDictionary<string, List<SpeechMarkDTO>> _wordBoundaries = new();
...
...
private async Task<SpeechMarksResultDTO?> GetSpeechMarks(RequestToSpeechDTO request, VoiceDTO voice)
{
    var speechConfig = _configuration.GetSpeechSDKConfiguration();
    // Synthesis configuration
    speechConfig.SpeechSynthesisVoiceName = voice.UniqueName;
    speechConfig.OutputFormat = Microsoft.CognitiveServices.Speech.OutputFormat.Detailed;
    // Passing null as the AudioConfig prevents the SDK from playing the audio as it synthesizes.
    using (var speechSynthesizer = new SpeechSynthesizer(speechConfig, null as AudioConfig))
    {
        // Handlers for events emitted by the SDK.
        var handler = new EventHandler<SpeechSynthesisWordBoundaryEventArgs>((s, args) => WordBoundaryReceived(s, args, request.ItemId, nameof(GetSpeechMarks)));
        var completionHandler = new EventHandler<SpeechSynthesisEventArgs>((s, args) => { _logger.LogInformation("SYNTHESIS COMPLETED"); });
        speechSynthesizer.SynthesisCompleted += completionHandler;
        speechSynthesizer.WordBoundary += handler;
        try
        {
            _logger.LogInformation("Will await .SpeakTextAsync()");
            var synthesisResult = await speechSynthesizer.SpeakTextAsync(request.Text);
            _logger.LogInformation("Finished awaiting .SpeakTextAsync()");
            // My understanding is that when the code reaches the following line,
            // it is not guaranteed that all of the word boundary events have been received.
            // TryGetValue is used because the ConcurrentDictionary indexer throws
            // KeyNotFoundException for a missing key instead of returning null.
            var boundaries = _wordBoundaries.TryGetValue(request.ItemId, out var collected)
                ? collected
                : new List<SpeechMarkDTO>();
            var speechMarksUri = await _storageService.SaveSpeechMarksToFile((int)request.UserId, request.ItemId, boundaries);
            return new SpeechMarksResultDTO(boundaries, speechMarksUri);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error happened while getting speech marks.");
            return null;
        }
        finally
        {
            // Unsubscribe on both the success path and the failure path.
            speechSynthesizer.WordBoundary -= handler;
            speechSynthesizer.SynthesisCompleted -= completionHandler;
        }
    }
}
I am also attaching a screenshot which shows that a word boundary event was received after the call to .SpeakTextAsync(string text) had been awaited.
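To make the kind of guarantee I am after more concrete, here is a minimal sketch of one direction that might work. It collects the events in a list local to a single call (so the audio and the marks come from the same synthesis) and, after awaiting SpeakTextAsync, additionally waits for the SynthesisCompleted or SynthesisCanceled event through a TaskCompletionSource. It rests on an unverified assumption: that the SDK raises SynthesisCompleted only after the last WordBoundary event. The names SpeechMarkCollector, SynthesizeWithMarksAsync and WordMark, as well as the 2-second grace period, are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Hypothetical DTO for this sketch; it stands in for SpeechMarkDTO.
public sealed record WordMark(string Text, ulong AudioOffsetTicks, uint TextOffset, uint WordLength);

public static class SpeechMarkCollector
{
    public static async Task<(byte[] Audio, List<WordMark> Marks)> SynthesizeWithMarksAsync(
        SpeechConfig speechConfig, string text)
    {
        // State is local to this call, so concurrent requests cannot mix their events.
        var marks = new List<WordMark>();
        var gate = new object();
        var finished = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

        // null AudioConfig: keep the audio in memory instead of playing it.
        using var synthesizer = new SpeechSynthesizer(speechConfig, null as AudioConfig);

        EventHandler<SpeechSynthesisWordBoundaryEventArgs> onWord = (_, e) =>
        {
            // Events may arrive on an SDK thread, so guard the list.
            lock (gate)
            {
                marks.Add(new WordMark(e.Text, e.AudioOffset, e.TextOffset, e.WordLength));
            }
        };
        EventHandler<SpeechSynthesisEventArgs> onFinished = (_, _) => finished.TrySetResult(true);

        synthesizer.WordBoundary += onWord;
        synthesizer.SynthesisCompleted += onFinished;
        synthesizer.SynthesisCanceled += onFinished;
        try
        {
            using var result = await synthesizer.SpeakTextAsync(text);
            if (result.Reason != ResultReason.SynthesizingAudioCompleted)
            {
                throw new InvalidOperationException($"Synthesis ended with reason {result.Reason}.");
            }

            // ASSUMPTION (unverified): the completion event fires after the last
            // WordBoundary event. The wait is bounded by a hypothetical 2-second
            // grace period so the request cannot hang if the event never arrives.
            await Task.WhenAny(finished.Task, Task.Delay(TimeSpan.FromSeconds(2)));

            lock (gate)
            {
                // Return a snapshot so late events cannot mutate the caller's list.
                return (result.AudioData, new List<WordMark>(marks));
            }
        }
        finally
        {
            synthesizer.WordBoundary -= onWord;
            synthesizer.SynthesisCompleted -= onFinished;
            synthesizer.SynthesisCanceled -= onFinished;
        }
    }
}
```

If the ordering assumption does not hold, this sketch only narrows the window rather than closing it, which is exactly what I would like someone familiar with the SDK to confirm or correct.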
Grateful for all the help in advance!