In the text-to-speech SDK for iOS, am I able to fetch the audio data without having the synthesizer speak?

Sid Sadel 40 Reputation points
2023-08-31T19:47:17.3266667+00:00

Hi!

I'm currently trying to use Azure's text-to-speech service to fit my use case.

Ultimately, I need to fetch the voiceover audio data for a given text, as well as gather the word boundary data so I can apply subtitles.

I've integrated both the SDK and the REST endpoints. I have two questions:

  1. Am I able to use the SDK to fetch audio data without having the synthesizer speak? And am I able to grab word boundary data (synthesizer.addSynthesisWordBoundaryEventHandler) in some JSON return type without waiting for the synthesizer to speak out loud?
  2. Is it possible to get this word boundary data using just the REST endpoints? Maybe a header or something?

This is my working request using the REST endpoint:

var request = URLRequest(url: endpoint)
request.httpMethod = "POST"
request.addValue(apiKey, forHTTPHeaderField: "Ocp-Apim-Subscription-Key")
request.addValue("application/ssml+xml", forHTTPHeaderField: "Content-Type")
request.addValue("riff-48khz-16bit-mono-pcm", forHTTPHeaderField: "X-Microsoft-OutputFormat") // audio-24khz-160kbitrate-mono-mp3
request.addValue("MyAppsName", forHTTPHeaderField: "User-Agent")

let ssml = """
    <speak version='1.0' xml:lang='en-US'>
        <voice name='\(voice.id)'>
            <prosody rate="+30.00%" pitch="-30Hz">
                \(story.title) <break time="600ms" /> <bookmark mark='title_end'/> <break time="600ms" />
                \(story.body)
            </prosody>
        </voice>
    </speak>
    """
request.httpBody = ssml.data(using: .utf8)

and this is how I'm currently using the SDK:

let ssml = """
    <speak version='1.0' xml:lang='en-US'>
        <voice name='\(voice.id)'>
            <prosody rate="+30.00%" pitch="-30Hz">
                \(story.title) <break time="600ms" /> <bookmark mark='title_end'/> <break time="600ms" />
                \(story.body)
            </prosody>
        </voice>
    </speak>
    """

var speechConfig: SPXSpeechConfiguration?
do {
    speechConfig = try SPXSpeechConfiguration(subscription: apiKey, region: "eastus")
    speechConfig?.requestWordLevelTimestamps()
} catch {
    print("error \(error) happened")
    speechConfig = nil
}

let synthesizer = try! SPXSpeechSynthesizer(speechConfig!)

synthesizer.addSynthesisWordBoundaryEventHandler { synthesizer, eventArg in
    print("-->", eventArg.text, eventArg.audioOffset, eventArg.boundaryType, eventArg.duration, eventArg.textOffset)
}
synthesizer.addBookmarkReachedEventHandler { synthesizer, eventArg in
    print(eventArg.text, eventArg.audioOffset)
}

let result = try! synthesizer.speakSsml(ssml)

if result.reason == SPXResultReason.canceled {
    let cancellationDetails = try! SPXSpeechSynthesisCancellationDetails(fromCanceledSynthesisResult: result)
    print("cancelled, detail: \(cancellationDetails.errorDetails!)")
}

if let data = result.audioData {
    FileManager.default.write(data: data, to: .currentVoiceover, storage: .temporary) { error, url in
        guard let url = url, error == nil else {
            return
        }
        completion(nil, url)
    }
}

thanks!

Azure AI Speech

Accepted answer

  romungi-MSFT 41,856 Reputation points, Microsoft Employee
  2023-09-01T10:43:03.91+00:00

    @Sid Sadel I think for the first scenario you want the output but just don't want the synthesizer to play the audio through your default speaker. For this scenario, you would need to set your audio configuration to a file output (for example, SPXAudioConfiguration(wavFileOutput:)) instead of the default speaker output, or pass a nil audio configuration so the audio is returned only in the result. See this sample from the GitHub SDK sample repo.
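A minimal sketch of that approach, based on the SDK samples (the `SPXSpeechSynthesizer(speechConfiguration:audioConfiguration:)` initializer with a nil audio configuration is the assumption here): nothing is played through the speaker, the word-boundary events still fire during the synchronous `speakSsml` call, and the audio bytes come back in `result.audioData`.

```swift
import MicrosoftCognitiveServicesSpeech

// Sketch: synthesize to memory only, collecting word boundaries along the way.
// `ssml`, `apiKey`, and `region` are assumed to be set up as in the question.
func synthesizeSilently(ssml: String, apiKey: String, region: String) throws
    -> (audio: Data, boundaries: [(text: String, audioOffsetTicks: UInt64)]) {

    let config = try SPXSpeechConfiguration(subscription: apiKey, region: region)
    config.requestWordLevelTimestamps()

    // nil audio configuration: no speaker output, audio only in the result.
    let synthesizer = try SPXSpeechSynthesizer(speechConfiguration: config,
                                               audioConfiguration: nil)

    var boundaries: [(text: String, audioOffsetTicks: UInt64)] = []
    synthesizer.addSynthesisWordBoundaryEventHandler { _, e in
        // Offsets are in ticks (100-nanosecond units); convert as needed for subtitles.
        boundaries.append((text: e.text, audioOffsetTicks: e.audioOffset))
    }

    // speakSsml blocks until synthesis finishes, so boundaries is complete afterwards.
    let result = try synthesizer.speakSsml(ssml)
    guard result.reason != .canceled, let audio = result.audioData else {
        let details = try SPXSpeechSynthesisCancellationDetails(fromCanceledSynthesisResult: result)
        throw NSError(domain: "tts", code: 1,
                      userInfo: [NSLocalizedDescriptionKey: details.errorDetails ?? "canceled"])
    }
    return (audio, boundaries)
}
```

From the returned tuple you can write `audio` to a file as in your existing code and serialize `boundaries` to JSON yourself, which covers the "JSON return type" part of the question on the client side.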

    For the second scenario, word boundary data can be enabled via REST, but only with the batch synthesis API. The property to pass in the request body is "wordBoundaryEnabled": true.
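For illustration, a batch synthesis request body could look like the following. This is a hedged sketch: `wordBoundaryEnabled` is the property named above, while the surrounding field names and values are placeholders that should be checked against the batch synthesis documentation.

```json
{
  "displayName": "voiceover-batch",
  "textType": "PlainText",
  "inputs": [
    { "text": "Story title and body text here." }
  ],
  "synthesisConfig": {
    "voice": "en-US-JennyNeural"
  },
  "properties": {
    "outputFormat": "riff-48khz-16bit-mono-pcm",
    "wordBoundaryEnabled": true
  }
}
```

When the job completes, the word boundary timings are delivered as part of the synthesis output rather than as a response header, so this is an asynchronous flow rather than a drop-in replacement for the single-request endpoint in the question.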

    See this page for reference. I hope this helps!!

    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

