In the text-to-speech SDK for iOS, am I able to fetch the audio data without having the synthesizer speak?

Sid Sadel 40 Reputation points
2023-08-31T19:47:17.3266667+00:00

Hi!

I'm currently trying to use Azure's text-to-speech service to fit my use case.

Ultimately, I need to fetch the voiceover audio data for a given text, as well as gather the word boundary data so I can apply subtitles.

I've integrated both the SDK and the REST endpoints. I have two questions:

  1. Am I able to use the SDK to fetch audio data without having the synthesizer speak? And am I able to grab word boundary data (synthesizer.addSynthesisWordBoundaryEventHandler) in some JSON return type without waiting for the synthesizer to speak out loud?
  2. Is it possible to get this word boundary data using just the REST endpoints? Maybe a header or something?

This is my working request using the REST endpoint:

var request = URLRequest(url: endpoint)
request.httpMethod = "POST"
request.addValue(apiKey, forHTTPHeaderField: "Ocp-Apim-Subscription-Key")
request.addValue("application/ssml+xml", forHTTPHeaderField: "Content-Type")
request.addValue("riff-48khz-16bit-mono-pcm", forHTTPHeaderField: "X-Microsoft-OutputFormat") // audio-24khz-160kbitrate-mono-mp3
request.addValue("MyAppsName", forHTTPHeaderField: "User-Agent")

let ssml = """
    <speak version='1.0' xml:lang='en-US'>
        <voice name='\(voice.id)'>
            <prosody rate="+30.00%" pitch="-30Hz">
                \(story.title) <break time="600ms" /> <bookmark mark='title_end'/> <break time="600ms" />
                \(story.body)
            </prosody>
        </voice>
    </speak>
    """
request.httpBody = ssml.data(using: .utf8)

and this is how I'm currently using the SDK:

let ssml = """
    <speak version='1.0' xml:lang='en-US'>
        <voice name='\(voice.id)'>
            <prosody rate="+30.00%" pitch="-30Hz">
                \(story.title) <break time="600ms" /> <bookmark mark='title_end'/> <break time="600ms" />
                \(story.body)
            </prosody>
        </voice>
    </speak>
    """

var speechConfig: SPXSpeechConfiguration?
do {
    speechConfig = try SPXSpeechConfiguration(subscription: apiKey, region: "eastus")
    speechConfig?.requestWordLevelTimestamps()
} catch {
    print("error \(error) happened")
    speechConfig = nil
}

let synthesizer = try! SPXSpeechSynthesizer(speechConfig!)

synthesizer.addSynthesisWordBoundaryEventHandler { synthesizer, eventArg in
    print("-->", eventArg.text, eventArg.audioOffset, eventArg.boundaryType, eventArg.duration, eventArg.textOffset)
}
synthesizer.addBookmarkReachedEventHandler { synthesizer, eventArg in
    print(eventArg.text, eventArg.audioOffset)
}

let result = try! synthesizer.speakSsml(ssml)

if result.reason == SPXResultReason.canceled {
    let cancellationDetails = try! SPXSpeechSynthesisCancellationDetails(fromCanceledSynthesisResult: result)
    print("cancelled, detail: \(cancellationDetails.errorDetails!)")
}

if let data = result.audioData {
    FileManager.default.write(data: data, to: .currentVoiceover, storage: .temporary) { error, url in
        guard let url = url, error == nil else {
            return
        }
        completion(nil, url)
    }
}

thanks!

Azure AI Speech

Accepted answer

  romungi-MSFT 41,856 Reputation points, Microsoft Employee
  2023-09-01T10:43:03.91+00:00

    @Sid Sadel I think for the first scenario you want the output but just don't want the synthesizer to play the audio through your default speaker. For this scenario, you would need to set your audio configuration to a file output (for example, SPXAudioConfiguration(wavFileOutput:)) instead of the default speaker output, or pass a nil audio configuration so the audio is returned only in the result. See this sample from the GitHub SDK sample repo.
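A minimal sketch of that approach, based on the SDK samples (the `SPXSpeechSynthesizer(speechConfiguration:audioConfiguration:)` initializer with a nil audio configuration is the assumption here): nothing is played through the speaker, the word-boundary events still fire during the synchronous `speakSsml` call, and the audio bytes come back in `result.audioData`.

```swift
import MicrosoftCognitiveServicesSpeech

// Sketch: synthesize to memory only, collecting word boundaries along the way.
// `ssml`, `apiKey`, and `region` are assumed to be set up as in the question.
func synthesizeSilently(ssml: String, apiKey: String, region: String) throws
    -> (audio: Data, boundaries: [(text: String, audioOffsetTicks: UInt64)]) {

    let config = try SPXSpeechConfiguration(subscription: apiKey, region: region)
    config.requestWordLevelTimestamps()

    // nil audio configuration: no speaker output, audio only in the result.
    let synthesizer = try SPXSpeechSynthesizer(speechConfiguration: config,
                                               audioConfiguration: nil)

    var boundaries: [(text: String, audioOffsetTicks: UInt64)] = []
    synthesizer.addSynthesisWordBoundaryEventHandler { _, e in
        // Offsets are in ticks (100-nanosecond units); convert as needed for subtitles.
        boundaries.append((text: e.text, audioOffsetTicks: e.audioOffset))
    }

    // speakSsml blocks until synthesis finishes, so boundaries is complete afterwards.
    let result = try synthesizer.speakSsml(ssml)
    guard result.reason != .canceled, let audio = result.audioData else {
        let details = try SPXSpeechSynthesisCancellationDetails(fromCanceledSynthesisResult: result)
        throw NSError(domain: "tts", code: 1,
                      userInfo: [NSLocalizedDescriptionKey: details.errorDetails ?? "canceled"])
    }
    return (audio, boundaries)
}
```

From the returned tuple you can write `audio` to a file as in your existing code and serialize `boundaries` to JSON yourself, which covers the "JSON return type" part of the question on the client side.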

    For the second scenario, word boundary data can be enabled via REST, but only with the batch synthesis API. The property to pass in the request body is "wordBoundaryEnabled": true.
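For illustration, a batch synthesis request body could look like the following. This is a hedged sketch: `wordBoundaryEnabled` is the property named above, while the surrounding field names and values are placeholders that should be checked against the batch synthesis documentation.

```json
{
  "displayName": "voiceover-batch",
  "textType": "PlainText",
  "inputs": [
    { "text": "Story title and body text here." }
  ],
  "synthesisConfig": {
    "voice": "en-US-JennyNeural"
  },
  "properties": {
    "outputFormat": "riff-48khz-16bit-mono-pcm",
    "wordBoundaryEnabled": true
  }
}
```

When the job completes, the word boundary timings are delivered as part of the synthesis output rather than as a response header, so this is an asynchronous flow rather than a drop-in replacement for the single-request endpoint in the question.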

    See this page for reference. I hope this helps!!

    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

