Hi!
I'm trying to use Azure's text-to-speech service for my use case.
Ultimately, I need to fetch the voiceover audio data for a given text, and also collect the word boundary data so I can add subtitles.
I've integrated both the SDK and the REST endpoint. I have two questions:
- Can I use the SDK to fetch the audio data without having the synthesizer speak out loud? And can I get the word boundary data (synthesizer.addSynthesisWordBoundaryEventHandler) as some kind of JSON return value, without waiting for the synthesizer to speak out loud?
- Is it possible to get this word boundary data using just the REST endpoint? Maybe through a header or something?
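To make the "JSON return type" part concrete, this is roughly the shape I'd like to end up with for subtitles. It is only a sketch: `WordBoundary`, `makeWordBoundary`, and `subtitleJSON` are names I made up (none of them are SDK types), and I'm assuming `audioOffset` is reported in 100-nanosecond ticks and `duration` in seconds, which would need checking against what the SDK actually delivers.

```swift
import Foundation

// Hypothetical container for one word boundary event; not an SDK type.
struct WordBoundary: Codable, Equatable {
    let text: String
    let startSeconds: Double
    let durationSeconds: Double
    let textOffset: Int
}

// Assumption: the audio offset is expressed in 100-nanosecond ticks.
let ticksPerSecond = 10_000_000.0

// Converts the raw event values into a subtitle-friendly record.
func makeWordBoundary(text: String,
                      audioOffsetTicks: UInt,
                      durationSeconds: Double,
                      textOffset: UInt) -> WordBoundary {
    return WordBoundary(text: text,
                        startSeconds: Double(audioOffsetTicks) / ticksPerSecond,
                        durationSeconds: durationSeconds,
                        textOffset: Int(textOffset))
}

// Encodes the collected boundaries as a JSON array string.
func subtitleJSON(from boundaries: [WordBoundary]) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    let data = try encoder.encode(boundaries)
    return String(decoding: data, as: UTF8.self)
}
```

The idea would be to append one `makeWordBoundary(...)` value per word boundary event to an array, then call `subtitleJSON(from:)` once synthesis has finished.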
This is my working request using the REST endpoint:
var request = URLRequest(url: endpoint)
request.httpMethod = "POST"
request.addValue(apiKey, forHTTPHeaderField: "Ocp-Apim-Subscription-Key")
request.addValue("application/ssml+xml", forHTTPHeaderField: "Content-Type")
request.addValue("riff-48khz-16bit-mono-pcm", forHTTPHeaderField: "X-Microsoft-OutputFormat") // alternative format: audio-24khz-160kbitrate-mono-mp3
request.addValue("MyAppsName", forHTTPHeaderField: "User-Agent")
let ssml = """
<speak version='1.0' xml:lang='en-US'>
<voice name='\(voice.id)'>
<prosody rate="+30.00%" pitch="-30Hz">
\(story.title) <break time="600ms" /> <bookmark mark='title_end'/> <break time="600ms" />
\(story.body)
</prosody>
</voice>
</speak>
"""
request.httpBody = ssml.data(using: .utf8)
And this is how I'm currently using the SDK:
let ssml = """
<speak version='1.0' xml:lang='en-US'>
<voice name='\(voice.id)'>
<prosody rate="+30.00%" pitch="-30Hz">
\(story.title) <break time="600ms" /> <bookmark mark='title_end'/> <break time="600ms" />
\(story.body)
</prosody>
</voice>
</speak>
"""
var speechConfig: SPXSpeechConfiguration?
do {
    try speechConfig = SPXSpeechConfiguration(subscription: apiKey, region: "eastus")
    speechConfig?.requestWordLevelTimestamps()
} catch {
    print("error \(error) happened")
    speechConfig = nil
}
let synthesizer = try! SPXSpeechSynthesizer(speechConfig!)
// Log each word boundary event as it arrives.
synthesizer.addSynthesisWordBoundaryEventHandler { synthesizer, eventArg in
    print("-->", eventArg.text, eventArg.audioOffset, eventArg.boundaryType, eventArg.duration, eventArg.textOffset)
}
// Log the bookmark placed after the title in the SSML.
synthesizer.addBookmarkReachedEventHandler { synthesizer, eventArg in
    print(eventArg.text, eventArg.audioOffset)
}
let result = try! synthesizer.speakSsml(ssml)
if result.reason == SPXResultReason.canceled {
    let cancellationDetails = try! SPXSpeechSynthesisCancellationDetails(fromCanceledSynthesisResult: result)
    print("cancelled, detail: \(cancellationDetails.errorDetails!) ")
}
// Save the synthesized audio to a temporary file and pass its URL to the completion handler.
if let data = result.audioData {
    FileManager.default.write(data: data, to: .currentVoiceover, storage: .temporary) { error, url in
        guard let url = url, error == nil else {
            return
        }
        completion(nil, url)
    }
}
Thanks!