I understand your issue: you're looking to generate captions for Text-to-Speech output.
Currently, Speech Studio in Azure does not provide a direct way to export caption files with voice timings. However, you can achieve this by using the Azure Speech SDK to capture word timings and manually generate caption files.
Here’s a simple approach:
1. Use Azure Speech SDK: Write a script using the SDK to synthesize speech and capture word-level timings.
2. Generate Caption Files: After capturing the timings, you can format the output as .srt or another caption format by storing the timing of each word spoken.
This method allows you to create captions with the desired timings, even though Speech Studio doesn’t support it directly.
Here is a sample Python script that captures the word timings:
import azure.cognitiveservices.speech as speechsdk
# Set up Azure Speech configuration
subscription_key = "<KEY>"
region = "<REGION>"
speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
# Create a list to store word timings as (start_ms, end_ms, text)
captions = []
# Function to handle word boundary events and capture timings
def word_boundary_callback(event):
    # audio_offset marks where the word starts, in 100-nanosecond ticks
    start_time = event.audio_offset / 10000  # Convert ticks to milliseconds
    # duration is a timedelta covering the spoken word (recent SDK versions)
    end_time = start_time + event.duration.total_seconds() * 1000
    captions.append((start_time, end_time, event.text))
# Create a Speech Synthesizer and attach word boundary event handler
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
synthesizer.synthesis_word_boundary.connect(word_boundary_callback)
# Text to synthesize into speech
text = "This is a caption example from Azure Text to Speech."
# Synthesize speech and capture word timings
synthesizer.speak_text_async(text).get()
# Display captured timings
for caption in captions:
    print(f"Start: {caption[0]} ms, End: {caption[1]} ms, Text: '{caption[2]}'")
Here is the output screenshot
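To turn the captured timings into an .srt file (step 2 above), something like the following could work. This is only a minimal sketch: it assumes the timings are a list of (start_ms, end_ms, text) tuples like the captions list in the script, and the helper names (ms_to_srt_timestamp, group_words, to_srt) and the max_words value are made up for illustration, not part of the Azure SDK. It groups a few words into each caption and attaches punctuation tokens to the previous word, in case the SDK reports them as separate events.

```python
PUNCTUATION = set(".,!?;:")


def ms_to_srt_timestamp(ms):
    """Format a millisecond value as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(ms))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    seconds, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


def group_words(words, max_words=7):
    """Merge consecutive word tuples into cues of at most max_words tokens."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = ""
        for _, _, token in chunk:
            # Put a space before every token except punctuation
            if text and token not in PUNCTUATION:
                text += " "
            text += token
        # The cue spans from the first word's start to the last word's end
        cues.append((chunk[0][0], chunk[-1][1], text))
    return cues


def to_srt(cues):
    """Render cues as SRT text: index, time range, caption, blank line."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        time_range = f"{ms_to_srt_timestamp(start)} --> {ms_to_srt_timestamp(end)}"
        blocks.append(f"{index}\n{time_range}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Illustrative timings only, not real output from the service
    sample = [(0, 200, "This"), (200, 350, "is"), (350, 900, "fine"), (900, 950, ".")]
    print(to_srt(group_words(sample, max_words=2)))
```

In the script above, you could call to_srt(group_words(captions)) after synthesis finishes and write the returned string to a file such as captions.srt, opened with encoding="utf-8".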
I hope this helps. Thank you.