Captions from Text to Speech?

Question

I've created several paragraphs of speech from some text, and now I want to export caption files with the voice timings. Is that possible from the Text to Speech area of Speech Studio?

Thanks!

Dave

Answer

Hi Dipazo, David (ELS-HBE),

I understand that you're looking to generate captions for Text-to-Speech output.

Currently, Speech Studio in Azure does not provide a direct way to export caption files with voice timings. However, you can achieve this by using the Azure Speech SDK to capture word timings and manually generate caption files.

Here’s a simple approach:

1. Use Azure Speech SDK: Write a script using the SDK to synthesize speech and capture word-level timings.

2. Generate Caption Files: After capturing the timings, format the output as .srt or another caption format, using the stored start and end time of each word.

This method allows you to create captions with the desired timings, even though Speech Studio doesn’t support it directly.

Here is sample Python code that captures the word timings:

import azure.cognitiveservices.speech as speechsdk

# Set up Azure Speech configuration
subscription_key = ""  # Your Speech resource key
region = ""  # Your Speech resource region
speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)

# Create a list to store (start_ms, end_ms, text) tuples, one per word
captions = []

# Function to handle word boundary events and capture timings
def word_boundary_callback(event):
    # audio_offset marks where the word starts, in 100-nanosecond ticks
    start_time = event.audio_offset / 10000  # Convert ticks to milliseconds
    # duration is a timedelta in recent versions of the Python SDK
    end_time = start_time + event.duration.total_seconds() * 1000
    captions.append((start_time, end_time, event.text))

# Create a Speech Synthesizer and attach word boundary event handler
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
synthesizer.synthesis_word_boundary.connect(word_boundary_callback)

# Text to synthesize into speech
text = "This is a caption example from Azure Text to Speech."

# Synthesize speech and capture word timings
synthesizer.speak_text_async(text).get()

# Display captured timings 
for caption in captions:
    print(f"Start: {caption[0]} ms, End: {caption[1]} ms, Text: '{caption[2]}'")

Here is the output screenshot:
[image]
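
The script above covers step 1 only. For step 2, a minimal sketch of turning the captured timings into .srt text could look like the following. It assumes the timings are (start_ms, end_ms, text) tuples like the ones stored in captions. The names format_timestamp and words_to_srt, the PUNCTUATION set, and the default of 7 words per caption are illustrative choices, not part of the Speech SDK.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]  # (start_ms, end_ms, text), as stored in captions

# Assumption: punctuation arrives as separate single-character entries
PUNCTUATION = set(".,!?;:")


def format_timestamp(ms: float) -> str:
    """Format milliseconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(ms))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group consecutive words into cues of at most max_words and render SRT text."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start = group[0][0]   # start of the first word in the cue
        end = group[-1][1]    # end of the last word in the cue
        text = ""
        for _, _, token in group:
            # Attach punctuation to the previous word instead of spacing it out
            if text and token not in PUNCTUATION:
                text += " "
            text += token
        index = len(cues) + 1  # SRT cue numbers start at 1
        cues.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}"
        )
    return "\n\n".join(cues) + "\n" if cues else ""


if __name__ == "__main__":
    sample = [(50.0, 300.0, "Hello"), (300.0, 350.0, ","), (400.0, 900.0, "world")]
    print(words_to_srt(sample))
```

To use it with the script above, pass the captions list to words_to_srt and write the returned string to a file with a .srt extension. Grouping by a fixed word count is the simplest option; splitting at sentence punctuation or at a maximum duration may read better for longer text.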

I hope this helps. Thank you.
