Captions from Text to Speech?

Dipazo, David (ELS-HBE) 0 Reputation points
2024-09-20T16:08:36.1133333+00:00

I've created several paragraphs of synthesized speech from some text, and now I want to export caption files with the voice timings. Is that possible from the Text to Speech area of Speech Studio?

Thanks!

Dave

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

1 answer

  1. Pavankumar Purilla 345 Reputation points Microsoft Vendor
    2024-09-23T08:17:57.16+00:00

    Hi Dipazo, David (ELS-HBE),

    I understand that you want to generate caption files for your Text-to-Speech output.

    Currently, Speech Studio in Azure does not provide a direct way to export caption files with voice timings. However, you can achieve this by using the Azure Speech SDK to capture word timings and manually generate caption files.

    Here’s a simple approach:

    1. Use the Azure Speech SDK: Write a script that synthesizes speech and captures word-level timings.

    2. Generate caption files: After capturing the timings, format the output as .srt or another caption format, using the start and end time recorded for each word.

    This method lets you create captions with the desired timings, even though Speech Studio doesn’t support it directly.

    The following Python script demonstrates this approach:

    import azure.cognitiveservices.speech as speechsdk
    
    # Set up Azure Speech configuration
    subscription_key = "<KEY>"
    region = "<REGION>"
    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
    
    # List of (start_ms, end_ms, text) tuples, one per word boundary event
    captions = []
    
    # Function to handle word boundary events and capture timings
    def word_boundary_callback(event):
        # audio_offset marks where this word starts, in 100-nanosecond ticks
        start_ms = event.audio_offset / 10000  # Convert ticks to milliseconds
        # duration is a datetime.timedelta (available in recent Speech SDK versions)
        end_ms = start_ms + event.duration.total_seconds() * 1000
        captions.append((start_ms, end_ms, event.text))
    
    # Create a Speech Synthesizer and attach word boundary event handler
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    synthesizer.synthesis_word_boundary.connect(word_boundary_callback)
    
    # Text to synthesize into speech
    text = "This is a caption example from Azure Text to Speech."
    
    # Synthesize speech and capture word timings
    synthesizer.speak_text_async(text).get()
    
    # Display captured timings 
    for caption in captions:
        print(f"Start: {caption[0]} ms, End: {caption[1]} ms, Text: '{caption[2]}'")
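
    The script above only prints the timings. As a minimal sketch of step 2, the helper below turns a list of (start, end, text) tuples in milliseconds into SRT text. The function names `format_srt_time` and `captions_to_srt`, the `words_per_cue` parameter, and the sample timings are assumptions made for illustration; they are not part of the Speech SDK.

```python
def format_srt_time(ms):
    """Format a time in milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(ms))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def captions_to_srt(captions, words_per_cue=7):
    """Group (start_ms, end_ms, text) tuples into numbered SRT cues."""
    cues = []
    for i in range(0, len(captions), words_per_cue):
        group = captions[i:i + words_per_cue]
        start = format_srt_time(group[0][0])   # start of the first word in the cue
        end = format_srt_time(group[-1][1])    # end of the last word in the cue
        text = " ".join(word for _, _, word in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues
    return "\n".join(cues)


# Hypothetical timings standing in for the list captured by the callback
sample = [
    (50.0, 250.0, "This"),
    (250.0, 400.0, "is"),
    (400.0, 450.0, "a"),
    (450.0, 900.0, "caption"),
]
srt_text = captions_to_srt(sample, words_per_cue=2)
print(srt_text)
```

    To use it with the script above, pass the `captions` list instead of `sample` and write the returned string to a file such as `captions.srt`. Word boundary events can also report punctuation as separate entries, so you may want to attach those to the preceding word before joining.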
    

    [Screenshot: output of the script]

    I hope this helps. Thank you.
