How to configure disfluency removal using the REST API

KEN KIM 16 Reputation points
2023-01-28T11:49:33.1266667+00:00

I am using the speech-to-text REST API (Python) to do some research on fillers, pauses, and backtracking in Japanese (ja-JP).

Can I configure disfluency removal while using the Speech-to-text service? I need the verbatim text, with all the fillers, in the transcription.

Azure STT service currently automatically removes filler words, but I need to have those filler words such as "eto" or "ano" after I do STT.

Are there any options to allow me to include those filler words when displaying text?

Thank you in advance.

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

1 answer

  1. VasimTamboli 4,410 Reputation points
    2023-05-07T19:16:11.35+00:00

    As far as I know, the speech-to-text REST API for short audio does not document a parameter that turns disfluency removal on or off. "DiarizationEnabled" and "ProfanityFilterMode" will not help here: they are batch transcription properties rather than query parameters of this endpoint, diarization only separates speakers, and the profanity filter only affects profane words, not fillers such as "eto" or "ano".

    What you can do is request the detailed output format and inspect the "Lexical" field of each NBest entry. The "Display" form has capitalization, punctuation, and inverse text normalization applied, while the lexical form holds the words as recognized, so it is the most likely place to find fillers. Here's an example of how you can configure the REST API request to get both forms:

    import requests

    # Set the endpoint URL (short-audio REST API)
    url = 'https://<your-service-region>.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1'

    # Set the request headers (this Content-Type assumes 16 kHz PCM WAV audio)
    headers = {
        'Content-Type': 'audio/wav; codecs=audio/pcm; samplerate=16000',
        'Ocp-Apim-Subscription-Key': '<your-subscription-key>',
        'Accept': 'application/json',
    }

    # Set the query parameters: detailed output, no profanity processing
    params = {
        'language': 'ja-JP',
        'format': 'detailed',
        'profanity': 'raw',
    }

    # Read the audio file as binary data
    with open('audio_file.wav', 'rb') as file:
        data = file.read()

    # Send the REST API request and stop on HTTP errors
    response = requests.post(url, headers=headers, params=params, data=data)
    response.raise_for_status()

    # Parse the JSON response
    result = response.json()

    # Print the lexical and display forms of the top hypothesis
    if result.get('RecognitionStatus') == 'Success':
        best = result['NBest'][0]
        print('Lexical:', best['Lexical'])
        print('Display:', best['Display'])
    else:
        print('Recognition failed:', result.get('RecognitionStatus'))

    In the above example, the "format" parameter is set to "detailed" so that the response contains the NBest list with the "Lexical", "ITN", "MaskedITN", and "Display" forms, and the "profanity" parameter is set to "raw" so that no words are masked or removed. Neither parameter switches off disfluency removal; they only expose the least-processed text that the endpoint returns.

    Note that fillers may still be missing from the lexical form, since I cannot confirm that the service keeps all of them for ja-JP. If your research depends on every filler being transcribed, compare the output against a manually annotated sample of your audio before relying on it.
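To check how many fillers survive in each form, the parsed response can be run through a small counter. The following is only a sketch under stated assumptions: `FILLERS`, `count_fillers`, and `compare_forms` are hypothetical names, the filler list is illustrative, and the `sample` dictionary is a made-up response in the detailed format, not real service output. Matches are candidates only, since a string like "あの" can also be an ordinary demonstrative.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Hypothetical filler list for illustration; adapt it to your own study.
FILLERS = ["えーと", "あのー", "あの", "えー"]


def count_fillers(text: str, fillers: Iterable[str]) -> Counter:
    """Count non-overlapping filler candidates, preferring longer matches."""
    ordered = sorted(set(fillers), key=len, reverse=True)
    if not ordered:
        return Counter()
    # Longer alternatives come first so "あのー" is not counted as "あの".
    pattern = re.compile("|".join(re.escape(f) for f in ordered))
    return Counter(pattern.findall(text))


def compare_forms(result: Dict, fillers: Iterable[str]) -> Dict[str, Counter]:
    """Return filler counts for the Lexical and Display forms of the top hypothesis."""
    if result.get("RecognitionStatus") != "Success" or not result.get("NBest"):
        return {"Lexical": Counter(), "Display": Counter()}
    best = result["NBest"][0]
    filler_list = list(fillers)
    return {
        "Lexical": count_fillers(best.get("Lexical", ""), filler_list),
        "Display": count_fillers(best.get("Display", ""), filler_list),
    }


# Made-up response for illustration only (not real service output).
sample = {
    "RecognitionStatus": "Success",
    "NBest": [
        {
            "Lexical": "えーと 今日 は あのー 天気 が いい",
            "Display": "今日は天気がいい。",
        }
    ],
}

counts = compare_forms(sample, FILLERS)
print("Lexical:", dict(counts["Lexical"]))
print("Display:", dict(counts["Display"]))
```

Running the same comparison over a set of manually annotated recordings would show whether the lexical form keeps enough fillers for the research, or whether a different transcription route is needed.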