Speech-To-Text sends back recognised text every 30s, how can I extend this duration?

Question

Ye Yutong 0

I am using Microsoft.CognitiveServices.Speech library on Unity.

Desired behaviour: When the button is clicked, speech-to-text recognition starts if it is not already running, and stops if it is.

Actual behaviour: When I speak for longer than 30s, the message only contains the first 30s of what was spoken. Once the next 30s has been recognised, that text replaces the first 30s of text. The message only ever holds about 30s of speech at a time, even though the button has not been pressed again to stop recognition. Recognition itself succeeds and is not cancelled: result.Reason is always RecognizedSpeech.

How can I have more than 30s of speech recognised in one go?

using UnityEngine;
using UnityEngine.UI;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using TMPro;
public class SpeechToText : MonoBehaviour
{
    public TextMeshProUGUI outputText;
    public Button startRecordButton;
    SpeechRecognizer recognizer;
    SpeechConfig speechConfig;
    AudioConfig audioConfig;
    private object threadLocker = new object();
    private bool speechStarted = false;
    public string message;
    public bool sentenceIsRecognized = false;
    private void RecognizedHandler(object sender, SpeechRecognitionEventArgs e)
    {
        lock (threadLocker)
        {
            message = e.Result.Text;
            Debug.Log("threadlocker message :" + message);
            Debug.Log("Cancellation reason:" + e.Result.Reason);
        }
        
    }
    public async void ButtonClick()
    {
        if (ChatGPTTester.executeCount >= 2) {
            if (speechStarted)
            {
                await recognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);
                lock(threadLocker)
                {
                    speechStarted = false;
                    Debug.Log("STT message :" + message);
                    
                    sentenceIsRecognized = true;
                }
            }
            else
            {
                await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
                lock (threadLocker)
                {
                    speechStarted = true;
                }
            }
        }
    }
    void Start()
    {
        startRecordButton.onClick.AddListener(ButtonClick);
        speechConfig = SpeechConfig.FromSubscription("", "");
        speechConfig.SetProperty(PropertyId.Speech_SegmentationSilenceTimeoutMs, "4500");
        audioConfig = AudioConfig.FromDefaultMicrophoneInput();
        recognizer = new SpeechRecognizer(speechConfig, audioConfig);
        recognizer.Recognized += RecognizedHandler;
    }
    void Update()
    {
        lock (threadLocker)
        {
            if (outputText != null)
            {
                outputText.text = message;
            }
        }
    }
}

romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator

2023-03-09T09:03:08.8366667+00:00

Ye Yutong If you are using StartContinuousRecognitionAsync(), the recognized speech should contain all the speech up to the point where StopContinuousRecognitionAsync() is called. In your case, although you call both methods, the speech config is set in Start() and not in ButtonClick(). I think the recognizer is not being initialized correctly, so you are hitting the default timeout limits when you click the button to start/stop, and speech input is being captured whenever the application is running.

Please see this issue which explains the various timeout limits and how they are set under the hood with events to identify what is being detected.

I think you can check this sample from the SDK repo: define the config in ButtonClick(), then use start and stop continuous recognition with the bool speechStarted to record audio for the required duration. I hope this helps!!
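One possible cause, which this thread does not confirm: in continuous recognition the SDK raises the Recognized event once per finalized phrase rather than once per session, so a handler that assigns `message = e.Result.Text` keeps only the most recent phrase. The roughly 30s chunks described in the question would be consistent with that. Below is a minimal sketch, assuming this is what is happening, that appends each phrase instead of overwriting. `TranscriptCollector`, `GetTranscript` and `Reset` are hypothetical names for illustration, not part of the Speech SDK.

```csharp
using System.Text;
using Microsoft.CognitiveServices.Speech;

// Hypothetical helper (not part of the Speech SDK): collects every finalized
// phrase from a continuous-recognition session into one transcript.
public class TranscriptCollector
{
    private readonly object threadLocker = new object();
    private readonly StringBuilder transcript = new StringBuilder();

    // Subscribe with: recognizer.Recognized += collector.RecognizedHandler;
    public void RecognizedHandler(object sender, SpeechRecognitionEventArgs e)
    {
        // Skip NoMatch results and empty phrases.
        if (e.Result.Reason != ResultReason.RecognizedSpeech ||
            string.IsNullOrEmpty(e.Result.Text))
        {
            return;
        }

        lock (threadLocker)
        {
            // Append rather than assign, so earlier phrases are kept.
            if (transcript.Length > 0)
            {
                transcript.Append(' ');
            }
            transcript.Append(e.Result.Text);
        }
    }

    // Returns everything recognized so far. Intended to be read after
    // StopContinuousRecognitionAsync() has completed.
    public string GetTranscript()
    {
        lock (threadLocker)
        {
            return transcript.ToString();
        }
    }

    // Clears the transcript before a new session starts.
    public void Reset()
    {
        lock (threadLocker)
        {
            transcript.Clear();
        }
    }
}
```

In the question's script, the same effect could be had by changing the assignment in RecognizedHandler to an append and clearing `message` when recognition starts.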
romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator

2023-03-10T16:35:58.3+00:00

Ye Yutong Did my earlier response help? Are you able to start/stop the recording correctly and capture the complete speech?
Ye Yutong 0 Reputation points

2023-03-12T19:24:05.3333333+00:00
Emm... I moved this part of the code to the start of ButtonClick():

speechConfig = SpeechConfig.FromSubscription("", "");
speechConfig.SetProperty(PropertyId.Speech_SegmentationSilenceTimeoutMs, "4500");
audioConfig = AudioConfig.FromDefaultMicrophoneInput();
recognizer = new SpeechRecognizer(speechConfig, audioConfig);
recognizer.Recognized += RecognizedHandler;

But the 30s problem still persists. I have also looked through the sample from the SDK repo you sent and tried to adapt the code to StartContinuousRecognition, but the 30s problem still persists in the new code. I am quite new to this, so I might require a bit more help in solving this issue...

Ye Yutong 0

In my original code, I tried moving the speech config and audio config code from Start() to the start of ButtonClick(), but the problem still persists. I've also looked through the SDK repo and tried to adapt the sample to use StartContinuousRecognition, but in the new code (attached below) the 30s problem still seems to persist. I am quite new to this, so I might require a bit more help in solving this issue...

using UnityEngine;
using UnityEngine.UI;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using TMPro;

public class SpeechToText : MonoBehaviour
{
    // Hook up the two properties below with a Text and Button object in your UI.
    public TextMeshProUGUI outputText;
    public Button startRecoButton;

    private object threadLocker = new object();
    private bool waitingForReco;
    public string message;
    private bool speechStarted = false;
    SpeechRecognizer recognizer;
    SpeechConfig speechConfig;
    AudioConfig audioConfig;
    public bool sentenceIsRecognized = false;

    private bool micPermissionGranted = false;

    private void RecognizedHandler(object sender, SpeechRecognitionEventArgs e)
    {
        string newMessage = string.Empty;
        if (e.Result.Reason == ResultReason.RecognizedSpeech)
        {
            newMessage = e.Result.Text;
        }
        else if (e.Result.Reason == ResultReason.NoMatch)
        {
            newMessage = "NOMATCH: Speech could not be recognized.";
        }
        else if (e.Result.Reason == ResultReason.Canceled)
        {
            var cancellation = CancellationDetails.FromResult(e.Result);
            newMessage = $"CANCELED: Reason={cancellation.Reason} ErrorDetails={cancellation.ErrorDetails}";
        }

        lock (threadLocker)
        {
            message = newMessage;
            waitingForReco = false;
            sentenceIsRecognized = true;
        }
    }

    public async void ButtonClick()
    {
        speechConfig = SpeechConfig.FromSubscription("", "");
        speechConfig.SetProperty(PropertyId.Speech_SegmentationSilenceTimeoutMs, "4500");
        audioConfig = AudioConfig.FromDefaultMicrophoneInput();
        recognizer = new SpeechRecognizer(speechConfig, audioConfig);
        recognizer.Recognized += RecognizedHandler;
          
        if (ChatGPTTester.executeCount >= 2) {
            if (!waitingForReco)
            {
                await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
                lock (threadLocker)
                {
                    waitingForReco = true;
                }

            } else {
                await recognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);
            }
        }
    }

    void Start()
    {
        if (outputText == null)
        {
            UnityEngine.Debug.LogError("outputText property is null! Assign a UI Text element to it.");
        }
        else if (startRecoButton == null)
        {
            message = "startRecoButton property is null! Assign a UI Button to it.";
            UnityEngine.Debug.LogError(message);
        }
        else
        {
            // Continue with normal initialization, Text and Button objects are present.
            micPermissionGranted = true;
            message = "Click button to recognize speech";
            startRecoButton.onClick.AddListener(ButtonClick);
        }
    }

    void Update()
    {
        lock (threadLocker)
        {
            if (startRecoButton != null)
            {
                startRecoButton.interactable = !waitingForReco && micPermissionGranted;
            }
            if (outputText != null)
            {
                outputText.text = message;
            }
        }
    }
}

romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator

2023-03-14T05:48:51.0766667+00:00

Apologies for my delayed response. From what I see in the snippet, it should work fine, since this is the sequence that should kick off irrespective of which recognizer method is called:

Start() => Create recognizer and subscribe to events (e.g. recognizing & recognized)
OnDisable() => Unsubscribe from events and dispose recognizer
ButtonClick() => Do StartContinuousRecognitionAsync and StopContinuousRecognitionAsync action
RecognizedHandler & Update() => Update message text in the UI

It might be easier to raise an issue with the SDK team on the SDK repo, including details such as the SDK version, to find out whether this behavior comes from the SDK or from your setup. Thanks!!
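The sequence listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not a confirmed fix for the 30s behaviour. The recognizer is created once rather than on every click, so that the stop call reaches the same instance that was started. Teardown is placed in Unity's OnDestroy() callback as one reasonable choice. Recognized phrases are appended rather than overwritten, on the assumption that Recognized fires once per phrase. `SpeechToTextLifecycle`, `subscriptionKey` and `region` are hypothetical names, and the `ChatGPTTester.executeCount` check from the question is left out.

```csharp
using System.Text;
using UnityEngine;
using UnityEngine.UI;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using TMPro;

public class SpeechToTextLifecycle : MonoBehaviour
{
    public TextMeshProUGUI outputText;
    public Button startRecordButton;

    // Hypothetical fields: set your own key and region in the Inspector.
    public string subscriptionKey;
    public string region;

    private readonly object threadLocker = new object();
    private readonly StringBuilder transcript = new StringBuilder();
    private SpeechRecognizer recognizer;
    private bool speechStarted;

    // Start() => create the recognizer once and subscribe to events.
    void Start()
    {
        var speechConfig = SpeechConfig.FromSubscription(subscriptionKey, region);
        speechConfig.SetProperty(PropertyId.Speech_SegmentationSilenceTimeoutMs, "4500");
        var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
        recognizer = new SpeechRecognizer(speechConfig, audioConfig);
        recognizer.Recognized += RecognizedHandler;
        startRecordButton.onClick.AddListener(ButtonClick);
    }

    // Teardown => unsubscribe from events and dispose the recognizer.
    void OnDestroy()
    {
        if (recognizer == null) return;
        recognizer.Recognized -= RecognizedHandler;
        recognizer.Dispose();
        recognizer = null;
    }

    // Runs on an SDK thread: append each finalized phrase under the lock.
    private void RecognizedHandler(object sender, SpeechRecognitionEventArgs e)
    {
        if (e.Result.Reason != ResultReason.RecognizedSpeech) return;
        lock (threadLocker)
        {
            transcript.Append(e.Result.Text).Append(' ');
        }
    }

    // ButtonClick() => toggle between start and stop on the same recognizer.
    public async void ButtonClick()
    {
        if (speechStarted)
        {
            await recognizer.StopContinuousRecognitionAsync();
            speechStarted = false;
        }
        else
        {
            lock (threadLocker) { transcript.Clear(); }
            await recognizer.StartContinuousRecognitionAsync();
            speechStarted = true;
        }
    }

    // Update() => copy the accumulated text into the UI on the main thread.
    void Update()
    {
        lock (threadLocker)
        {
            if (outputText != null) outputText.text = transcript.ToString();
        }
    }
}
```

Subscribing to the recognizer's Canceled and SessionStopped events as well would help show whether the session is actually ending at 30s or just producing a new phrase.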
