Is there a way to get begin time and end time for the conversion result of stream audio?

Question

klen 21

I am using azure-speech to recognize an audio stream, following speech_recognition_samples.cpp. From the RecognitionResult class I can only get Text and m_duration. How can I get the begin time and end time of each result within the audio?
I use azure-speech this way: I write the audio stream to an AudioInputStream and get the results from a SpeechRecognizer.

void SpeechContinuousRecognitionWithPushStream()
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");

    auto pushStream = AudioInputStream::CreatePushStream();

    auto audioInput = AudioConfig::FromStreamInput(pushStream);
    auto recognizer = SpeechRecognizer::FromConfig(config, audioInput);
    promise<void> recognitionEnd;

    recognizer->Recognizing.Connect([](const SpeechRecognitionEventArgs& e)
    {
        cout << "Recognizing:" << e.Result->Text << std::endl
            << "  Offset=" << e.Result->Offset() << std::endl
            << "  Duration=" << e.Result->Duration() << std::endl;
    });

    recognizer->Recognized.Connect([](const SpeechRecognitionEventArgs& e)
    {
        if (e.Result->Reason == ResultReason::RecognizedSpeech)
        {
            cout << "RECOGNIZED: Text=" << e.Result->Text << std::endl
                << "  Offset=" << e.Result->Offset() << std::endl
                << "  Duration=" << e.Result->Duration() << std::endl;
        }
        else if (e.Result->Reason == ResultReason::NoMatch)
        {
            cout << "NOMATCH: Speech could not be recognized." << std::endl;
        }
    });

    recognizer->Canceled.Connect([&recognitionEnd](const SpeechRecognitionCanceledEventArgs& e)
    {
        switch (e.Reason)
        {
        case CancellationReason::EndOfStream:
            cout << "CANCELED: Reach the end of the file." << std::endl;
            break;

        case CancellationReason::Error:
            cout << "CANCELED: ErrorCode=" << (int)e.ErrorCode << std::endl;
            cout << "CANCELED: ErrorDetails=" << e.ErrorDetails << std::endl;
            recognitionEnd.set_value();
            break;

        default:
            cout << "CANCELED: received unknown reason." << std::endl;
        }

    });

    recognizer->SessionStopped.Connect([&recognitionEnd](const SessionEventArgs& e)
    {
        cout << "Session stopped.";
        recognitionEnd.set_value(); // Notify to stop recognition.
    });

    WavFileReader reader("whatstheweatherlike.wav");

    vector<uint8_t> buffer(1000);

    recognizer->StartContinuousRecognitionAsync().wait();

    int readSamples = 0;
    while((readSamples = reader.Read(buffer.data(), (uint32_t)buffer.size())) != 0)
    {
        // Push a buffer into the stream
        pushStream->Write(buffer.data(), readSamples);
    }

    // Close the push stream.
    pushStream->Close();

    // Waits for recognition end.
    recognitionEnd.get_future().get();

    // Stops recognition.
    recognizer->StopContinuousRecognitionAsync().get();
}

1 answer

Answer 1

romungi-MSFT 48,911 Microsoft Employee Moderator

@klen You can request word-level timestamps by enabling that option in your speech config, in the same place where you set the subscription key and region. This does not explicitly give the begin and end time of a sentence, but the result includes the offset and duration of the sentence, plus the offset and duration of each word. A sample output looks like this:

{"Duration":13400000,"NBest":[{"Confidence":0.9761951565742493,"Display":"What's the weather like?","ITN":"What's the weather like","Lexical":"what's the weather like","MaskedITN":"What's the weather like","Words":[{"Duration":3800000,"Offset":600000,"Word":"what's"},{"Duration":1200000,"Offset":4500000,"Word":"the"},{"Duration":2900000,"Offset":5800000,"Word":"weather"},{"Duration":4700000,"Offset":8800000,"Word":"like"}]},{"Confidence":0.9245584011077881,"Display":"what is the weather like","ITN":"what is the weather like","Lexical":"what is the weather like","MaskedITN":"what is the weather like","Words":[{"Duration":2900000,"Offset":600000,"Word":"what"},{"Duration":700000,"Offset":3600000,"Word":"is"},{"Duration":1300000,"Offset":4400000,"Word":"the"},{"Duration":2900000,"Offset":5800000,"Word":"weather"},{"Duration":4700000,"Offset":8800000,"Word":"like"}]}],"Offset":400000,"RecognitionStatus":"Success"}
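
The Offset and Duration values in this JSON are counted in ticks of 100 nanoseconds, so a begin and end time per word can be derived with simple arithmetic. The sketch below hard-codes the words of the first NBest entry instead of parsing the JSON, and the names WordTiming, BeginMs, EndMs and SampleWords are hypothetical helpers, not part of the Speech SDK. In the C++ SDK the setting itself is, to my knowledge, SpeechConfig::RequestWordLevelTimestamps(), and the JSON is read from the result's Properties with PropertyId::SpeechServiceResponse_JsonResult; please verify both names against your SDK version.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One entry of the "Words" array in the detailed JSON result.
// Offset and Duration are expressed in ticks of 100 nanoseconds.
// (Hypothetical helper type, not an SDK class.)
struct WordTiming {
    std::string word;
    std::int64_t offsetTicks;
    std::int64_t durationTicks;
};

// 1 millisecond = 10,000 ticks of 100 ns.
constexpr std::int64_t kTicksPerMillisecond = 10000;

// Begin time of the word in milliseconds from the start of the audio.
inline std::int64_t BeginMs(const WordTiming& w) {
    return w.offsetTicks / kTicksPerMillisecond;
}

// End time of the word: begin offset plus duration, in milliseconds.
inline std::int64_t EndMs(const WordTiming& w) {
    assert(w.durationTicks >= 0);
    return (w.offsetTicks + w.durationTicks) / kTicksPerMillisecond;
}

// Words of the first NBest entry in the sample output, copied by hand.
inline std::vector<WordTiming> SampleWords() {
    return {
        {"what's", 600000, 3800000},
        {"the", 4500000, 1200000},
        {"weather", 5800000, 2900000},
        {"like", 8800000, 4700000},
    };
}
```

With these values, BeginMs and EndMs give 60 ms and 440 ms for "what's", and 880 ms and 1350 ms for "like". In a real program the WordTiming entries would be filled from the parsed JSON rather than typed in.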

klen 21 Reputation points

2021-01-25T10:02:32.873+00:00

I am still confused about how to use this function. Could you please give me a sample? Thanks.

klen 21

Why is the offset of every result always 6800000? I expected it to change from one result to the next, for example:
the begin offset of "my" is 0, and the end offset of "my" is 100000,
the begin offset of "my voice is" is 0, and the end offset of "my voice is" is 200000.

My output is:

Recognizing:my
  Offset=6800000
  Duration=2700000
Recognizing:my voice is
  Offset=6800000
  Duration=8500000
Recognizing:my voice is my
  Offset=6800000
  Duration=9800000
Recognizing:my voice is my passport
  Offset=6800000
  Duration=14400000
Recognizing:my voice is my passport verify me
  Offset=6800000
  Duration=26100000
RECOGNIZED: Text=My voice is my passport, verify me.
  Offset=6800000
  Duration=28100000
CANCELED: Reach the end of the file.
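
This log is consistent with Offset marking where the current phrase starts in the audio stream, which would explain why every interim result for the same phrase reports 6800000 while only Duration grows. Under that reading, the begin time is Offset and the end time is Offset + Duration, both in ticks of 100 nanoseconds. A minimal sketch of the conversion follows; TimeSpanMs and ToTimeSpan are hypothetical names introduced for illustration, not SDK APIs.

```cpp
#include <cassert>
#include <cstdint>

// 1 millisecond = 10,000 ticks of 100 ns.
constexpr std::uint64_t kTicksPerMs = 10000;

// Begin and end of a recognition result, in milliseconds
// from the start of the audio stream. (Hypothetical helper type.)
struct TimeSpanMs {
    std::uint64_t beginMs;
    std::uint64_t endMs;
};

// Converts the tick values printed by Offset() and Duration()
// into a begin/end pair: begin = offset, end = offset + duration.
inline TimeSpanMs ToTimeSpan(std::uint64_t offsetTicks, std::uint64_t durationTicks) {
    // Guard against wrap-around when adding the two unsigned values.
    assert(offsetTicks + durationTicks >= offsetTicks);
    return TimeSpanMs{offsetTicks / kTicksPerMs,
                      (offsetTicks + durationTicks) / kTicksPerMs};
}
```

Applied to the log above, ToTimeSpan(6800000, 2700000) gives 680 ms to 950 ms for the first interim result "my", and ToTimeSpan(6800000, 28100000) gives 680 ms to 3490 ms for the final recognized sentence. In the Recognized handler the arguments would be e.Result->Offset() and e.Result->Duration().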
