Is there a way to get begin time and end time for the conversion result of stream audio?

klen 21 Reputation points
2021-01-18T14:45:08.767+00:00

I am using azure-speech to recognize audio stream, from speech_recognition_samples.cpp, from class RecognitionResult I only can get the Text and m_duration, but how can I get the begin time and end time of the result in the speech?
I use azure-speech in this way : write audio stream to AudioInputStream, and get result from SpeechRecognizer

void SpeechContinuousRecognitionWithPushStream()
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");

    auto pushStream = AudioInputStream::CreatePushStream();

    auto audioInput = AudioConfig::FromStreamInput(pushStream);
    auto recognizer = SpeechRecognizer::FromConfig(config, audioInput);
    promise<void> recognitionEnd;

    recognizer->Recognizing.Connect([](const SpeechRecognitionEventArgs& e)
    {
        cout << "Recognizing:" << e.Result->Text << std::endl
               << "  Offset=" << e.Result->Offset() << std::endl
              << "  Duration=" << e.Result->Duration() << std::endl;
    });

    recognizer->Recognized.Connect([](const SpeechRecognitionEventArgs& e)
    {
        if (e.Result->Reason == ResultReason::RecognizedSpeech)
        {
            cout << "RECOGNIZED: Text=" << e.Result->Text << std::endl
                << "  Offset=" << e.Result->Offset() << std::endl
                << "  Duration=" << e.Result->Duration() << std::endl;
        }
        else if (e.Result->Reason == ResultReason::NoMatch)
        {
            cout << "NOMATCH: Speech could not be recognized." << std::endl;
        }
    });

    recognizer->Canceled.Connect([&recognitionEnd](const SpeechRecognitionCanceledEventArgs& e)
    {
        switch (e.Reason)
        {
        case CancellationReason::EndOfStream:
            cout << "CANCELED: Reach the end of the file." << std::endl;
            break;

        case CancellationReason::Error:
            cout << "CANCELED: ErrorCode=" << (int)e.ErrorCode << std::endl;
            cout << "CANCELED: ErrorDetails=" << e.ErrorDetails << std::endl;
            recognitionEnd.set_value();
            break;

        default:
            cout << "CANCELED: received unknown reason." << std::endl;
        }

    });

    recognizer->SessionStopped.Connect([&recognitionEnd](const SessionEventArgs& e)
    {
        cout << "Session stopped.";
        recognitionEnd.set_value(); // Notify to stop recognition.
    });

    WavFileReader reader("whatstheweatherlike.wav");

    vector<uint8_t> buffer(1000);

    recognizer->StartContinuousRecognitionAsync().wait();

    int readSamples = 0;
    while((readSamples = reader.Read(buffer.data(), (uint32_t)buffer.size())) != 0)
    {
        // Push a buffer into the stream
        pushStream->Write(buffer.data(), readSamples);
    }

    // Close the push stream.
    pushStream->Close();

    // Waits for recognition end.
    recognitionEnd.get_future().get();

    // Stops recognition.
    recognizer->StopContinuousRecognitionAsync().get();
}
Azure AI Speech
Azure AI Speech
An Azure service that integrates speech processing into apps and services.
2,069 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator
    2021-01-19T04:43:16.01+00:00

    @klen There is an option to request word level timestamps by setting the same in your speech config settings. Similar, to setting the subscription key and region. This however does not explicitly give the begin and ending of a sentence but the offsets and duration of your sentence are mentioned. A sample output should look like below:

    # {"Duration":13400000,"NBest":[{"Confidence":0.9761951565742493,"Display":"What's the weather like?","ITN":"What's the weather like","Lexical":"what's the weather like","MaskedITN":"What's the weather like","Words":[{"Duration":3800000,"Offset":600000,"Word":"what's"},{"Duration":1200000,"Offset":4500000,"Word":"the"},{"Duration":2900000,"Offset":5800000,"Word":"weather"},{"Duration":4700000,"Offset":8800000,"Word":"like"}]},{"Confidence":0.9245584011077881,"Display":"what is the weather like","ITN":"what is the weather like","Lexical":"what is the weather like","MaskedITN":"what is the weather like","Words":[{"Duration":2900000,"Offset":600000,"Word":"what"},{"Duration":700000,"Offset":3600000,"Word":"is"},{"Duration":1300000,"Offset":4400000,"Word":"the"},{"Duration":2900000,"Offset":5800000,"Word":"weather"},{"Duration":4700000,"Offset":8800000,"Word":"like"}]}],"Offset":400000,"RecognitionStatus":"Success"}  
    

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.