Get speech recognition results
Reference documentation | Package (NuGet) | Additional samples on GitHub
In this how-to guide, you learn how to use speech recognition results.
Speech synchronization
You might want to synchronize transcriptions with an audio track, whether in real time or from a prerecording.
The Speech service returns the offset and duration of the recognized speech.
- Offset: The offset into the audio stream being recognized, expressed as duration. Offset is measured in ticks, starting from 0 (zero) tick, associated with the first audio byte processed by the SDK. For example, the offset begins when you start recognition, since that's when the SDK starts processing the audio stream. One tick represents one hundred nanoseconds or one ten-millionth of a second.
- Duration: Duration of the utterance that is being recognized. The duration in ticks doesn't include trailing or leading silence.
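As a minimal sketch of the tick unit described above, the following Python helpers convert a tick count to seconds and to a `timedelta`. The helper names and the `TICKS_PER_SECOND` constant are illustrative assumptions, not part of the Speech SDK.

```python
from datetime import timedelta

# One tick is 100 nanoseconds, so there are ten million ticks per second.
TICKS_PER_SECOND = 10_000_000


def ticks_to_seconds(ticks: int) -> float:
    """Convert a tick count to seconds (hypothetical helper)."""
    return ticks / TICKS_PER_SECOND


def ticks_to_timedelta(ticks: int) -> timedelta:
    """Convert a tick count to a timedelta (hypothetical helper).

    timedelta has microsecond resolution, and 10 ticks equal 1 microsecond,
    so any remainder below 10 ticks is dropped.
    """
    return timedelta(microseconds=ticks // 10)


print(ticks_to_seconds(17_000_000))    # 1.7
print(ticks_to_timedelta(34_500_000))  # 0:00:03.450000
```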
The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events provide intermediate results that are subject to change while an audio stream is being processed. Recognized events provide the final transcribed text once processing of an utterance is completed.
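One way to handle the difference between intermediate and final results is to keep only the latest intermediate text and replace it when the final text arrives. The following Python sketch simulates that pattern without the SDK. The `TranscriptBuffer` class and its method names are hypothetical, and in a real application the two methods would be called from your Recognizing and Recognized event handlers.

```python
class TranscriptBuffer:
    """Keeps finalized utterances plus the latest interim text (hypothetical helper)."""

    def __init__(self):
        self.final_lines = []  # text from Recognized events, never revised
        self.pending = ""      # text from the latest Recognizing event, may change

    def on_recognizing(self, text):
        # Each interim result replaces the previous one for the same utterance.
        self.pending = text

    def on_recognized(self, text):
        # The final result supersedes whatever interim text was shown.
        if text:
            self.final_lines.append(text)
        self.pending = ""

    def render(self):
        lines = list(self.final_lines)
        if self.pending:
            lines.append(self.pending + " ...")
        return "\n".join(lines)


# Simulated event sequence for one utterance.
buffer = TranscriptBuffer()
for interim in ["welcome", "welcome to", "welcome to applied math"]:
    buffer.on_recognizing(interim)
print(buffer.render())  # welcome to applied math ...

buffer.on_recognized("Welcome to applied Mathematics course 201.")
print(buffer.render())  # Welcome to applied Mathematics course 201.
```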
Recognizing offset and duration
With the Recognizing event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each Recognizing event comes with a textual estimate of the speech recognized so far.
This code snippet shows how to get the offset and duration from a Recognizing event.
speechRecognizer.Recognizing += (object sender, SpeechRecognitionEventArgs e) =>
{
    if (e.Result.Reason == ResultReason.RecognizingSpeech)
    {
        Console.WriteLine(String.Format("RECOGNIZING: {0}", e.Result.Text));
        Console.WriteLine(String.Format("Offset in Ticks: {0}", e.Result.OffsetInTicks));
        Console.WriteLine(String.Format("Duration in Ticks: {0}", e.Result.Duration.Ticks));
    }
};
Recognized offset and duration
Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the Recognized event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding SpeechConfig property as shown here:
speechConfig.RequestWordLevelTimestamps();
This code snippet shows how to get the offset and duration from a Recognized event.
speechRecognizer.Recognized += (object sender, SpeechRecognitionEventArgs e) =>
{
    if (ResultReason.RecognizedSpeech == e.Result.Reason && e.Result.Text.Length > 0)
    {
        Console.WriteLine($"RECOGNIZED: Text={e.Result.Text}");
        Console.WriteLine(String.Format("Offset in Ticks: {0}", e.Result.OffsetInTicks));
        Console.WriteLine(String.Format("Duration in Ticks: {0}", e.Result.Duration.Ticks));
        var detailedResults = e.Result.Best();
        if (detailedResults != null && detailedResults.Any())
        {
            // The first item in detailedResults corresponds to the recognized text.
            // This is not necessarily the item with the highest confidence number.
            var bestResults = detailedResults?.ToList()[0];
            Console.WriteLine(String.Format("\tConfidence: {0}\n\tText: {1}\n\tLexicalForm: {2}\n\tNormalizedForm: {3}\n\tMaskedNormalizedForm: {4}",
                bestResults.Confidence, bestResults.Text, bestResults.LexicalForm, bestResults.NormalizedForm, bestResults.MaskedNormalizedForm));
            // You must set speechConfig.RequestWordLevelTimestamps() to get word-level timestamps.
            Console.WriteLine($"\tWord-level timing:");
            Console.WriteLine($"\t\tWord | Offset | Duration");
            Console.WriteLine($"\t\t----- | ----- | ----- ");
            foreach (var word in bestResults.Words)
            {
                Console.WriteLine($"\t\t{word.Word} | {word.Offset} | {word.Duration}");
            }
        }
    }
};
Example offset and duration
The following table shows potential offset and duration in ticks when a speaker says "Welcome to Applied Mathematics course 201." In this example, the offset doesn't change throughout the Recognizing and Recognized events. However, don't rely on the offset to remain the same between the Recognizing and Recognized events, since the final result could be different.
Event | Text | Offset (in ticks) | Duration (in ticks) |
---|---|---|---|
RECOGNIZING | welcome | 17000000 | 5000000 |
RECOGNIZING | welcome to | 17000000 | 6400000 |
RECOGNIZING | welcome to applied math | 17000000 | 13600000 |
RECOGNIZING | welcome to applied mathematics | 17000000 | 17200000 |
RECOGNIZING | welcome to applied mathematics course | 17000000 | 23700000 |
RECOGNIZING | welcome to applied mathematics course 2 | 17000000 | 26700000 |
RECOGNIZING | welcome to applied mathematics course 201 | 17000000 | 33400000 |
RECOGNIZED | Welcome to applied Mathematics course 201. | 17000000 | 34500000 |
The total duration of the first utterance was 3.45 seconds. It was recognized at 1.7 to 5.15 seconds offset from the start of the audio stream being recognized (00:00:01.700 --> 00:00:05.150).
If the speaker then continues to say "OK, now let's get started," a new offset is calculated from the start of the audio stream being recognized to the start of the new utterance. The following table shows potential offset and duration for an utterance that started two seconds after the previous utterance ended.
Event | Text | Offset (in ticks) | Duration (in ticks) |
---|---|---|---|
RECOGNIZING | OK | 71500000 | 3100000 |
RECOGNIZING | OK now | 71500000 | 10300000 |
RECOGNIZING | OK now let's | 71500000 | 14700000 |
RECOGNIZING | OK now let's get started | 71500000 | 18500000 |
RECOGNIZED | OK, now let's get started. | 71500000 | 20600000 |
The total duration of the second utterance was 2.06 seconds. It was recognized at 7.15 to 9.21 seconds offset from the start of the audio stream being recognized (00:00:07.150 --> 00:00:09.210).
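The timestamp ranges quoted above can be derived from the tick values in the tables. The following Python sketch shows that arithmetic for the two example utterances. The function names are illustrative assumptions, and the `HH:MM:SS.mmm --> HH:MM:SS.mmm` layout simply mirrors the ranges shown in this article.

```python
def format_timestamp(ticks: int) -> str:
    """Format a tick count as HH:MM:SS.mmm (hypothetical helper)."""
    total_ms = ticks // 10_000  # 10,000 ticks per millisecond
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"


def caption_span(offset_ticks: int, duration_ticks: int) -> str:
    """Return 'start --> end', where end is offset plus duration."""
    start = format_timestamp(offset_ticks)
    end = format_timestamp(offset_ticks + duration_ticks)
    return f"{start} --> {end}"


# Offset and duration of the two RECOGNIZED rows in the example tables.
print(caption_span(17_000_000, 34_500_000))  # 00:00:01.700 --> 00:00:05.150
print(caption_span(71_500_000, 20_600_000))  # 00:00:07.150 --> 00:00:09.210
```

The end of each range is the offset plus the duration, which is why the second utterance, starting two seconds after the first one ended at 5.15 seconds, has an offset of 71500000 ticks.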
Reference documentation | Package (NuGet) | Additional samples on GitHub
Recognizing offset and duration
With the Recognizing event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each Recognizing event comes with a textual estimate of the speech recognized so far.
This code snippet shows how to get the offset and duration from a Recognizing event.
speechRecognizer->Recognizing.Connect([](const SpeechRecognitionEventArgs& e)
{
    std::cout << "Recognizing:" << e.Result->Text << std::endl;
    std::cout << "Offset in Ticks:" << e.Result->Offset() << std::endl;
    std::cout << "Duration in Ticks:" << e.Result->Duration() << std::endl;
});
Recognized offset and duration
Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the Recognized event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding SpeechConfig property as shown here:
speechConfig->RequestWordLevelTimestamps();
Reference documentation | Package (Go) | Additional samples on GitHub
Recognizing offset and duration
With the Recognizing event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each Recognizing event comes with a textual estimate of the speech recognized so far.
This code snippet shows how to get the offset and duration from a Recognizing event.
func recognizingHandler(event speech.SpeechRecognitionEventArgs) {
	defer event.Close()
	fmt.Println("Recognizing:", event.Result.Text)
	fmt.Println("Offset in Ticks:", event.Result.Offset)
	fmt.Println("Duration in Ticks:", event.Result.Duration)
}
Recognized offset and duration
Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the Recognized event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding SpeechConfig property as shown here:
speechConfig.RequestWordLevelTimestamps()
Reference documentation | Additional samples on GitHub
Recognizing offset and duration
With the Recognizing event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each Recognizing event comes with a textual estimate of the speech recognized so far.
This code snippet shows how to get the offset and duration from a Recognizing event.
speechRecognizer.recognizing.addEventListener((s, e) -> {
    System.out.println("RECOGNIZING: " + e.getResult().getText());
    System.out.println("Offset in Ticks: " + e.getResult().getOffset());
    System.out.println("Duration in Ticks: " + e.getResult().getDuration());
});
Recognized offset and duration
Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the Recognized event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding SpeechConfig property as shown here:
speechConfig.requestWordLevelTimestamps();
Reference documentation | Package (npm) | Additional samples on GitHub | Library source code
Recognizing offset and duration
With the Recognizing event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each Recognizing event comes with a textual estimate of the speech recognized so far.
This code snippet shows how to get the offset and duration from a Recognizing event.
speechRecognizer.recognizing = function (s, e) {
    console.log("RECOGNIZING: " + e.result.text);
    console.log("Offset in Ticks: " + e.result.offset);
    console.log("Duration in Ticks: " + e.result.duration);
};
Recognized offset and duration
Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the Recognized event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding SpeechConfig property as shown here:
speechConfig.requestWordLevelTimestamps();
Reference documentation | Package (download) | Additional samples on GitHub
Recognizing offset and duration
With the Recognizing event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each Recognizing event comes with a textual estimate of the speech recognized so far.
Recognized offset and duration
Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the Recognized event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding SpeechConfig property as shown here:
[speechConfig requestWordLevelTimestamps];
Reference documentation | Package (PyPi) | Additional samples on GitHub
Recognizing offset and duration
With the Recognizing event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each Recognizing event comes with a textual estimate of the speech recognized so far.
This code snippet shows how to get the offset and duration from a Recognizing event.
import azure.cognitiveservices.speech as speechsdk

def recognizing_handler(e: speechsdk.SpeechRecognitionEventArgs):
    # Intermediate result: the text and timing can still change before the final Recognized event.
    if speechsdk.ResultReason.RecognizingSpeech == e.result.reason and len(e.result.text) > 0:
        print("Recognizing: {}".format(e.result.text))
        print("Offset in Ticks: {}".format(e.result.offset))
        print("Duration in Ticks: {}".format(e.result.duration))
Recognized offset and duration
Once an utterance has been recognized, you can get the offset and duration of the recognized speech. With the Recognized event, you can also get the offset and duration per word. To request the offset and duration per word, first you must set the corresponding SpeechConfig property as shown here:
speech_config.request_word_level_timestamps()
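Once word-level timestamps are requested, the per-word offset and duration arrive as part of the detailed JSON result. The following sketch shows one way to read them. It assumes the JSON carries an `NBest` list whose first entry has a `Words` array with `Word`, `Offset`, and `Duration` fields, and that the JSON string is available from the result (for example through `e.result.json`). The sample payload and the `word_timings` helper are hypothetical, and the per-word values are made up for illustration.

```python
import json

# Hypothetical, trimmed payload for illustration only. The values are invented.
SAMPLE_RESULT_JSON = """
{
  "RecognitionStatus": "Success",
  "Offset": 17000000,
  "Duration": 34500000,
  "DisplayText": "Welcome to applied Mathematics course 201.",
  "NBest": [
    {
      "Display": "Welcome to applied Mathematics course 201.",
      "Words": [
        {"Word": "welcome", "Offset": 17000000, "Duration": 5000000},
        {"Word": "to", "Offset": 22000000, "Duration": 1400000}
      ]
    }
  ]
}
"""


def word_timings(result_json: str) -> list:
    """Return (word, offset_ticks, duration_ticks) tuples for the top hypothesis."""
    payload = json.loads(result_json)
    n_best = payload.get("NBest", [])
    if not n_best:
        # No hypotheses, for example when nothing was recognized.
        return []
    words = n_best[0].get("Words", [])
    return [(w["Word"], w["Offset"], w["Duration"]) for w in words]


for word, offset, duration in word_timings(SAMPLE_RESULT_JSON):
    print(f"{word}: offset={offset} ticks, duration={duration} ticks")
```

Returning an empty list when `NBest` or `Words` is missing keeps the caller's loop simple for results that contain no recognized speech.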
Example offset and duration
The following table shows potential offset and duration in ticks when a speaker says "Welcome to Applied Mathematics course 201." In this example, the offset doesn't change throughout the Recognizing and Recognized events. However, don't rely on the offset to remain the same between the Recognizing and Recognized events, since the final result could be different.
Event | Text | Offset (in ticks) | Duration (in ticks) |
---|---|---|---|
RECOGNIZING | welcome | 17000000 | 5000000 |
RECOGNIZING | welcome to | 17000000 | 6400000 |
RECOGNIZING | welcome to applied math | 17000000 | 13600000 |
RECOGNIZING | welcome to applied mathematics | 17000000 | 17200000 |
RECOGNIZING | welcome to applied mathematics course | 17000000 | 23700000 |
RECOGNIZING | welcome to applied mathematics course 2 | 17000000 | 26700000 |
RECOGNIZING | welcome to applied mathematics course 201 | 17000000 | 33400000 |
RECOGNIZED | Welcome to applied Mathematics course 201. | 17000000 | 34500000 |
The total duration of the first utterance was 3.45 seconds. It was recognized at 1.7 to 5.15 seconds offset from the start of the audio stream being recognized (00:00:01.700 --> 00:00:05.150).
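The time range quoted above follows directly from the offset and duration of the final Recognized result. As a rough sketch, the conversion to a caption-style time range could look like this; `format_timestamp` and `caption_range` are hypothetical helper names, not SDK functions.

```python
# One tick is 100 nanoseconds, so there are 10,000 ticks per millisecond.
TICKS_PER_MILLISECOND = 10_000


def format_timestamp(ticks: int) -> str:
    """Format a tick count as HH:MM:SS.mmm, truncating to whole milliseconds."""
    total_ms = ticks // TICKS_PER_MILLISECOND
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def caption_range(offset_ticks: int, duration_ticks: int) -> str:
    """Return 'start --> end', where end is offset plus duration."""
    end_ticks = offset_ticks + duration_ticks
    return f"{format_timestamp(offset_ticks)} --> {format_timestamp(end_ticks)}"


# Offset and duration of the RECOGNIZED row in the table above.
print(caption_range(17_000_000, 34_500_000))  # 00:00:01.700 --> 00:00:05.150
```

Integer arithmetic is used throughout so that the formatted timestamps don't pick up floating-point rounding errors.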
If the speaker then continues with "OK, now let's get started," a new offset is calculated from the start of the audio stream being recognized to the start of the new utterance. The following table shows potential offset and duration for an utterance that started two seconds after the previous utterance ended.
Event | Text | Offset (in ticks) | Duration (in ticks) |
---|---|---|---|
RECOGNIZING | OK | 71500000 | 3100000 |
RECOGNIZING | OK now | 71500000 | 10300000 |
RECOGNIZING | OK now let's | 71500000 | 14700000 |
RECOGNIZING | OK now let's get started | 71500000 | 18500000 |
RECOGNIZED | OK, now let's get started. | 71500000 | 20600000 |
The total duration of the second utterance was 2.06 seconds. It was recognized at 7.15 to 9.21 seconds offset from the start of the audio stream being recognized (00:00:07.150 --> 00:00:09.210).
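Because every offset is measured from the start of the audio stream, the silence between two utterances is the next offset minus the end of the previous utterance. The following sketch checks the two-second gap from the example tables; the `Utterance` type and `silence_gaps` function are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple

TICKS_PER_SECOND = 10_000_000


class Utterance(NamedTuple):
    text: str
    offset: int    # ticks from the start of the audio stream
    duration: int  # ticks


def silence_gaps(utterances: List[Utterance]) -> List[float]:
    """Return the silence, in seconds, between each pair of consecutive utterances."""
    gaps = []
    for previous, current in zip(utterances, utterances[1:]):
        previous_end = previous.offset + previous.duration
        gaps.append((current.offset - previous_end) / TICKS_PER_SECOND)
    return gaps


# Final (RECOGNIZED) rows from the two example tables.
finals = [
    Utterance("Welcome to applied Mathematics course 201.", 17_000_000, 34_500_000),
    Utterance("OK, now let's get started.", 71_500_000, 20_600_000),
]
print(silence_gaps(finals))  # [2.0]
```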
In this how-to guide, you learn how to use speech recognition results.
Speech synchronization
You might want to synchronize transcriptions with an audio track, whether it's done in real-time or with a prerecording.
The Speech service returns the offset and duration of the recognized speech.
- Offset: The offset into the audio stream being recognized, expressed as duration. Offset is measured in ticks, starting from 0 (zero) tick, associated with the first audio byte processed by the SDK. For example, the offset begins when you start recognition, since that's when the SDK starts processing the audio stream. One tick represents one hundred nanoseconds or one ten-millionth of a second.
- Duration: Duration of the utterance that is being recognized. The duration in ticks doesn't include trailing or leading silence.
The end of a single utterance is determined by listening for silence at the end. You won't get the final recognition result until an utterance has completed. Recognizing events will provide intermediate results that are subject to change while an audio stream is being processed. Recognized events will provide the final transcribed text once processing of an utterance is completed.
Recognizing offset and duration
With the Recognizing event, you can get the offset and duration of the speech being recognized. Offset and duration per word are not available while recognition is in progress. Each Recognizing event comes with a textual estimate of the speech recognized so far.
For example, run the following command to get the offset and duration of the recognized speech:
spx recognize --file caption.this.mp4 --format any --output each file - @output.each.detailed
Since the @output.each.detailed argument was set, the output includes the following column headers:
audio.input.id event event.sessionid result.reason result.latency result.text result.json
In the result.json column, you can find details that include offset and duration for the Recognizing and Recognized events:
{
"Id": "492574cd8555481a92c22f5ff757ef17",
"RecognitionStatus": "Success",
"DisplayText": "Welcome to applied Mathematics course 201.",
"Offset": 1800000,
"Duration": 30500000
}
For more information, see the Speech CLI datastore configuration and output options.
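If you post-process the CLI output in a script, the Offset and Duration fields of the result.json value can be turned into start and end times. The following Python sketch assumes you have already extracted one result.json value as a string; the `timing_in_seconds` helper is a hypothetical name, and the sample string copies the JSON shown above.

```python
import json
from typing import Optional, Tuple

TICKS_PER_SECOND = 10_000_000

# The result.json value from the example above.
RESULT_JSON = """
{
  "Id": "492574cd8555481a92c22f5ff757ef17",
  "RecognitionStatus": "Success",
  "DisplayText": "Welcome to applied Mathematics course 201.",
  "Offset": 1800000,
  "Duration": 30500000
}
"""


def timing_in_seconds(result_json: str) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds, or None if recognition didn't succeed."""
    payload = json.loads(result_json)
    if payload.get("RecognitionStatus") != "Success":
        return None
    start_ticks = payload["Offset"]
    end_ticks = start_ticks + payload["Duration"]
    return start_ticks / TICKS_PER_SECOND, end_ticks / TICKS_PER_SECOND


print(timing_in_seconds(RESULT_JSON))  # (0.18, 3.23)
```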
Example offset and duration
The following table shows potential offset and duration in ticks when a speaker says "Welcome to Applied Mathematics course 201." In this example, the offset doesn't change throughout the Recognizing and Recognized events. However, don't rely on the offset to remain the same between the Recognizing and Recognized events, since the final result could be different.
Event | Text | Offset (in ticks) | Duration (in ticks) |
---|---|---|---|
RECOGNIZING | welcome | 17000000 | 5000000 |
RECOGNIZING | welcome to | 17000000 | 6400000 |
RECOGNIZING | welcome to applied math | 17000000 | 13600000 |
RECOGNIZING | welcome to applied mathematics | 17000000 | 17200000 |
RECOGNIZING | welcome to applied mathematics course | 17000000 | 23700000 |
RECOGNIZING | welcome to applied mathematics course 2 | 17000000 | 26700000 |
RECOGNIZING | welcome to applied mathematics course 201 | 17000000 | 33400000 |
RECOGNIZED | Welcome to applied Mathematics course 201. | 17000000 | 34500000 |
The total duration of the first utterance was 3.45 seconds. It was recognized at 1.7 to 5.15 seconds offset from the start of the audio stream being recognized (00:00:01.700 --> 00:00:05.150).
If the speaker then continues with "OK, now let's get started," a new offset is calculated from the start of the audio stream being recognized to the start of the new utterance. The following table shows potential offset and duration for an utterance that started two seconds after the previous utterance ended.
Event | Text | Offset (in ticks) | Duration (in ticks) |
---|---|---|---|
RECOGNIZING | OK | 71500000 | 3100000 |
RECOGNIZING | OK now | 71500000 | 10300000 |
RECOGNIZING | OK now let's | 71500000 | 14700000 |
RECOGNIZING | OK now let's get started | 71500000 | 18500000 |
RECOGNIZED | OK, now let's get started. | 71500000 | 20600000 |
The total duration of the second utterance was 2.06 seconds. It was recognized at 7.15 to 9.21 seconds offset from the start of the audio stream being recognized (00:00:07.150 --> 00:00:09.210).