Word Offset By Result.Text and not LexicalForm

Question

Support,

Issue #1
To track audio against the output text, I need the words in e.Result.Best().words to match the words in e.Result.Text.
Example:
Say e.Result.Text is "James Bond 007".
e.Result.Best().words is then an array of "James", "Bond", "double", "oh", "seven" with the word offsets. The problem is that nothing maps these words to the words in the text output. Perhaps offer e.Result.Best().LexicalWords and e.Result.Best().NormalizedWords so that a developer can track audio against the output text.
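
To make the missing mapping concrete, here is a rough sketch in Python (chosen only because it needs no SDK) of how a client could guess which lexical words belong to each display token. It is a heuristic, not something the Speech SDK provides: the function name align_display_to_lexical is hypothetical, and it assumes both sides are already split into tokens with punctuation stripped. Tokens that are identical on both sides act as anchors, and whatever falls between two anchors is grouped together.

```python
from difflib import SequenceMatcher


def align_display_to_lexical(display_tokens, lexical_words):
    """Return, for each display token, the indices of the lexical words it
    appears to cover. Hypothetical helper, not part of any SDK."""
    a = [t.lower() for t in display_tokens]
    b = [w.lower() for w in lexical_words]
    mapping = [[] for _ in display_tokens]

    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical tokens map one-to-one.
            for k in range(i2 - i1):
                mapping[i1 + k] = [j1 + k]
        elif tag == "replace":
            # Differing runs (e.g. "007" vs "double oh seven"): every display
            # token in the run is mapped to the whole lexical run, because the
            # split inside the run cannot be recovered from text alone.
            for i in range(i1, i2):
                mapping[i] = list(range(j1, j2))
        # "delete": display token with no lexical counterpart, stays empty.
        # "insert": lexical words with no display token, ignored.
    return mapping


display = ["James", "Bond", "007"]
lexical = ["James", "Bond", "double", "oh", "seven"]
print(align_display_to_lexical(display, lexical))
```

This breaks down when the same word repeats or when several formatted tokens sit next to each other, which is why a word list supplied by the service would be far more reliable.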

Issue #2
Saying "New Line" outputs
Saying "New Paragraph" outputs

Ideal output would be
New Line outputs
New Paragraph outputs

Issue #3
Is there Azure Cognitive Speech To Text documentation for the SpeechConfig.SetServiceProperty options?
So far I have found only one: config.SetServiceProperty("punctuation", "explicit", ServicePropertyChannel.UriQueryParameter);
It removes the period at the end of each phrase.
It would be nice to have these options documented, as I cannot find any reference to them in the documentation.

Answer

Support,

I've reviewed the JSON and it still doesn't solve the problem. I need to know the relationship between the DisplayText words and the word timings in the detailed output. When the DisplayText outputs 007 and the word timings output "double", "oh", "seven" as three different words, I can't tell that 007 corresponds to those three words, because there is no reference between them. There needs to be a reference from each display word to its audio words in order to track offset and duration in the underlying audio file. Our only option would be to use the words alone and build a post-processor that repeats the conversion the model already performs for the formatted DisplayText, so that the reference is maintained.

Ideally, there should be two word lists: the current one with the lexical version (which is what it appears to be), and a new one with the final formatted words, each combining the Offset/Duration of the formatted phrase. This would give the exact offset and duration for each output word. Example:
Option for new list of words:
Word = "007", Offset = 1000, Duration = 30000
Current:
Word = "double", Offset = 1000, Duration = 10000
Word = "oh", Offset = 11000, Duration = 5000
Word = "seven", Offset = 16000, Duration = 15000
Please submit this to development, as it is a deal breaker for anyone trying to sync audio.
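
For reference, the combined entry in the example follows from simple arithmetic: the offset is the start of the first lexical word, and the duration runs to the end of the last one. A minimal sketch in Python, where WordTiming and merge_timings are hypothetical names for illustration and not SDK types:

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    word: str
    offset: int    # start position, same unit as the service reports
    duration: int  # length, same unit as offset


def merge_timings(display_word, parts):
    """Combine the timings of several lexical words into one display word."""
    if not parts:
        raise ValueError("need at least one lexical word to merge")
    start = min(p.offset for p in parts)
    end = max(p.offset + p.duration for p in parts)
    return WordTiming(display_word, start, end - start)


lexical = [
    WordTiming("double", 1000, 10000),
    WordTiming("oh", 11000, 5000),
    WordTiming("seven", 16000, 15000),
]
merged = merge_timings("007", lexical)
print(merged)  # WordTiming(word='007', offset=1000, duration=30000)
```

The arithmetic is the easy part. Knowing which lexical words belong to "007" is the piece a client cannot reliably determine, and that is what the request is about.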
