Word Offset By Result.Text and not LexicalForm

Dave Revell 1 Reputation point
2020-12-08T21:10:26.943+00:00

Support,

Issue #1
In order to track audio against the output text, I need the words in e.Result.Best().words to match the words in e.Result.Text.
Example:
Say "James Bond 007" is e.Result.Text
e.Result.Best().words is an array of "James", "Bond", "double", "oh", "seven" with the word offsets. The problem is that nothing maps these words to the words in the text output. Maybe there could be e.Result.Best().LexicalWords and e.Result.Best().NormalizedWords so that a developer can track audio against output text.

Issue #2
Saying "New Line" outputs \n
Saying "New Paragraph" outputs \n

Ideal output would be
New Line outputs \r\n
New Paragraph outputs \r\n\r\n
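
Until the service output changes, the line-break part can be handled after recognition. This is only a minimal sketch in Python, assuming the recognized text arrives as a plain string; `normalize_line_breaks` is a hypothetical helper, not part of the Speech SDK. Because "New Line" and "New Paragraph" both currently arrive as a single \n, this cannot tell them apart; it only converts line endings to \r\n.

```python
import re


def normalize_line_breaks(text: str) -> str:
    """Convert every line break (CRLF, lone CR, or lone LF) to CRLF.

    Hypothetical helper for illustration; not a Speech SDK function.
    Existing CRLF pairs are matched first, so they are not doubled.
    """
    return re.sub(r"\r\n|\r|\n", "\r\n", text)


if __name__ == "__main__":
    recognized = "First line\nSecond line"
    # repr() makes the control characters visible in the console
    print(repr(normalize_line_breaks(recognized)))
```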

Issue #3
Is there Azure Cognitive Speech To Text documentation for the
SpeechConfig.SetServiceProperty options?
So far I have found only one:
config.SetServiceProperty("punctuation", "explicit", ServicePropertyChannel.UriQueryParameter);
which removes the period at the end of each phrase.
It would be nice to have these options documented, as I cannot find any reference to them in the documentation.

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

1 answer

  1. Dave Revell 1 Reputation point
    2020-12-20T23:01:03.123+00:00

    Support,

    I've reviewed the JSON and it still doesn't solve the problem. I need to know the relationship between the DisplayText words and the word timings in the detailed output. When the DisplayText outputs 007 and the word timings output "double", "oh", "seven" as three separate words, I have no way of knowing that 007 corresponds to those three words, because nothing links them. There needs to be a reference from each display word to its audio words so that offset/duration can be tracked against the underlying audio file.

    The only option we have right now is to use only the lexical words and build a post-processor that repeats the conversion the model already performs to produce the formatted DisplayText, so that the reference is maintained. Ideally, there would be two word lists: the current one with the lexical version (which is what it appears to be), and a new one containing the final formatted words, each with the combined Offset/Duration of the formatted phrase. This would give the exact offset/duration for each output word. Example:
    Option for new list of words:
    Word = "007", Offset = 1000, Duration = 30000
    Current:
    Word = "double", Offset = 1000, Duration = 10000
    Word = "oh", Offset = 11000, Duration = 5000
    Word = "seven", Offset = 16000, Duration = 15000
    Please submit this to development, as it is a deal breaker for anyone trying to sync audio.
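
To make the requested behaviour concrete, here is a minimal sketch in Python of how the combined entry in the example would be computed. `WordTiming` and `merge_timings` are hypothetical names, not Speech SDK types, and the sketch assumes the caller already knows which lexical words belong to the display word, which is exactly the information the service does not currently return.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordTiming:
    """Hypothetical record mirroring the Word/Offset/Duration fields above."""
    word: str
    offset: int
    duration: int


def merge_timings(display_word: str, lexical_words: List[WordTiming]) -> WordTiming:
    """Combine the timings of several lexical words into one display word.

    The merged entry starts at the earliest lexical offset and ends where
    the last lexical word ends. The grouping itself is supplied by the
    caller (an assumption); the service gives no such mapping.
    """
    if not lexical_words:
        raise ValueError("at least one lexical word is required")
    start = min(w.offset for w in lexical_words)
    end = max(w.offset + w.duration for w in lexical_words)
    return WordTiming(display_word, start, end - start)


# Values taken from the example in this post
lexical = [
    WordTiming("double", 1000, 10000),
    WordTiming("oh", 11000, 5000),
    WordTiming("seven", 16000, 15000),
]
merged = merge_timings("007", lexical)
print(merged)  # WordTiming(word='007', offset=1000, duration=30000)
```

The hard part of a real post-processor is deciding the grouping ("double oh seven" becomes 007), which would mean reproducing the service's own text normalization; the arithmetic above is the easy half.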