Support,
I've reviewed the JSON and it still doesn't solve the problem. I need to know the relationship between the words in DisplayText and the word timings in the detail. When DisplayText outputs "007" but the word timings output "double", "oh", and "seven" as three separate words, there is no reference telling me that "007" corresponds to those three words. There needs to be a reference from each display word to the underlying audio words so that the offset/duration within the audio file can be tracked. As it stands, our only option is to use the lexical words alone and build a post-processor that redoes the conversion the model already performs for the formatted DisplayText, just so that reference is maintained. Ideally, there should be two word lists: the current one with the lexical version (which it appears to be), and a new one containing the final formatted words, each combining the offset/duration of the lexical words it replaces. This would give the exact offset/duration for every outputted word. Example:
Proposed new list of formatted words:
Word = "007", Offset = 1000, Duration = 30000

Current list of lexical words:
Word = "double", Offset = 1000, Duration = 10000
Word = "oh", Offset = 11000, Duration = 5000
Word = "seven", Offset = 16000, Duration = 15000
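To make the request concrete, here is a minimal sketch of the merge I'm describing, assuming the service exposed (or a post-processor could recover) a mapping from each formatted word to its run of lexical words. The `WordTiming` type and `combine_timings` function are hypothetical names for illustration, not part of any existing API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WordTiming:
    word: str
    offset: int    # ticks from the start of the audio
    duration: int  # ticks

def combine_timings(lexical: List[WordTiming], display_word: str) -> WordTiming:
    """Collapse a run of lexical words into one formatted display word.

    Offset is the offset of the first lexical word; duration spans from
    that offset to the end of the last lexical word.
    """
    first, last = lexical[0], lexical[-1]
    return WordTiming(
        word=display_word,
        offset=first.offset,
        duration=(last.offset + last.duration) - first.offset,
    )

# The three lexical words from the example above:
timings = [
    WordTiming("double", 1000, 10000),
    WordTiming("oh", 11000, 5000),
    WordTiming("seven", 16000, 15000),
]
combined = combine_timings(timings, "007")
# combined -> WordTiming(word="007", offset=1000, duration=30000)
```

The hard part is not this arithmetic but knowing which lexical words belong to which display word, and that alignment is exactly what the model has and the JSON currently omits.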
Please submit this to development, as it is a deal-breaker for anyone trying to sync audio to the transcript.