Transcription offsets incorrect for analyzed audio file

Christopher Watford 1 Reputation point
2021-06-08T15:31:02.757+00:00

While investigating some transcription oddities, I found that one of our transcripts has offsets that are off by ~20 seconds.

The asset metadata shows:
"AssetFile": [
  {
    "StartTime": "PT0.528S",
    "Duration": "PT37M41.635S"
  }
]

And reviewing the first block of spoken text from the transcription:
{
  "id": 1,
  "text": "Hi Melanie. Yes,",
  "confidence": 0.558,
  "speakerId": 1,
  "language": "en-US",
  "instances": [
    {
      "adjustedStart": "0:00:32.39",
      "adjustedEnd": "0:00:35.76",
      "start": "0:00:32.39",
      "end": "0:00:35.76"
    }
  ]
},

When I review the audio file itself, the first spoken word occurs at ~0:00:52, so the transcript offset is about 20 seconds early.
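
The size of the mismatch can be checked with a few lines of Python. This is only a sketch of the arithmetic: `parse_offset` is a hypothetical helper (not part of any Azure SDK), and the 0:00:52 value is the approximate position measured by ear, so the result is approximate too.

```python
from datetime import timedelta


def parse_offset(text: str) -> timedelta:
    """Parse an 'H:MM:SS[.ff]' offset such as '0:00:32.39' (hypothetical helper)."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


# "start" of the first transcript instance, as reported by the service.
reported_start = parse_offset("0:00:32.39")
# Approximate position of the first spoken word in the source file.
observed_start = parse_offset("0:00:52")

drift_seconds = (observed_start - reported_start).total_seconds()
print(f"transcript is early by about {drift_seconds:.2f} s")
```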

A separate audio track of another speaker was correctly encoded and timestamped by the service. When both audio tracks are played side by side locally, their timing is correct.

Is there a way to get the correct offsets into the file? Is there an option I'm missing or a piece of data that I should be retrieving to "fix-up" these timestamps?

Azure Media Services
A group of Azure services that includes encoding, format conversion, on-demand streaming, content protection, and live streaming services.

2 answers

  1. John Deutscher (MSFT) 2,126 Reputation points
    2021-06-08T17:19:01.78+00:00

    That could indicate a problem with the timestamps in the source file, or the presence of an "edit" atom (an MP4 format thing!) that introduces an offset into the timestamps of the MP4 file. The only way we can confirm is to look at the source file and see what happened.
    Is this happening on lots of your content? Or just a few files? What is the source of those files that are showing this issue?

    To have someone on our team take a look at the file, please file a support ticket in the Azure Portal from inside your Media Services account. You will then be able to upload the source file so the support team can take a closer look.


  2. Christopher Watford 1 Reputation point
    2021-06-08T18:05:17.377+00:00

    From what I can find, this is not a common occurrence. The audio tracks come from Twilio:

    Metadata:
      encoder         : GStreamer matroskamux version 1.8.1.1
      creation_time   : ...
    Duration: 00:37:41.63, start: 0.528000, bitrate: 36 kb/s
      Stream #0:0(eng): Audio: opus, 48000 Hz, stereo, fltp (default)
      Metadata:
        title           : Audio

    Reviewing the Opus data, it looks like there is a blank "gap" early in the track, which would account for that offset:

    • The first cluster begins at 0.528 as you'd expect
    • The first cluster ends, without any large gaps, at 33.277
    • The second cluster begins at 33.298
    • There is a gap between the simple blocks with timestamps 34.298 and 49.559

    When I extract the Opus audio from the WebM container myself using FFmpeg, the gaps appear to be cut out and, lo and behold, I get the same audio offset as Azure Media Services. Common software? Common failure? Not sure.
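
If the gaps really are dropped during extraction, one possible client-side "fix-up" is to find the gaps from the block timestamps and add them back onto the transcript offsets. The sketch below is not an Azure Media Services or FFmpeg feature. `find_gaps` and `to_original` are hypothetical helpers, the 20 ms frame duration and the 0.5 s tolerance are assumptions, and only the 34.298 and 49.559 timestamps come from the file described above (the neighbouring values are made up for illustration).

```python
def find_gaps(block_times, frame_duration=0.02, tolerance=0.5):
    """Return (gap_start, gap_length) pairs, in seconds, on the original timeline.

    block_times: sorted block timestamps read from the container.
    frame_duration: assumed duration of each block (an assumption, not measured).
    tolerance: minimum amount of missing audio that counts as a gap.
    """
    gaps = []
    for prev, cur in zip(block_times, block_times[1:]):
        missing = cur - prev - frame_duration
        if missing > tolerance:
            # The gap starts where the audio of the previous block ends.
            gaps.append((prev + frame_duration, missing))
    return gaps


def to_original(compact_time, gaps, first_timestamp=0.0):
    """Map an offset on the gap-free (compacted) audio back to the original timeline.

    first_timestamp: start time of the track; whether it must be added back
    depends on how the audio was extracted, so it is left as a parameter.
    """
    original = compact_time + first_timestamp
    for gap_start, gap_length in gaps:  # gaps are ordered by start time
        if original >= gap_start:
            original += gap_length
        else:
            break
    return original


# Illustrative timestamps around the gap reported above.
block_times = [34.258, 34.278, 34.298, 49.559, 49.579]
gaps = find_gaps(block_times)
print(gaps)
print(to_original(30.0, gaps), to_original(40.0, gaps))
```

Offsets before the gap are left unchanged and offsets after it are shifted by the length of the missing audio. A real file may contain several gaps, so the full list of block timestamps would be needed to account for the whole ~20 second difference.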

    I'll open a ticket, thank you!
