Transcription offsets incorrect for analyzed audio file

Question

While investigating some transcription oddities, I found that one of our transcripts has offsets that are off by ~20 seconds.

The asset metadata shows:
"AssetFile": [
  {
    "StartTime": "PT0.528S",
    "Duration": "PT37M41.635S"
  }
]

And reviewing the first block of spoken text from the transcription:
{
  "id": 1,
  "text": "Hi Melanie. Yes,",
  "confidence": 0.558,
  "speakerId": 1,
  "language": "en-US",
  "instances": [
    {
      "adjustedStart": "0:00:32.39",
      "adjustedEnd": "0:00:35.76",
      "start": "0:00:32.39",
      "end": "0:00:35.76"
    }
  ]
},

Listening to the audio file itself, the first word is spoken at ~0:00:52, so the transcript is about 20 seconds out of sync.
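
For reference, the size of the discrepancy can be worked out from the values above. This is only a minimal sketch: `parse_offset` is a hypothetical helper name, and the 52-second figure is the approximate time I hear the first word, not a value returned by the service.

```python
def parse_offset(value: str) -> float:
    """Convert an "H:MM:SS.ff" string from the transcript into seconds."""
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# "start" of the first instance in the transcript JSON.
reported_start = parse_offset("0:00:32.39")

# Approximate position of the first spoken word when listening to the file.
observed_start = 52.0

drift = observed_start - reported_start
print(f"transcript is early by about {drift:.2f} s")
```
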

A separate audio track of another speaker was correctly encoded and time stamped by the service. When both audio tracks are played side-by-side locally their timing is correct.

Is there a way to get the correct offsets into the file? Is there an option I'm missing or a piece of data that I should be retrieving to "fix-up" these timestamps?

Answer

That could indicate a problem with the timestamps in the source file, or the presence of an "edit" atom (MP4 format thing!) that introduces an offset into the timestamps of the MP4 file. The only way we can confirm is to look at the source file and see what happened.
Is this happening on a lot of your content, or just a few files? What is the source of the files that show this issue?

To have someone on our team take a look at the file, please file a support ticket in the Azure Portal from inside your Media Services account. You will then be able to upload the source file so the support team can take a closer look.

Answer

From what I can find, this is not a common occurrence. The audio tracks come from Twilio:

Metadata:
encoder : GStreamer matroskamux version 1.8.1.1
creation_time : ...
Duration: 00:37:41.63, start: 0.528000, bitrate: 36 kb/s
Stream #0:0(eng): Audio: opus, 48000 Hz, stereo, fltp (default)
Metadata:
title : Audio

Reviewing the Opus data, it looks like there is a blank "gap" early in the track, which would account for that offset:

The first cluster begins at 0.528 as you'd expect
The first cluster ends, without any large gaps, at 33.277
The second cluster begins at 33.298
There is a gap between the simple blocks with timestamps 34.298 and 49.559
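
A check like the one above can be scripted once the block timestamps are available as a list of seconds (for instance, taken from a packet dump such as ffprobe's `-show_packets` output). The following is a rough sketch, not how any particular tool does it: `find_gaps` and the 0.5-second threshold are my own assumptions, and apart from 34.298 and 49.559 the sample timestamps are made up for illustration.

```python
from typing import List, Tuple


def find_gaps(
    timestamps: List[float], threshold: float = 0.5
) -> List[Tuple[float, float, float]]:
    """Return (previous, next, delta) for consecutive timestamps that are
    further apart than `threshold` seconds."""
    gaps = []
    for previous, current in zip(timestamps, timestamps[1:]):
        delta = current - previous
        if delta > threshold:
            gaps.append((previous, current, round(delta, 3)))
    return gaps


# Illustrative block timestamps in seconds. Only 34.298 and 49.559 come from
# the file discussed here; the neighbours assume 20 ms frames.
sample = [34.258, 34.278, 34.298, 49.559, 49.579]

for previous, current, delta in find_gaps(sample):
    print(f"gap of {delta} s between {previous} and {current}")
```

With the two timestamps from this file, the gap comes out at roughly 15.26 seconds.
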

When I extract the Opus stream from the WebM container myself using FFmpeg, the gaps appear to be cut out and, lo and behold, I get the same audio offset as Azure Media Services. Common software? Common failure? Not sure.
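
If the gaps really are being dropped, one possible client-side "fix-up" is to add the removed time back onto each transcript offset. This is a sketch under several assumptions: `to_source_time` is a hypothetical helper, the list of gaps is known in source-file time, and both timelines share the same zero point (the 0.528-second stream start is ignored). The single gap below is the one found in this file and covers only about 15 of the ~20 seconds, so the gap list is illustrative rather than complete.

```python
from typing import List, Tuple


def to_source_time(t: float, gaps: List[Tuple[float, float]]) -> float:
    """Map a time on the gap-free timeline back to the source timeline.

    `gaps` holds (start, end) pairs in source time and is assumed to be
    non-overlapping.
    """
    removed = 0.0
    for start, end in sorted(gaps):
        # A gap sits at (start - removed) on the gap-free timeline, so it
        # precedes t exactly when t + removed >= start.
        if t + removed >= start:
            removed += end - start
        else:
            break
    return t + removed


# The one gap identified in this file, in seconds of source time.
known_gaps = [(34.298, 49.559)]

# Hypothetical transcript offsets on the gap-free timeline.
print(to_source_time(30.0, known_gaps))  # before the gap: unchanged
print(round(to_source_time(40.0, known_gaps), 3))  # after the gap: shifted
```

Offsets before a gap are left alone, and offsets after it are shifted by the total duration of all gaps that precede them.
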

I'll open a ticket, thank you!
