Which characteristics of the audio file impact the time needed for its batch transcription?

Question

Hello everybody,

I have just discovered the speech-to-text APIs and I'm amazed by them.
I use a Power Automate flow to transcribe the audio files.

I have noticed, though, that the time needed for transcription varies considerably. A fresh example: an audio file POSTed for transcription with a length of 1:14:36 (mono) was successfully transcribed in 01:19:17, whereas a file lasting 03:14:59 (mono) has been running for 11:22:05 (and counting).

The parameters provided are always the same, namely:

    "locale": "pl-PL",
    "properties": {
        "wordLevelTimestampsEnabled": "true",
        "diarizationEnabled": "true",
        "profanityFilterMode": "none",
        "punctuationMode": "DictatedAndAutomatic"
    }

So my question is: where can I find information on what impacts the speed of the transcription?

Thank you in advance!

Best regards,

Vittorio

Addendum: could it be an issue with diarization and possibly more than 2 speakers (the audio files are usually extracted from meetings)?

Answer

@Vittorio Tison It looks like your second transcription might be stuck in the processing state. Batch transcription jobs are scheduled on a best-effort basis, so we cannot estimate when a job will change into the running state, but under normal system load it should happen within minutes. Once in the running state, transcription proceeds faster than the audio's playback speed. I think this transcription might eventually go into the failed state; alternatively, you can request that it be terminated by raising a support issue.
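As a rough sanity check, you can compare the elapsed time with the audio length for the two files from the question. The sketch below is only illustrative arithmetic; the helper names `to_seconds` and `elapsed_ratio` are made up for this example and are not part of any Speech SDK. Note that the elapsed time also includes any time the job spent waiting to be scheduled, so the ratio is an upper bound on the actual processing speed.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' or 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def elapsed_ratio(audio_length: str, elapsed: str) -> float:
    """Elapsed wall-clock time divided by audio length.

    Values above 1.0 mean the job took longer than playing the audio would.
    """
    return to_seconds(elapsed) / to_seconds(audio_length)


# Figures taken from the question.
first = elapsed_ratio("1:14:36", "01:19:17")
second = elapsed_ratio("03:14:59", "11:22:05")

print(f"first file:  {first:.2f}x audio length")
print(f"second file: {second:.2f}x audio length (still running)")
```

The first job finished in roughly 1.06 times the audio length, while the second is already at about 3.5 times and still counting, which is what suggests it is stuck rather than just slow.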

With respect to diarization: with the batch transcription API it is currently only possible for 2 speakers without enrollment. The API could be extended to support more speakers in the future. If the audio contains more than two speakers, the API would not be able to process the file correctly. In that case, you can disable diarization and try again. Thanks.
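For the retry, a minimal sketch of how the request fragment from the question could be adjusted is shown below. It only builds the JSON body and does not call the service. The helper name `without_diarization` is hypothetical, and the sketch keeps the string values ("true"/"false") used in the question, which is an assumption; the API may expect JSON booleans instead.

```python
import copy
import json

# Request fragment as posted in the question.
original_request = {
    "locale": "pl-PL",
    "properties": {
        "wordLevelTimestampsEnabled": "true",
        "diarizationEnabled": "true",
        "profanityFilterMode": "none",
        "punctuationMode": "DictatedAndAutomatic",
    },
}


def without_diarization(request: dict) -> dict:
    """Return a copy of the request with diarization switched off.

    Assumption: the string form "false" mirrors the question's "true";
    switch to a JSON boolean if the service requires one.
    """
    updated = copy.deepcopy(request)
    updated.setdefault("properties", {})["diarizationEnabled"] = "false"
    return updated


retry_request = without_diarization(original_request)
print(json.dumps(retry_request, indent=2))
```

All other parameters stay unchanged, so if the retried job completes normally, diarization is the likely cause of the delay.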
