Which characteristics of the audio file impact the time needed for its batch transcription?

Vittorio Tison 51 Reputation points
2021-04-11T08:21:52.857+00:00

Hello everybody,

I have just discovered the speech-to-text APIs and I'm amazed by them.
I use a Power Automate flow to transcribe the audio files.

I have noticed, though, that the time needed for transcription varies considerably. A fresh example: an audio file POSTed for transcription with a length of 1:14:36 (mono) was successfully transcribed in 01:19:17, whereas a file lasting 03:14:59 (mono) has been running for 11:22:05 (and counting).
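
To make the comparison concrete, the two timings above can be expressed as a ratio of elapsed time to audio length. This is only a small arithmetic sketch; the helper names `to_seconds` and `processing_ratio` are made up for illustration, and the second figure is a lower bound because that job was still running.

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def processing_ratio(audio_length: str, elapsed: str) -> float:
    """Elapsed wall-clock time divided by the audio length."""
    return to_seconds(elapsed) / to_seconds(audio_length)


# Figures taken from the two jobs described above.
first = processing_ratio("1:14:36", "01:19:17")
second = processing_ratio("03:14:59", "11:22:05")

print(f"first file:  {first:.2f}x the audio length")
print(f"second file: {second:.2f}x the audio length (still running)")
```

The first job took roughly 1.06 times the audio length, while the second had already reached about 3.5 times and was not finished.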

The parameters provided are always the same, namely:

"locale": "pl-PL",
"properties": {
  "wordLevelTimestampsEnabled": "true",
  "diarizationEnabled": "true",
  "profanityFilterMode": "none",
  "punctuationMode": "DictatedAndAutomatic"
}

So my question is: where can I find information on what impacts the speed of the transcription?

Thank you in advance!

Best regards,

Vittorio

Addendum: could it be an issue with diarization and possibly more than 2 speakers (the audio files are usually extracted from meetings)?

Azure AI Speech
An Azure service that integrates speech processing into apps and services.
Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

1 answer

  1. romungi-MSFT 41,866 Reputation points Microsoft Employee
    2021-04-12T08:02:18.113+00:00

    @Vittorio Tison It looks like the transcription of your second file might be stuck in the processing state. Batch transcription jobs are scheduled on a best-effort basis, so we cannot estimate when a job will change into the running state, but under normal system load it should happen within minutes. Once in the running state, the transcription proceeds faster than the audio's real-time playback speed. I think this transcription might eventually go into the failed state; alternatively, you can request that it be terminated by raising a support issue.

    With respect to diarization, the batch transcription API currently supports only 2 speakers without enrollment. The API could be extended to support more speakers in the future. If the audio contains more than two speakers, the file will not be processed correctly for now; in that case, you can disable diarization and try again. Thanks.
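
To retry without diarization, the request body from the question could be rebuilt with `diarizationEnabled` switched off. The sketch below assumes the v3.0 batch transcription request shape (`contentUrls`, `displayName`, `locale`, `properties`) and uses JSON booleans rather than the quoted `"true"` strings shown in the question. The function name `build_request_body`, the display name, and the example URL are placeholders.

```python
import json


def build_request_body(content_url: str, diarization: bool = False) -> dict:
    """Build a batch transcription request body with the settings from
    the question, toggling only diarization."""
    return {
        "contentUrls": [content_url],
        "displayName": "meeting transcription",  # placeholder name
        "locale": "pl-PL",
        "properties": {
            "wordLevelTimestampsEnabled": True,
            "diarizationEnabled": diarization,
            "profanityFilterMode": "None",
            "punctuationMode": "DictatedAndAutomatic",
        },
    }


# Placeholder URL; in practice this would point at the uploaded audio file.
body = build_request_body("https://example.com/meeting.wav")
print(json.dumps(body, indent=2))
```

In a Power Automate flow, the same change amounts to setting `diarizationEnabled` to false in the HTTP action's body and resubmitting the file.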