@Alexander G With respect to speech-to-text performance, there is no baseline figure for how long it takes to process the audio and return the text response, because many factors affect the response time, including the audio quality, the network bandwidth, whether the SDK or the REST API is used, and the pricing tier of the resource.
However, the FAQ lists a few guidelines that help with performance, and in most cases, including transcription scenarios, the response is fast. Batch transcription jobs are scheduled on a best-effort basis. You cannot estimate when a job will move into the running state, but under normal system load it should happen within minutes. Once a job is in the running state, transcription proceeds faster than the real-time playback speed of the audio.
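Because the start time of a batch job cannot be predicted, it is usually safer to poll the job status with a timeout than to assume a fixed delay. Below is a minimal sketch of that idea. The `get_status` callable is a hypothetical placeholder for whatever SDK or REST call you use to read the job status, and the status strings, polling interval, and timeout are assumptions to adjust for your own setup.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    get_status: Callable[[], str],           # hypothetical: returns the job's current status string
    done_states: Iterable[str] = ("Succeeded", "Failed"),  # assumed terminal status names
    poll_interval_s: float = 30.0,           # assumed interval between status checks
    timeout_s: float = 3600.0,               # assumed upper bound on total waiting time
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it returns a terminal state or the timeout expires."""
    done = set(done_states)
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in done:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still in state {status!r} after {timeout_s} s")
        # Not finished yet: wait before asking again.
        sleep(poll_interval_s)
```

You would pass in a function that queries your transcription job, for example `wait_for_status(lambda: my_client_get_status(job_id))`, where `my_client_get_status` is your own wrapper around the status request. The `sleep` and `clock` parameters exist only to make the loop easy to test without real waiting.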
I hope this helps.