Fast Transcription API not achieving expected concurrent processing throughput on Standard Tier

Question

Tony Su 0 Reputation points

I'm using Azure's Fast Transcription API on a Standard Tier subscription, which according to the documentation should support up to 100 concurrent requests. However, in my experiments, I'm not seeing throughput scale proportionally with increased concurrency.

Result: [image of the test results, not included]

I conducted fast transcription requests using 5-minute audio files under different levels of concurrency.

Observations:

My subscription tier: Standard (100 concurrent requests limit)
Expected behavior: Throughput should increase with more concurrent requests
Actual behavior: Throughput does not scale up as expected when sending multiple concurrent requests

Question: Does the Fast Transcription API actually support true concurrent processing at the endpoint level? Or are there backend limitations that prevent achieving the documented concurrency limits in practice?

Answer 1

Amira Bedhiafi 41,966 MVP Volunteer Moderator

Hello Tony!

Thank you for posting on Microsoft Learn Q&A.

Fast Transcription on the shared pool is queue-based, and each tenant has a smaller backend parallelism cap per region. The endpoint admits many jobs, but the region's shared workers run only N jobs per tenant at a time; extra jobs wait.

Standard uses shared capacity, and parallelism fluctuates with regional load.

So what can you do?

Try creating 2–3 Speech resources in different supported regions and round-robin jobs across them. Also let Speech pull the audio from Storage instead of streaming the bytes, with the storage account in the same region as the Speech resource.
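
A minimal sketch of the round-robin part of this suggestion, assuming three Speech resources. The endpoint URLs are hypothetical placeholders, and the sketch only decides which resource each file goes to; it does not send anything.

```python
from itertools import cycle

# Hypothetical endpoints: replace with your own Speech resources.
ENDPOINTS = [
    "https://eastus.example.invalid/speech",
    "https://westeurope.example.invalid/speech",
    "https://southeastasia.example.invalid/speech",
]

def assign_round_robin(files, endpoints):
    """Pair each audio file with an endpoint, cycling through endpoints in order."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    # zip stops when the file list is exhausted, so cycle() never runs forever.
    return list(zip(files, cycle(endpoints)))

files = [f"call_{i:03d}.wav" for i in range(7)]
plan = assign_round_robin(files, ENDPOINTS)
for name, endpoint in plan:
    print(f"{name} -> {endpoint}")
```

Plain rotation ignores how busy each region is; if one region is consistently slower, a per-resource in-flight limit may balance better.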

Then split the audio into 60 to 120 s segments, process them in parallel, and merge the transcripts afterward.
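
One way to do the splitting step, sketched with Python's standard `wave` module and assuming uncompressed PCM WAV input. The function name `split_wav`, the 90 second default, and the output naming are illustrative choices, not part of any Azure SDK.

```python
import wave

def split_wav(src_path, segment_seconds=90, prefix="segment"):
    """Split a PCM WAV file into fixed-length segments.

    Returns a list of (segment_path, start_offset_seconds) tuples.
    """
    segments = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_segment = int(segment_seconds * src.getframerate())
        index = 0
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break  # end of the source file
            out_path = f"{prefix}_{index:04d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy channel count, sample width and rate; the frame count
                # in the header is corrected when the file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            segments.append((out_path, index * segment_seconds))
            index += 1
    return segments
```

The sketch cuts at fixed offsets, so a word can be split across a boundary. The returned start offsets are what you would add to each segment's timestamps when merging the transcripts.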

You can use mono PCM WAV (16 kHz) to avoid the decode overhead of compressed formats. Also, instead of firing 50 jobs at once, keep a moving window of K in-flight jobs.
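
The moving window can be sketched with an `asyncio.Semaphore`. Here `transcribe` is a hypothetical stand-in that only sleeps; in practice it would wrap your actual request. `k` is the window size you would tune.

```python
import asyncio

async def transcribe(path):
    """Placeholder for one transcription request (hypothetical)."""
    await asyncio.sleep(0.01)  # stands in for the HTTP round trip
    return f"transcript of {path}"

async def run_with_window(paths, k):
    """Run transcribe() over paths with at most k requests in flight."""
    semaphore = asyncio.Semaphore(k)
    in_flight = 0
    peak = 0

    async def worker(path):
        nonlocal in_flight, peak
        async with semaphore:  # blocks while k requests are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await transcribe(path)
            finally:
                in_flight -= 1

    # gather() returns results in the same order as the input paths.
    results = await asyncio.gather(*(worker(p) for p in paths))
    return results, peak

results, peak = asyncio.run(run_with_window([f"a{i}.wav" for i in range(20)], k=4))
print(len(results), "done, peak in-flight:", peak)
```

Raising `k` step by step while watching per-request latency is one way to find the point where extra concurrency stops helping.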

Then, for each job, log the time from created to running (queue time) and from running to succeeded (processing time) to confirm whether the limiter is queueing or compute.

Tony Su 0 Reputation points

2025-11-12T15:38:22.8833333+00:00

Thanks for your reply! I still have some questions, though.

"Try creating 2–3 Speech resources in different supported regions and round-robin jobs across them. Also let Speech pull the audio from Storage instead of streaming the bytes, with the storage account in the same region as the Speech resource."

Isn't the pull-from-storage method primarily designed for Azure's Batch Transcription API? If it's possible to use this with the Fast Transcription API, wouldn't the process of uploading files to storage introduce extra latency? My understanding was that the key benefit of the Fast Transcription API is its ability to directly transcribe raw audio files without the batch processing overhead required for storage-based methods.

"Then split the audio into 60 to 120 s segments, process them in parallel, and merge the transcripts afterward."

Most ASR models benefit from having larger context windows, and as far as I know, the Fast Transcription API supports audio files up to 2 hours long. What's the reasoning behind splitting files into 1-2 minute chunks for better throughput, rather than sending the complete audio file directly?

"Then, for each job, log the time from created to running (queue time) and from running to succeeded (processing time) to confirm whether the limiter is queueing or compute."

I haven't been able to find documentation on how to monitor the status of Fast Transcription tasks to distinguish between queued and processing states. Could you point me to where I can find this information? This would be really helpful for diagnosing whether the bottleneck is in queuing or actual compute resources.

Thanks again for your help!

Answer 2

Hi Tony Su,

Thank you for the detailed follow-up. I completely understand the frustration, especially when the service behavior doesn't match the expected concurrency numbers. Let me clarify each of your questions and explain what is actually happening under the hood with Fast Transcription so you can move forward with accurate expectations.

Is pull-from-storage really applicable to Fast Transcription?

You're absolutely right: the pull-from-storage workflow is designed primarily for Batch Transcription, not Fast Transcription.

Fast Transcription can technically read from storage URLs, but:

It still processes the audio as a fast job, not batch

It does not give the same high concurrency as batch pull-based workloads

Uploading to storage does introduce extra latency, as you mentioned

So yes, your understanding is correct. Pull-from-storage is not a recommended pattern to improve Fast Transcription throughput.

If your requirement is high throughput and predictable concurrency scaling, Azure Speech recommends Batch Transcription, not Fast Transcription.

Fast Transcription is optimized for low-latency single-job processing, not high parallelism.

Why split audio into 60–120 second chunks?

This recommendation is not about improving model accuracy; it's about working within the per-tenant backend queue limits.

Here’s the important clarification:

Even though the Fast Transcription API advertises up to 100 concurrent requests, the actual backend compute available per tenant per region is much smaller, and the rest of the jobs wait in queue.

Large files (e.g., 5 minutes or more) occupy backend workers for longer, which reduces effective throughput.

Splitting into 60–120 second segments helps because:

Jobs finish faster and free workers sooner

Queue churn is higher, improving total throughput

Parallelism becomes more efficient

Retry logic and failure handling are easier

You avoid long queue times accumulating into one long job

But if context window continuity is important for your ASR use case, segmenting may not be ideal, and preferring full-file transcription is perfectly valid.

How do I monitor queue vs. processing state?

You're right: the Fast Transcription API does not expose the same detailed job lifecycle as Batch Transcription.

Fast Transcription jobs expose:

created

running

succeeded / failed

But they do NOT differentiate queue time vs. compute time, unlike batch jobs.

This is why it's difficult to analyze queuing behavior with Fast Transcription — and the frustration you're seeing is understandable.

We typically recommend:

Measuring client-side latency

Logging job creation timestamp vs. job completion time

Correlating outliers to infer queue delays

There is no dedicated API to show internal queue metrics for Fast Transcription.
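
A rough sketch of that client-side approach, assuming you time each request yourself. `timed_call` and `flag_outliers` are hypothetical helpers, the 2x-median threshold is an arbitrary choice, and the sample latencies are made up for illustration, not measurements from the service.

```python
import statistics
import time

def timed_call(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds) measured on the client."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def flag_outliers(latencies, factor=2.0):
    """Return indices of latencies more than `factor` times the median."""
    median = statistics.median(latencies)
    return [i for i, value in enumerate(latencies) if value > factor * median]

# Illustrative numbers only (seconds per request).
sample = [12.1, 11.8, 12.4, 30.2, 12.0, 27.9]
print(flag_outliers(sample))  # -> [3, 5]
```

If the flagged requests cluster at the times when the most requests were in flight, that points toward queueing rather than slow compute, though it remains an inference.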

The core issue: Fast Transcription concurrency numbers

This is the key point most customers are unaware of:

The documented limit (e.g., 100 concurrent requests) refers to admission concurrency, not compute concurrency.

Meaning:

Yes, you can send 100 jobs at once

But backend compute workers per tenant per region are capped

Jobs beyond that cap will wait in queue, which is exactly what you're seeing

This is why your throughput does not scale linearly even though concurrency limits appear higher.

You’re hitting the tenant-level backend cap, not the documented public limit.
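
A toy model of the difference, assuming equal-length jobs and a fixed per-tenant worker cap. The cap of 8 and the 30 second job time are made-up values; the real cap is not stated in this thread.

```python
def jobs_per_minute(concurrency, worker_cap, job_seconds):
    """Steady-state throughput when only worker_cap jobs run at once."""
    running = min(concurrency, worker_cap)  # jobs beyond the cap just wait
    return running * 60.0 / job_seconds

# worker_cap=8 and job_seconds=30 are illustrative assumptions.
for c in (1, 4, 8, 16, 50, 100):
    print(c, jobs_per_minute(c, worker_cap=8, job_seconds=30))
```

In this model throughput grows until concurrency reaches the cap and is flat afterwards, which matches the plateau described above.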

Thank you!

Tony Su 0 Reputation points

2025-11-14T13:37:28.98+00:00

Thank you for the clarification — that really helped.

I now understand that the throughput is limited by a tenant-level backend cap.

I’d like to check whether this cap is negotiable or if it can be increased.

Our product may require a higher concurrency level, and we’re very interested in using Azure’s services for this workload.

If possible, could you advise on the available options for raising the limit or upgrading our plan to support higher throughput?

SRILAKSHMI C 18,745 Reputation points Microsoft External Staff Moderator

2025-11-18T17:43:12.6833333+00:00

Hi Tony Su,

Did you get a chance to review the above response? Please let me know if you have any further queries.

Thank you!

SRILAKSHMI C 18,745 Reputation points Microsoft External Staff Moderator

2025-11-19T14:08:12.2066667+00:00

Hi Tony Su,

Just checking in to see if you had a chance to review my above response. Please let me know if you have any additional questions; I'm happy to help.

Thank you!
