Hi Tony Su,
Thank you for the detailed follow-up. I completely understand the frustration, especially when the service behavior doesn’t match the expected concurrency numbers. Let me address each of your questions and explain what is actually happening under the hood with Fast Transcription, so you can move forward with accurate expectations.
Is pull-from-storage really applicable to Fast Transcription?
You're absolutely right: the primary workflow for pull-from-storage is designed for Batch Transcription, not Fast Transcription.
Fast Transcription can technically read from storage URLs, but:
- It still processes the audio as a fast job, not batch
- It does not give the same high concurrency as batch pull-based workloads
- Uploading to storage does introduce extra latency, as you mentioned
So yes, your understanding is correct. Pull-from-storage is not a recommended pattern for improving Fast Transcription throughput.
If your requirement is high throughput and predictable concurrency scaling, Azure Speech recommends Batch Transcription, not Fast Transcription.
Fast Transcription is optimized for low-latency single-job processing, not high parallelism.
Why split audio into 60–120 second chunks?
This recommendation is not about improving model accuracy; it’s about working within the backend queue limits per tenant.
Here’s the important clarification:
Even though the Fast Transcription API advertises up to 100 concurrent requests, the actual backend compute available per tenant per region is much smaller, and the rest of the jobs wait in queue.
Large files (e.g., 5 minutes or more) occupy backend workers for longer, which reduces effective throughput.
Splitting into 60–120 second segments helps because:
- Jobs finish faster and free workers sooner
- Queue churn is higher, improving total throughput
- Parallelism becomes more efficient
- Retry logic and failure handling are easier
- You avoid long queue times accumulating into one long job
But if context window continuity is important for your ASR use case, then segmenting may not be ideal, and it’s perfectly valid to prefer full-file transcription.
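If you do decide to segment, the cut points for the 60–120 second range mentioned above can be computed with a small helper like the one below. This is only a sketch: `segment_bounds` and the 90-second default are my own illustrative choices, not part of the Speech SDK, and it cuts at fixed offsets rather than at silence, so words at a boundary may be split.

```python
def segment_bounds(duration_s: float, chunk_s: float = 90.0) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole file.

    Hypothetical helper: chunk_s defaults to 90 s, the middle of the
    60-120 s range discussed above. The last segment may be shorter.
    """
    duration_s = float(duration_s)
    chunk_s = float(chunk_s)
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")

    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


# A 5-minute file split into 90-second segments.
print(segment_bounds(300, 90))
# [(0.0, 90.0), (90.0, 180.0), (180.0, 270.0), (270.0, 300.0)]
```

The offsets can then be handed to whatever audio tool you already use to cut the file before submitting each segment.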
How do I monitor queue vs. processing state?
You’re right: the Fast Transcription API does not expose the same detailed job lifecycle as Batch Transcription.
Fast Transcription jobs expose:
- created
- running
- succeeded / failed
But they do NOT differentiate queue time vs. compute time, unlike batch jobs.
This is why it's difficult to analyze queuing behavior with Fast Transcription, and the frustration you're seeing is understandable.
We typically recommend:
- Measuring client-side latency
- Logging job creation timestamp vs. job completion time
- Correlating outliers to infer queue delays
There is no dedicated API to show internal queue metrics for Fast Transcription.
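The client-side measurement described above could look roughly like this. It is a minimal sketch, not an official tool: `submit` is a placeholder for whatever function sends your transcription request, and treating anything above twice the median as an outlier is an arbitrary threshold you would want to tune against your own data.

```python
import statistics
import time
from typing import Callable


def timed_call(submit: Callable[[], object]) -> float:
    """Run one request and return its wall-clock latency in seconds.

    `submit` is a hypothetical stand-in for your own request function.
    """
    start = time.monotonic()
    submit()
    return time.monotonic() - start


def flag_outliers(latencies: list[float], factor: float = 2.0) -> list[int]:
    """Return indices of requests slower than factor * median latency.

    For similar-length audio, such outliers are candidates for time
    spent waiting in the queue (an inference, not a measured value).
    """
    if not latencies:
        return []
    median = statistics.median(latencies)
    return [i for i, value in enumerate(latencies) if value > factor * median]


# Example with made-up latencies (seconds) for five similar files.
sample = [4.1, 3.9, 4.3, 12.7, 4.0]
print(flag_outliers(sample))  # [3]
```

Comparing only files of similar duration keeps the median meaningful, since longer audio takes longer to process regardless of queuing.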
The core issue: Fast Transcription concurrency numbers
This is the key point most customers are unaware of:
The documented limit (e.g., 100 concurrent requests) refers to admission concurrency, not compute concurrency.
Meaning:
- Yes, you can send 100 jobs at once
- But backend compute workers per tenant per region are capped
- Jobs beyond that cap will wait in queue, which is exactly what you're seeing
This is why your throughput does not scale linearly even though concurrency limits appear higher.
You’re hitting the tenant-level backend cap, not the documented public limit.
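To make the difference between admission and compute concurrency concrete, here is a back-of-the-envelope model. The worker counts and the 30-second job time are invented numbers for illustration only, since the actual per-tenant cap is not published, and the model assumes every job takes the same time.

```python
import math


def estimated_makespan(jobs: int, workers: int, job_seconds: float) -> float:
    """Estimate total time to finish `jobs` equal-length jobs.

    Simplified model: all jobs are admitted at once, but only `workers`
    of them run at a time; the rest wait in the queue.
    """
    if jobs < 0 or workers <= 0 or job_seconds < 0:
        raise ValueError("jobs >= 0, workers > 0, job_seconds >= 0 required")
    waves = math.ceil(jobs / workers)
    return waves * job_seconds


# 100 jobs admitted at once, 30 s each, under hypothetical worker caps.
for workers in (100, 20, 5):
    total = estimated_makespan(100, workers, 30.0)
    print(f"{workers:>3} workers -> {total:.0f} s")
# 100 workers -> 30 s
#  20 workers -> 150 s
#   5 workers -> 600 s
```

In this model, sending more requests beyond the worker cap raises queue time rather than throughput, which matches the non-linear scaling you observed.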
If this response was helpful, could you please take a moment to accept it as the answer and retake the survey? Your feedback is greatly appreciated.
Thank you!