Fast Transcription API not achieving expected concurrent processing throughput on Standard Tier

Tony Su 0 Reputation points
2025-11-12T10:12:14.07+00:00

I'm using Azure's Fast Transcription API on a Standard Tier subscription, which according to the documentation should support up to 100 concurrent requests. However, in my experiments, I'm not seeing throughput scale proportionally with increased concurrency.

Result:

User's image

I conducted fast transcription requests using 5-minute audio files under different levels of concurrency.

Observations:

  • My subscription tier: Standard (100 concurrent requests limit)
  • Expected behavior: Throughput should increase with more concurrent requests
  • Actual behavior: Throughput does not scale up as expected when sending multiple concurrent requests

Question: Does the Fast Transcription API actually support true concurrent processing at the endpoint level? Or are there backend limitations that prevent achieving the documented concurrency limits in practice?

Azure Speech in Foundry Tools

2 answers

  1. Amira Bedhiafi 41,966 Reputation points MVP Volunteer Moderator
    2025-11-12T13:35:15.2+00:00

    Hello Tony!

    Thank you for posting on Microsoft Learn Q&A.

    Fast Transcription on the shared pool is queue-based, and each tenant has a smaller backend parallelism cap per region. The endpoint admits many jobs, but the region's shared workers run only N jobs per tenant at a time; extra jobs wait.

    Standard uses shared capacity and parallelism fluctuates with regional load.

    So what can you do?

    Try creating 2–3 Speech resources in different supported regions and round-robin jobs across them. Let Speech pull the audio from Storage instead of streaming the bytes, and keep the storage account in the same region as the Speech resource.
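A minimal sketch of the round-robin idea, assuming you already have several Speech resources. The resource labels and file names below are hypothetical placeholders, and the function only decides which resource each file goes to; it does not call the service.

```python
from itertools import cycle

# Hypothetical labels; substitute your own Speech resources (one per region).
RESOURCES = ["speech-eastus", "speech-westeurope", "speech-southeastasia"]


def assign_round_robin(files, resources):
    """Pair each file with the next resource in rotation."""
    rotation = cycle(resources)
    return [(audio, next(rotation)) for audio in files]


assignments = assign_round_robin([f"call_{i}.wav" for i in range(5)], RESOURCES)
for audio, resource in assignments:
    print(audio, "->", resource)
```

If each resource has its own per-tenant cap, spreading jobs this way should raise the combined ceiling, but that is something to verify by measurement.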

    Then split the audio into 60 to 120 second segments, process them in parallel, and merge the transcripts afterward.
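The split-and-merge step could look roughly like the sketch below. It only computes segment boundaries and stitches text back together in time order; the actual audio cutting and the transcription call are left out. The 90-second default and both function names are illustrative assumptions.

```python
def segment_bounds(duration_s, segment_s=90.0):
    """Return (start, end) offsets in seconds that cover the whole file."""
    if duration_s <= 0 or segment_s <= 0:
        raise ValueError("duration and segment length must be positive")
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


def merge_transcripts(segments):
    """segments: iterable of (start_offset_s, text). Join text in time order."""
    ordered = sorted(segments, key=lambda item: item[0])
    return " ".join(text.strip() for _, text in ordered if text.strip())


# A 5-minute file cut into 90-second pieces gives four segments.
bounds = segment_bounds(300.0, 90.0)
print(bounds)
print(merge_transcripts([(90.0, "second part"), (0.0, "first part ")]))
```

Cutting at fixed offsets can split a word or sentence in half, so in practice you may prefer to cut at silences or overlap segments slightly.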

    Use mono PCM WAV (16 kHz) to avoid decode overhead from compressed formats. Instead of firing 50 jobs at once, keep a moving window of K in-flight jobs.

    Then, for each job, log the time from created to running (queue time) and from running to succeeded (processing time) to confirm whether the limiter is queueing or compute.
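One way to keep a moving window of K in-flight jobs is a semaphore around each request. This is a sketch, not the service's API: `fake_transcribe` is a stand-in that only sleeps, and you would replace it with your own request code. The counter exists only to show that the window holds.

```python
import asyncio


async def run_with_window(items, worker, k):
    """Run worker(item) for every item with at most k calls in flight."""
    gate = asyncio.Semaphore(k)

    async def guarded(item):
        async with gate:
            return await worker(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))


# Bookkeeping to observe how many calls overlap (illustration only).
in_flight = 0
peak = 0


async def fake_transcribe(name):
    """Hypothetical stand-in for a real transcription request."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return f"transcript of {name}"


files = [f"a{i}.wav" for i in range(20)]
results = asyncio.run(run_with_window(files, fake_transcribe, k=4))
print(len(results), "done; peak in flight:", peak)
```

Raising K step by step while watching total throughput is a simple way to find the point where extra concurrency stops helping.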

  2. SRILAKSHMI C 18,745 Reputation points Microsoft External Staff Moderator
    2025-11-14T13:16:17.7066667+00:00

    Hi Tony Su,

    Thank you for the detailed follow-up. I completely understand the frustration, especially when the service behavior doesn’t match the expected concurrency numbers. Let me clarify each of your questions and explain what is actually happening under the hood with Fast Transcription so you can move forward with accurate expectations.

    Is pull-from-storage really applicable to Fast Transcription?

    You're absolutely right: the primary workflow for pull-from-storage is designed for Batch Transcription, not Fast Transcription.

    Fast Transcription can technically read from storage URLs, but:

    • It still processes the audio as a fast job, not batch
    • It does not give the same high concurrency as batch pull-based workloads
    • Uploading to storage does introduce extra latency, as you mentioned

    So yes, your understanding is correct. Pull-from-storage is not a recommended pattern to improve Fast Transcription throughput.

    If your requirement is high throughput + predictable concurrency scaling, Azure Speech recommends Batch Transcription, not Fast Transcription.

    Fast Transcription is optimized for low-latency single-job processing, not high parallelism.

    Why split audio into 60–120 second chunks?

    This recommendation is not about improving model accuracy — it’s about controlling the backend queue limits per tenant.

    Here’s the important clarification:

    Even though the Fast Transcription API advertises up to 100 concurrent requests, the actual backend compute available per tenant per region is much smaller, and the rest of the jobs wait in queue.

    Large files (e.g., 5 minutes or more) occupy backend workers for longer, which reduces effective throughput.

    Splitting into 60–120 sec segments helps because:

    • Jobs finish faster and free workers sooner
    • Queue churn is higher, improving total throughput
    • Parallelism becomes more efficient
    • Retry logic and failure handling are easier
    • You avoid long queue times accumulating into one long job

    But if context window continuity is important for your ASR use case, then segmenting may not be ideal — and it’s perfectly valid that you prefer full-file transcription.

    How do I monitor queue vs. processing state?

    You’re right: the Fast Transcription API does not expose the same detailed job lifecycle as Batch Transcription.

    Fast Transcription jobs expose:

    • created
    • running
    • succeeded / failed

    But they do NOT differentiate queue time vs. compute time, unlike batch jobs.

    This is why it's difficult to analyze queuing behavior with Fast Transcription — and the frustration you're seeing is understandable.

    We typically recommend:

    • Measuring client-side latency
    • Logging job creation timestamp vs. job completion time
    • Correlating outliers to infer queue delays

    There is no dedicated API to show internal queue metrics for Fast Transcription.
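A rough sketch of the client-side approach: time each call, then flag the requests whose latency sits far above the median. The 2x threshold and the sample numbers are arbitrary assumptions for illustration, not measurements. A flagged request is only a hint of queueing, since a slow network or a harder audio file would look the same.

```python
import statistics
import time


def timed_call(fn, *args):
    """Return (result, elapsed_seconds) for one blocking call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def flag_outliers(latencies, factor=2.0):
    """Return indices of latencies above factor times the median."""
    median = statistics.median(latencies)
    return [i for i, seconds in enumerate(latencies) if seconds > factor * median]


# Illustrative latencies in seconds for six requests (made-up values).
sample = [21.0, 20.5, 22.1, 48.3, 21.7, 55.0]
outliers = flag_outliers(sample)
print("possible queue delays at request indices:", outliers)
```

Running the same measurement at several concurrency levels makes the pattern easier to read: if median latency grows with concurrency while single-request latency stays flat, waiting time is the likely cause.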

    The core issue: Fast Transcription concurrency numbers

    This is the key point most customers are unaware of:

    The documented limit (e.g., 100 concurrent requests) refers to admission concurrency, not compute concurrency.

    Meaning:

    • Yes, you can send 100 jobs at once
    • But backend compute workers per tenant per region are capped
    • Jobs beyond that cap will wait in queue, which is exactly what you're seeing

    This is why your throughput does not scale linearly even though concurrency limits appear higher.

    You’re hitting the tenant-level backend cap, not the documented public limit.
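To make the plateau concrete, here is a toy model of a capped worker pool. The cap of 8 workers and 20 seconds per job are invented numbers; the real per-tenant cap is not published, so treat this as a way to reason about measurements rather than a description of the service.

```python
def estimated_throughput(concurrency, backend_cap, seconds_per_job):
    """Jobs completed per minute when only backend_cap jobs run at once."""
    active = min(concurrency, backend_cap)
    return active * 60.0 / seconds_per_job


def implied_cap(plateau_jobs_per_min, seconds_per_job):
    """Invert the model: estimate the cap from a measured plateau."""
    return plateau_jobs_per_min * seconds_per_job / 60.0


# Hypothetical values: cap of 8 workers, 20 s to process one 5-minute file.
for n in (1, 4, 8, 16, 100):
    print(n, "concurrent ->", estimated_throughput(n, 8, 20.0), "jobs/min")
```

Under this model throughput grows linearly up to the cap and stays flat beyond it. If your measured curve has that shape, `implied_cap` gives a rough estimate of how many jobs actually run in parallel.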

    If this response answered your question, could you please take a moment to accept the answer? Your feedback is greatly appreciated.

    Thank you!
