How can we scale Azure AI Foundry LLM usage so that multiple users can use the models at once, each user with the capacity to ask queries that require 25,000 to 100,000 tokens?

Cody Tipping 0 Reputation points
2025-06-10T11:56:08.8+00:00

We are building an AI application, using Azure AI Foundry, that ingests police crime data for analysis. The problem is that certain queries can trigger large data requests. Even after filtering out the parts of each report that are irrelevant to the question, the total token counts can be quite large, often exceeding the token limit for the model and Azure AI Foundry tier.

At the PoC stage, we can work around this, because we can partition the request into several LLM calls and then synthesize the results. Even that is tricky, though, because we are capped at 50,000 tokens per minute, so there is noticeable latency on the user-experience side.
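To make the workaround concrete, here is a minimal sketch of the kind of partition-and-synthesize loop we mean, with a crude client-side throttle for the 50,000 TPM cap. It assumes the openai Python SDK (v1.x) and tiktoken; the endpoint, key, deployment name, and chunking are placeholders, not our actual code.

```python
import time

import tiktoken
from openai import AzureOpenAI  # openai>=1.x SDK

# Placeholder connection details -- not our real configuration.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-key>",
    api_version="2024-06-01",
)
DEPLOYMENT = "gpt-4o"   # hypothetical deployment name
TPM_BUDGET = 50_000     # the per-minute token cap we are hitting
enc = tiktoken.get_encoding("cl100k_base")


def partitioned_query(question: str, report_chunks: list[str]) -> str:
    """One LLM call per chunk, paced to stay under the TPM cap, then a
    final call that synthesizes the partial answers."""
    partials = []
    tokens_this_minute, window_start = 0, time.monotonic()
    for chunk in report_chunks:
        prompt = f"{question}\n\nRelevant report excerpt:\n{chunk}"
        needed = len(enc.encode(prompt))
        # Crude client-side throttle: if the minute's budget is spent, wait it out.
        if tokens_this_minute + needed > TPM_BUDGET:
            time.sleep(max(0.0, 60 - (time.monotonic() - window_start)))
            tokens_this_minute, window_start = 0, time.monotonic()
        resp = client.chat.completions.create(
            model=DEPLOYMENT, messages=[{"role": "user", "content": prompt}]
        )
        partials.append(resp.choices[0].message.content)
        tokens_this_minute += resp.usage.total_tokens
    # Synthesis pass over the partial answers.
    final = client.chat.completions.create(
        model=DEPLOYMENT,
        messages=[{
            "role": "user",
            "content": question + "\n\nCombine these partial findings into one answer:\n"
                       + "\n---\n".join(partials),
        }],
    )
    return final.choices[0].message.content
```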

But when we think about scaling, there is obviously a huge obstacle. My question is: how can we continue using Azure AI Foundry for hundreds of users, all of whom might make concurrent calls with queries that run to tens of thousands of tokens?

Azure OpenAI Service

1 answer

  1. Pavankumar Purilla 8,335 Reputation points Microsoft External Staff Moderator
    2025-06-11T03:09:22.2566667+00:00

    Hi Cody Tipping,

    Thank you for reaching out. Scaling Azure AI Foundry to support multiple concurrent users with high token usage can indeed present challenges, particularly when dealing with large inputs and throughput limits. To handle this effectively, here are several strategies you can consider:

    Firstly, while the maximum number of tokens per request is defined by the model itself (for example, the 128k-token context window of GPT-4 Turbo), you can request an increase in your throughput quota, specifically tokens per minute (TPM), requests per minute (RPM), and concurrent requests. This can be done via the Azure portal under Service and subscription limits (quotas) by providing your expected workload details.

    For large and non-time-sensitive queries, consider implementing batch-style processing to handle tasks asynchronously. This approach helps distribute the load and reduces front-end latency. If you’re working in a multi-tenant scenario, you can choose between shared or separate processing pipelines per user group, depending on your design.
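    As a rough illustration of that asynchronous pattern, below is a minimal asyncio worker-queue sketch using the openai Python SDK; the endpoint, key, deployment name, and concurrency value are placeholders. In production you would more likely put a durable queue (for example, Azure Queue Storage or Service Bus) or an Azure OpenAI Global Batch deployment behind the front end, but the decoupling idea is the same.

    ```python
    import asyncio

    from openai import AsyncAzureOpenAI  # openai>=1.x SDK

    # Placeholder connection details.
    client = AsyncAzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<your-key>",
        api_version="2024-06-01",
    )
    DEPLOYMENT = "gpt-4o"    # hypothetical deployment name
    MAX_CONCURRENCY = 4      # tune to your RPM/TPM quota


    async def worker(queue: asyncio.Queue, results: dict) -> None:
        """Drain queued analysis jobs; the front end only enqueues work and
        polls for results, so users are not blocked on long-running calls."""
        while True:
            job_id, prompt = await queue.get()
            try:
                resp = await client.chat.completions.create(
                    model=DEPLOYMENT, messages=[{"role": "user", "content": prompt}]
                )
                results[job_id] = resp.choices[0].message.content
            finally:
                queue.task_done()


    async def main() -> None:
        queue: asyncio.Queue = asyncio.Queue()
        results: dict[str, str] = {}
        workers = [asyncio.create_task(worker(queue, results)) for _ in range(MAX_CONCURRENCY)]
        # Jobs would normally arrive from user requests; one example job here.
        await queue.put(("job-1", "Summarize burglary trends in district 4 for May."))
        await queue.join()                 # wait for all queued work to finish
        for w in workers:
            w.cancel()
        await asyncio.gather(*workers, return_exceptions=True)
        print(results)


    asyncio.run(main())
    ```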

    Since you’re already partitioning large requests, you may benefit from further optimizing your chunking strategy — for example, by using semantic segmentation or embedding-based filtering to ensure only the most relevant content is passed to the model. This not only improves performance but also reduces token consumption.
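    For example, here is a minimal sketch of embedding-based pre-filtering with the openai Python SDK; the embedding deployment name and endpoint are placeholders, and at larger scale a vector index such as Azure AI Search would replace the in-memory comparison.

    ```python
    import numpy as np
    from openai import AzureOpenAI  # openai>=1.x SDK

    # Placeholder connection details.
    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<your-key>",
        api_version="2024-06-01",
    )
    EMBED_DEPLOYMENT = "text-embedding-3-small"  # hypothetical embedding deployment


    def top_k_chunks(question: str, chunks: list[str], k: int = 10) -> list[str]:
        """Keep only the k chunks most similar to the question, so far fewer
        tokens reach the chat model."""
        resp = client.embeddings.create(model=EMBED_DEPLOYMENT, input=[question] + chunks)
        vectors = np.array([d.embedding for d in resp.data])
        q, docs = vectors[0], vectors[1:]
        sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
        best = np.argsort(sims)[::-1][:k]
        return [chunks[i] for i in best]
    ```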

    To handle the 50,000 tokens per minute cap more gracefully, it’s a good practice to implement retry logic with exponential backoff in your application. This helps manage rate-limiting responses (like 429 errors) and maintains a smoother user experience during peak usage.
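    A minimal sketch of that retry pattern with the openai Python SDK follows (deployment name and endpoint are placeholders); honoring any Retry-After header the service returns and sharing a rate limiter across users would be natural refinements.

    ```python
    import random
    import time

    from openai import AzureOpenAI, RateLimitError  # openai>=1.x SDK

    # Placeholder connection details.
    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<your-key>",
        api_version="2024-06-01",
    )
    DEPLOYMENT = "gpt-4o"  # hypothetical deployment name


    def chat_with_backoff(messages: list[dict], max_retries: int = 6):
        """Retry 429 (rate-limit) responses with exponential backoff plus jitter
        instead of surfacing the error to the user."""
        delay = 1.0
        for attempt in range(max_retries):
            try:
                return client.chat.completions.create(model=DEPLOYMENT, messages=messages)
            except RateLimitError:
                if attempt == max_retries - 1:
                    raise
                time.sleep(delay + random.uniform(0, delay))  # jitter avoids synchronized retries
                delay *= 2
    ```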

    We also recommend monitoring your usage patterns closely, so you can fine-tune request rates and scale gradually as demand increases. For early-stage or temporary scaling needs, you may also explore using shared quota pools, if applicable, to support burst capacity while awaiting quota increases.
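    As one lightweight way to start monitoring, the sketch below logs the token usage reported on each response (placeholders as above); the Azure Monitor metrics on the Azure OpenAI resource give a similar picture without application changes.

    ```python
    import logging

    from openai import AzureOpenAI  # openai>=1.x SDK

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("token-usage")

    # Placeholder connection details.
    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<your-key>",
        api_version="2024-06-01",
    )
    DEPLOYMENT = "gpt-4o"  # hypothetical deployment name


    def tracked_chat(user_id: str, messages: list[dict]):
        """Log the prompt/completion token counts the service reports for each
        call, so request rates and quota asks can be sized from real data."""
        resp = client.chat.completions.create(model=DEPLOYMENT, messages=messages)
        u = resp.usage
        log.info("user=%s prompt=%d completion=%d total=%d",
                 user_id, u.prompt_tokens, u.completion_tokens, u.total_tokens)
        return resp
    ```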

    Combining these approaches will help ensure a more scalable and responsive experience for your users as your application grows.

    I hope this information helps. Thank you!

