Hi Cody Tipping,
Thank you for reaching out. Scaling Azure AI Foundry to support multiple concurrent users with high token usage can indeed present challenges, particularly when dealing with large inputs and throughput limits. To handle this effectively, here are several strategies you can consider:
Firstly, while the maximum token limit per request is fixed by the model itself (for example, the 128k-token context window of GPT-4 Turbo), you can request an increase in your throughput quota: specifically, tokens per minute (TPM), requests per minute (RPM), and concurrent requests. You can submit this request via the Azure portal under Service and subscription limits (quotas), providing your expected workload details.
For large and non-time-sensitive queries, consider implementing batch-style processing to handle tasks asynchronously. This approach helps distribute the load and reduces front-end latency. If you’re working in a multi-tenant scenario, you can choose between shared or separate processing pipelines per user group, depending on your design.
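As a rough illustration (not Azure-specific code), the Python sketch below drains a queue of large, non-urgent requests with a small pool of asyncio workers. The process_request coroutine is a placeholder for your actual model call, and the concurrency value is an arbitrary example you would tune to your quota.

```python
import asyncio

# Hypothetical stand-in for your actual Azure AI Foundry model call.
async def process_request(payload: str) -> str:
    await asyncio.sleep(0.1)  # simulate model latency
    return f"processed: {payload}"

async def worker(queue: asyncio.Queue, results: list) -> None:
    # Each worker pulls queued jobs so only a bounded number run at once.
    while True:
        payload = await queue.get()
        try:
            results.append(await process_request(payload))
        finally:
            queue.task_done()

async def run_batch(payloads: list[str], concurrency: int = 4) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    results: list[str] = []
    for p in payloads:
        queue.put_nowait(p)
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(concurrency)]
    await queue.join()   # wait until every queued job has finished
    for w in workers:
        w.cancel()       # workers loop forever; stop them explicitly
    return results

# Example: process 20 queued requests with at most 4 in flight.
print(asyncio.run(run_batch([f"query-{i}" for i in range(20)])))
```

Because the queue decouples accepting work from executing it, front-end callers can return immediately while the workers drain the backlog at a rate your quota can sustain.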
Since you’re already partitioning large requests, you may benefit from further optimizing your chunking strategy — for example, by using semantic segmentation or embedding-based filtering to ensure only the most relevant content is passed to the model. This not only improves performance but also reduces token consumption.
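Here is a minimal sketch of the embedding-based filtering idea: rank chunks by cosine similarity to the query and keep only the top few. The embed function is a toy stand-in so the example runs offline; in practice you would call your deployed embedding model instead.

```python
import math

# Toy embedding so the example runs without a service call; replace with
# a call to your deployed embedding model in practice.
def embed(text: str) -> list[float]:
    return [float(ord(c)) for c in text[:8].ljust(8)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Keep only the k chunks most similar to the query, so the prompt
    # carries relevant context instead of the whole document.
    q_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q_vec), reverse=True)
    return ranked[:k]

chunks = ["billing policy details", "quota increase steps", "holiday schedule"]
print(top_k_chunks("how do I raise my quota?", chunks, k=2))
```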
To handle the 50,000 tokens-per-minute cap more gracefully, it's a good practice to implement retry logic with exponential backoff in your application. This lets you recover from rate-limiting responses (HTTP 429 errors) and maintain a smoother user experience during peak usage.
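A minimal sketch of that retry pattern follows. The call_model function and RateLimitError class are hypothetical stand-ins for your actual request and whatever 429 exception your client library raises; when the service provides a Retry-After hint, the code prefers it over the computed backoff.

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for the 429 error your client library raises."""
    def __init__(self, retry_after: float | None = None):
        self.retry_after = retry_after

def call_model(prompt: str) -> str:
    # Stand-in for your actual request; randomly rate-limits to
    # exercise the backoff path in this demo.
    if random.random() < 0.3:
        raise RateLimitError(retry_after=1.0)
    return f"response for: {prompt}"

def call_with_backoff(prompt: str, max_retries: int = 5) -> str:
    for attempt in range(max_retries):
        try:
            return call_model(prompt)
        except RateLimitError as err:
            # Prefer the service's Retry-After hint if present; otherwise
            # back off exponentially with jitter so retries don't synchronize.
            delay = err.retry_after or min(2 ** attempt + random.random(), 30)
            time.sleep(delay)
    raise RuntimeError("rate-limited on every attempt; giving up")

print(call_with_backoff("summarize this document"))
```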
We also recommend monitoring your usage patterns closely, so you can fine-tune request rates and scale gradually as demand increases. For early-stage or temporary scaling needs, you may also explore using shared quota pools, if applicable, to support burst capacity while awaiting quota increases.
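One way to pace usage client-side is a sliding-window token budget that blocks a request when it would push the last minute's consumption past the cap. The sketch below assumes the 50,000 TPM limit mentioned above and a rough token estimate supplied by the caller; both are placeholders to adapt to your workload.

```python
import time
from collections import deque

class TokenBudget:
    """Sliding-window throttle: blocks when the next request would push
    the last 60 seconds of usage past the per-minute cap."""
    def __init__(self, tokens_per_minute: int = 50_000):
        self.cap = tokens_per_minute
        self.window: deque[tuple[float, int]] = deque()  # (timestamp, tokens)

    def _used(self, now: float) -> int:
        # Drop entries older than 60 seconds, then sum what remains.
        while self.window and now - self.window[0][0] > 60:
            self.window.popleft()
        return sum(tokens for _, tokens in self.window)

    def acquire(self, tokens: int) -> None:
        # Block until the request fits inside the current window.
        # Assumes a single request never exceeds the cap on its own.
        while self._used(time.monotonic()) + tokens > self.cap:
            time.sleep(0.5)
        self.window.append((time.monotonic(), tokens))

budget = TokenBudget()
budget.acquire(4_000)   # rough token estimate for the next request
# ... send the request to the model here ...
```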
Combining these approaches will help ensure a more scalable and responsive experience for your users as your application grows.
I hope this information helps. Thank you!