Hi,
Thanks for reaching out to Microsoft Q&A.
TL;DR: To increase concurrency in Azure OpenAI:
- Scale up your model deployment by increasing capacity units.
- Adjust quotas to raise your requests-per-minute and tokens-per-minute limits.
- Optimize your application for concurrency and efficient resource utilization.
- Monitor and handle errors related to throttling or limits.
- Consult documentation and support to ensure all configurations align with best practices.
Detailed summary:
To increase the concurrency of your Azure OpenAI deployment and handle more than one request per second, you'll need to adjust several aspects of your setup.
Here are the steps to help you achieve higher throughput:
- Check Current Deployment Capacity:
- Capacity Units (CUs): Azure OpenAI uses capacity units to determine the throughput and concurrency of your model deployments. By default, deployments may start with minimal capacity.
- Action: Navigate to your Azure OpenAI resource in the Azure Portal and check the number of capacity units allocated to your gpt-35-turbo deployment.
- Scale Up Your Deployment:
- Increase Capacity Units:
- Effect: Allocating more capacity units will allow your deployment to handle more concurrent requests and higher overall throughput.
- Action: In the Azure Portal, select your deployed model and adjust the number of capacity units to a higher value that matches your concurrency needs.
- Apply the Change:
- After scaling, the new capacity can take a few minutes to take effect; verify it in the portal before retesting your throughput.
- Review and Adjust Quotas:
- Default Quotas:
- Azure OpenAI enforces default quotas on requests per minute and tokens per minute, which might be limiting your throughput.
- Action: Check your current quotas by navigating to the "Usage + quotas" section of your Azure OpenAI resource.
- Request Quota Increase:
- If the default quotas are insufficient, submit a quota increase request via the Azure Portal:
- Go to your Azure subscription.
- Select "Usage + quotas".
- Find "Azure OpenAI Service".
- Click on "Request Increase" and fill out the necessary details.
- Optimize Your Node.js Application:
- Connection Management:
- Ensure that your application reuses HTTP connections (e.g., keep-alive or connection pooling) instead of opening a new connection per request.
- Asynchronous Requests:
- Use asynchronous programming practices to handle multiple requests concurrently without blocking the event loop.
- Error Handling:
- Implement robust error handling to retry failed requests when appropriate; a minimal sketch covering these points follows this list.
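As a minimal sketch (not a definitive implementation), the TypeScript below uses the openai v4 SDK's AzureOpenAI client; the endpoint, key, API version, deployment name, and retry count are all placeholder assumptions to adapt to your setup:

```typescript
// Minimal sketch (TypeScript, openai v4 SDK). Endpoint, key, API version,
// and deployment name are placeholders for your own values.
import { AzureOpenAI } from "openai";

const client = new AzureOpenAI({
  endpoint: process.env.AZURE_OPENAI_ENDPOINT,    // e.g. https://<resource>.openai.azure.com
  apiKey: process.env.AZURE_OPENAI_API_KEY,
  apiVersion: "2024-02-01",                       // use the version you target
  deployment: "gpt-35-turbo",                     // your deployment name
  maxRetries: 5,                                  // SDK retries 429/5xx with backoff
});

// Issue requests concurrently instead of awaiting them one at a time;
// the SDK reuses keep-alive connections under the hood.
async function runConcurrently(prompts: string[]) {
  const results = await Promise.allSettled(
    prompts.map((p) =>
      client.chat.completions.create({
        model: "gpt-35-turbo",                    // on Azure, the deployment above wins
        messages: [{ role: "user", content: p }],
      })
    )
  );
  for (const r of results) {
    if (r.status === "fulfilled") {
      console.log(r.value.choices[0].message.content);
    } else {
      console.error("Request failed:", r.reason); // persistent 429s => raise capacity/quota
    }
  }
}
```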
- Monitor Throttling and Errors:
- Check Error Messages:
- Throttled requests return HTTP 429 (Too Many Requests), typically with a message identifying which limit was exceeded and, often, a Retry-After header.
- Action: Log and review the error responses from Azure OpenAI to identify if you're hitting specific limits; a short logging sketch follows this list.
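For illustration, and reusing the client from the earlier sketch, a throttling-aware wrapper might log 429 details like this (APIError comes from the openai v4 SDK; the Retry-After header is not guaranteed on every response):

```typescript
// Rough illustration, reusing the `client` from the sketch above.
import { APIError } from "openai";

async function callWithLogging(prompt: string) {
  try {
    return await client.chat.completions.create({
      model: "gpt-35-turbo",
      messages: [{ role: "user", content: prompt }],
    });
  } catch (err) {
    if (err instanceof APIError && err.status === 429) {
      // 429 = throttled; the response usually hints at which limit was hit.
      console.warn("Throttled:", err.message);
      console.warn("Retry-After:", err.headers?.["retry-after"]);
    }
    throw err;
  }
}
```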
- Use Batch Requests (if applicable):
- Batch Processing:
- If your use case allows, send batch requests to process multiple inputs in a single API call.
- Action: Modify your application to group requests and handle batch responses; a hedged example follows this list.
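One caveat before the example: the chat completions endpoint takes a single conversation per call, while the legacy completions endpoint accepts an array of prompts. Assuming a completion-capable deployment (gpt-35-turbo-instruct is used purely as an example name), a batched call might look like this:

```typescript
// Hedged sketch: the legacy completions endpoint accepts an array of
// prompts, so several inputs share one API call. This applies to
// completion models, not to the chat endpoint.
async function batchComplete(prompts: string[]) {
  const res = await client.completions.create({
    model: "gpt-35-turbo-instruct", // assumption: a completions-capable deployment
    prompt: prompts,                // e.g. ["Summarize A", "Summarize B"]
    max_tokens: 100,
  });
  // Each choice carries an index that maps it back to its prompt.
  for (const choice of res.choices) {
    console.log(prompts[choice.index], "->", choice.text);
  }
}
```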
- Consult Azure OpenAI Documentation:
- Best Practices:
- Review the Azure OpenAI Service best practices for performance optimization.
- Scaling Guidance:
- Read about scaling deployments to understand how capacity units affect performance.
- Compare with OpenAI API Configuration:
- Configuration Differences:
- Since you mentioned better scaling with the OpenAI API, compare the configuration settings between your Azure OpenAI and OpenAI API implementations.
- Action: Align your Azure OpenAI deployment settings with those that work in your OpenAI API setup, adjusting for any platform differences; an illustrative side-by-side client setup follows.
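As a rough side-by-side (again with placeholder names), the key structural difference is that Azure ties requests to a specific deployment and API version, while the OpenAI API selects the model per request:

```typescript
// Illustrative comparison (openai v4 SDK); names are placeholders.
import OpenAI, { AzureOpenAI } from "openai";

// OpenAI API: one global endpoint; the model is chosen per request and
// rate limits are tied to your account tier.
const oa = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Azure OpenAI: a per-resource endpoint, an API version, and a named
// deployment whose capacity and quota settings cap your throughput.
const az = new AzureOpenAI({
  endpoint: process.env.AZURE_OPENAI_ENDPOINT,
  apiKey: process.env.AZURE_OPENAI_API_KEY,
  apiVersion: "2024-02-01",
  deployment: "gpt-35-turbo",
});
```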
By following these steps, you should be able to achieve higher concurrency with your Azure OpenAI deployment, similar to the performance you're experiencing with the OpenAI API.
Please 'Upvote' (Thumbs-up) and 'Accept as answer' if the reply was helpful. This will benefit other community members who face the same issue.