Hi,
Thanks for reaching out to Microsoft Q&A.
TL;DR: To increase concurrency in Azure OpenAI:
- Scale up your model deployment by increasing capacity units.
- Adjust quotas to raise your requests-per-minute and tokens-per-minute limits.
- Optimize your application for concurrency and efficient resource utilization.
- Monitor and handle errors related to throttling or limits.
- Consult documentation and support to ensure all configurations align with best practices.
Detailed summary:
To increase the concurrency of your Azure OpenAI deployment and handle more than one request per second, you'll need to adjust several aspects of your setup.
Here are the steps to help you achieve higher throughput:
- Check Current Deployment Capacity:
- Capacity Units (CUs): Azure OpenAI uses capacity units to determine the throughput and concurrency of your model deployments. By default, deployments may start with minimal capacity.
- Action: Navigate to your Azure OpenAI resource in the Azure Portal and check the number of capacity units allocated to your gpt-35-turbo deployment.
- Scale Up Your Deployment:
- Increase Capacity Units:
- Effect: Allocating more capacity units will allow your deployment to handle more concurrent requests and higher overall throughput.
- Action: In the Azure Portal, select your deployed model and adjust the number of capacity units to a higher value that matches your concurrency needs.
- Apply the Change:
- After scaling, the new capacity can take a few minutes to take effect; verify it in the portal before retesting your throughput.
- Review and Adjust Quotas:
- Default Quotas:
- Azure OpenAI enforces default quotas on requests per minute and tokens per minute, which might be limiting your throughput.
- Action: Check your current quotas by navigating to the "Usage + quotas" section of your Azure OpenAI resource.
- Request Quota Increase:
- If the default quotas are insufficient, submit a quota increase request via the Azure Portal:
- Go to your Azure subscription.
- Select "Usage + quotas".
- Find "Azure OpenAI Service".
- Click on "Request Increase" and fill out the necessary details.
- Optimize Your Node.js Application:
- Connection Management:
- Ensure that your application reuses HTTP connections (e.g., keep-alive or connection pooling) instead of opening a new connection per request.
- Asynchronous Requests:
- Use asynchronous programming practices to handle multiple requests concurrently without blocking the event loop.
- Error Handling:
- Implement robust error handling to retry failed requests when appropriate; a minimal sketch covering these points follows this list.
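As a minimal sketch (not a definitive implementation), the TypeScript below uses the openai v4 SDK's AzureOpenAI client; the endpoint, key, API version, deployment name, and retry count are all placeholder assumptions to adapt to your setup:

```typescript
// Minimal sketch (TypeScript, openai v4 SDK). Endpoint, key, API version,
// and deployment name are placeholders for your own values.
import { AzureOpenAI } from "openai";

const client = new AzureOpenAI({
  endpoint: process.env.AZURE_OPENAI_ENDPOINT,    // e.g. https://<resource>.openai.azure.com
  apiKey: process.env.AZURE_OPENAI_API_KEY,
  apiVersion: "2024-02-01",                       // use the version you target
  deployment: "gpt-35-turbo",                     // your deployment name
  maxRetries: 5,                                  // SDK retries 429/5xx with backoff
});

// Issue requests concurrently instead of awaiting them one at a time;
// the SDK reuses keep-alive connections under the hood.
async function runConcurrently(prompts: string[]) {
  const results = await Promise.allSettled(
    prompts.map((p) =>
      client.chat.completions.create({
        model: "gpt-35-turbo",                    // on Azure, the deployment above wins
        messages: [{ role: "user", content: p }],
      })
    )
  );
  for (const r of results) {
    if (r.status === "fulfilled") {
      console.log(r.value.choices[0].message.content);
    } else {
      console.error("Request failed:", r.reason); // persistent 429s => raise capacity/quota
    }
  }
}
```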
- Monitor Throttling and Errors:
- Check Error Messages:
- Throttled requests return HTTP 429 (Too Many Requests), typically with a message identifying which limit was exceeded and, often, a Retry-After header.
- Action: Log and review the error responses from Azure OpenAI to identify if you're hitting specific limits; a short logging sketch follows this list.
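For illustration, and reusing the client from the earlier sketch, a throttling-aware wrapper might log 429 details like this (APIError comes from the openai v4 SDK; the Retry-After header is not guaranteed on every response):

```typescript
// Rough illustration, reusing the `client` from the sketch above.
import { APIError } from "openai";

async function callWithLogging(prompt: string) {
  try {
    return await client.chat.completions.create({
      model: "gpt-35-turbo",
      messages: [{ role: "user", content: prompt }],
    });
  } catch (err) {
    if (err instanceof APIError && err.status === 429) {
      // 429 = throttled; the response usually hints at which limit was hit.
      console.warn("Throttled:", err.message);
      console.warn("Retry-After:", err.headers?.["retry-after"]);
    }
    throw err;
  }
}
```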
- Use Batch Requests (if applicable):
- Batch Processing:
- If your use case allows, send batch requests to process multiple inputs in a single API call.
- Action: Modify your application to group requests and handle batch responses; a hedged example follows this list.
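One caveat before the example: the chat completions endpoint takes a single conversation per call, while the legacy completions endpoint accepts an array of prompts. Assuming a completion-capable deployment (gpt-35-turbo-instruct is used purely as an example name), a batched call might look like this:

```typescript
// Hedged sketch: the legacy completions endpoint accepts an array of
// prompts, so several inputs share one API call. This applies to
// completion models, not to the chat endpoint.
async function batchComplete(prompts: string[]) {
  const res = await client.completions.create({
    model: "gpt-35-turbo-instruct", // assumption: a completions-capable deployment
    prompt: prompts,                // e.g. ["Summarize A", "Summarize B"]
    max_tokens: 100,
  });
  // Each choice carries an index that maps it back to its prompt.
  for (const choice of res.choices) {
    console.log(prompts[choice.index], "->", choice.text);
  }
}
```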
- Consult Azure OpenAI Documentation:
- Best Practices:
- Review the Azure OpenAI Service best practices for performance optimization.
- Scaling Guidance:
- Read about scaling deployments to understand how capacity units affect performance.
- Compare with OpenAI API Configuration:
- Configuration Differences:
- Since you mentioned better scaling with the OpenAI API, compare the configuration settings between your Azure OpenAI and OpenAI API implementations.
- Action: Align your Azure OpenAI deployment settings with those that work in your OpenAI API setup, adjusting for any platform differences; an illustrative side-by-side client setup follows.
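As a rough side-by-side (again with placeholder names), the key structural difference is that Azure ties requests to a specific deployment and API version, while the OpenAI API selects the model per request:

```typescript
// Illustrative comparison (openai v4 SDK); names are placeholders.
import OpenAI, { AzureOpenAI } from "openai";

// OpenAI API: one global endpoint; the model is chosen per request and
// rate limits are tied to your account tier.
const oa = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Azure OpenAI: a per-resource endpoint, an API version, and a named
// deployment whose capacity and quota settings cap your throughput.
const az = new AzureOpenAI({
  endpoint: process.env.AZURE_OPENAI_ENDPOINT,
  apiKey: process.env.AZURE_OPENAI_API_KEY,
  apiVersion: "2024-02-01",
  deployment: "gpt-35-turbo",
});
```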
By following these steps, you should be able to achieve higher concurrency with your Azure OpenAI deployment, similar to the performance you're experiencing with the OpenAI API.
Please 'Upvote' (Thumbs-up) and 'Accept as answer' if the reply was helpful. This will benefit other community members who face the same issue.