Trying to adjust model deployment capacities via ARM template

Curtis, Colin 0 Reputation points
2024-09-24T16:46:18.2333333+00:00

My team has two deployments of a single model type in a given subscription + region instance (SRI). Together, those two deployments consume the entire quota for that model in that SRI, but we now want to shift the distribution of throughput between the two deployments (increase one, decrease the other). Naively, I changed only the "capacity" values in the ARM template and re-ran the pipeline, but this produced the following error:

{"code": "InsufficientQuota", "message": "This operation require 40 new capacity in quota Tokens Per Minute (thousands) - GPT-4-32K, which is bigger than the current avail

Reading the Azure docs, I believe this error occurs because, when the az cli processes the ARM template, it does not consider the final state; instead, it checks whether each operation can succeed on its own. As a result, I have so far been unable to make the desired changes using our ARM template + Jenkins pipeline (which calls the az cli under the hood).

To put the issue a different way: in the Azure OAI interface, I can go to a specific deployment of a model and hit the “Edit Deployment” button, which lets me adjust the “Tokens per Minute Rate Limit” in place. For our use case, making this change by hand in the interface is not a viable means of control in a production environment. This in-place adjustment is the behavior we wish to replicate in an infrastructure-as-code manner! Do you know how I may accomplish this via ARM template and/or API?

Just had the thought - perhaps a Deployment Script (https://learn.microsoft.com/en-us/azure/azure-resource-manager/templates/deployment-script-template) might facilitate this? Open to any suggestions, thanks.
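
If the per-operation quota check described above is indeed the cause, one possible workaround is to apply the capacity decrease before the increase, so that no single step asks for more than the quota allows. Below is a minimal sketch of that ordering logic in Python. It only plans the order of updates and makes no Azure calls; the function name `plan_capacity_updates`, the deployment names, and the quota figure of 120 are all hypothetical, chosen for illustration.

```python
from typing import Dict, List, Tuple


def plan_capacity_updates(
    current: Dict[str, int],
    desired: Dict[str, int],
    quota: int,
) -> List[Tuple[str, int]]:
    """Return (deployment_name, new_capacity) steps, decreases first.

    Hypothetical helper: it only orders the updates. Whatever tool applies
    them (ARM, az cli, REST API) is outside the scope of this sketch.
    """
    if set(current) != set(desired):
        raise ValueError("current and desired must name the same deployments")
    if sum(desired.values()) > quota:
        raise ValueError("desired capacities exceed the quota")

    # Keep only deployments whose capacity actually changes.
    changes = [
        (name, desired[name])
        for name in current
        if desired[name] != current[name]
    ]
    # Most negative delta first, so capacity is freed before it is claimed.
    changes.sort(key=lambda step: step[1] - current[step[0]])

    # Safety check: replay the plan and confirm no step exceeds the quota.
    running = dict(current)
    for name, capacity in changes:
        running[name] = capacity
        if sum(running.values()) > quota:
            raise ValueError(f"step for {name!r} would exceed the quota")
    return changes


# Assumed example values: two deployments that together use a quota of 120.
steps = plan_capacity_updates(
    current={"deployment-a": 80, "deployment-b": 40},
    desired={"deployment-a": 40, "deployment-b": 80},
    quota=120,
)
for name, capacity in steps:
    print(f"set {name} capacity to {capacity}")
```

With these assumed values, the plan lowers `deployment-a` first and only then raises `deployment-b`. Whether the pipeline can actually apply the two updates as separate, ordered steps is an open question here, not something this sketch confirms.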

Azure API Management