凯旋 李, you can put a simple Web App in front of the Azure OpenAI endpoint that just forwards requests, then place API Management (APIM) in front of that to validate each caller’s Azure AD token, extract their UPN from the token’s claims, and meter how many model tokens they consume. When someone goes over your limit, APIM rejects the request before it ever reaches OpenAI.
First, wrap OpenAI in a proxy Web App. You can start with the official Azure AI Foundry sample or build a tiny ASP.NET Core Web API that reads two settings (your OpenAI endpoint and key) and simply relays any JSON posted to /chat/completions on to the OpenAI REST API. Deploy it to App Service so it lives at, for example, https://my-openai-proxy.azurewebsites.net.
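If you go the ASP.NET Core route, the relay itself is only a few lines. Here is a minimal sketch; the setting names AOAI_ENDPOINT and AOAI_KEY and the api-version in the comment are illustrative, so adjust them to your own resource:

```csharp
using System.Text;

// Minimal relay: forwards posted JSON to Azure OpenAI unchanged.
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHttpClient();
var app = builder.Build();

app.MapPost("/chat/completions",
    async (HttpRequest request, IHttpClientFactory factory, IConfiguration config) =>
{
    // e.g. https://<resource>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=2024-02-01
    var endpoint = config["AOAI_ENDPOINT"];

    using var reader = new StreamReader(request.Body);
    var body = await reader.ReadToEndAsync();

    var upstream = new HttpRequestMessage(HttpMethod.Post, endpoint)
    {
        Content = new StringContent(body, Encoding.UTF8, "application/json")
    };
    upstream.Headers.Add("api-key", config["AOAI_KEY"]); // Azure OpenAI expects the api-key header

    var response = await factory.CreateClient().SendAsync(upstream);
    var json = await response.Content.ReadAsStringAsync();
    return Results.Content(json, "application/json", statusCode: (int)response.StatusCode);
});

app.Run();
```

Because APIM handles the callers’ Azure AD tokens, the proxy only needs the OpenAI key; lock the App Service down with networking or access restrictions so it cannot be reached except through APIM.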
Next, set up Azure AD so that users sign in and your client (whether it’s MSAL.js in a browser or Azure.Identity in server code) can request a token for the scope you exposed on your APIM API, for example api://<APIM-API-ID>/access_as_user. Every call to your proxy must then carry that token in the Authorization header.
using Azure.Core;
using Azure.Identity;
using System.Net.Http.Headers;
using System.Text;

// Acquire a user token for the scope exposed on the APIM API
var credential = new InteractiveBrowserCredential();
var token = await credential.GetTokenAsync(
    new TokenRequestContext(new[] { "api://<APIM-API-ID>/access_as_user" }));

// Attach the bearer token and call the APIM endpoint
httpClient.DefaultRequestHeaders.Authorization =
    new AuthenticationHeaderValue("Bearer", token.Token);
var result = await httpClient.PostAsync(
    "https://<your-apim>.azure-api.net/openai/chat/completions",
    new StringContent(jsonPayload, Encoding.UTF8, "application/json"));
With the Web App ready, import it into API Management as a backend API. In the inbound section of the APIM policy, first validate the bearer token against your tenant and the expected audience and scope, then pull the user’s upn claim out of it, then apply an llm-token-limit policy keyed on that UPN so APIM keeps a running tally of how many tokens each user spends. Finally, forward the request to the proxy.
<policies>
<inbound>
<base />
<validate-jwt header-name="Authorization"
require-scheme="Bearer"
output-token-variable-name="jwt">
<openid-config url="https://login.microsoftonline.com/<TENANT_ID>/v2.0/.well-known/openid-configuration"/>
<audiences>
<audience>api://<API-ID></audience>
</audiences>
<required-claims>
<claim name="scp" match="any">
<value>access_as_user</value>
</claim>
</required-claims>
</validate-jwt>
<set-variable name="userUpn" value="@{
    var jwt = (Jwt)context.Variables["jwt"];
    var upn = jwt.Claims.GetValueOrDefault("upn", "");
    return upn != "" ? upn : jwt.Claims.GetValueOrDefault("preferred_username", "unknown");
}" />
<llm-token-limit tokens-per-minute="10000"
token-quota="4000"
token-quota-period="Hourly"
counter-key="@(context.Variables["userUpn"])"
estimate-prompt-tokens="true"
remaining-quota-tokens-header-name="x-remaining-tokens"
tokens-consumed-header-name="x-consumed-tokens" />
<set-backend-service backend-id="your-openai-proxy-service" />
</inbound>
<backend>
<forward-request />
</backend>
<outbound>
<base />
</outbound>
<on-error>
<base />
</on-error>
</policies>
When you call the APIM URL with a valid bearer token, you will see x-remaining-tokens and x-consumed-tokens in the response headers. If a user spends more than 4,000 tokens in an hour, APIM automatically rejects further requests (429 or 403, depending on whether the per-minute rate or the hourly quota was exceeded) until the period resets. No extra keys or manual subscriptions are needed; the users’ Azure AD identities are enough.