What is the best way to decrease latency in Azure OpenAI?

K S, Aptha 20 Reputation points
2025-02-09T17:10:48.19+00:00

I have the two scenarios below and want to know whether either offers any advantages.

  1. A single Azure OpenAI resource in East US shared by multiple teams in a resource group. Each team has a GPT-4o deployment with a different name in this single resource, using the Global Standard deployment type and its own capacity requirement.
  2. Multiple Azure OpenAI resources in East US, one per team, in a resource group. Each team has a GPT-4o deployment in its respective resource, using the Global Standard deployment type and its own capacity requirement.

I know that Global Standard capacity is set at the subscription level, so Scenarios 1 and 2 will have the same limits.

What I want to know is:

  • Are there any advantages to having multiple Azure OpenAI resources in a resource group (Scenario 2), especially in terms of latency, compared to Scenario 1?
  • Does Scenario 2 help with caching (input token caching) compared to Scenario 1?
  • When multiple teams call their GPT-4o models, do Scenarios 1 and 2 behave the same?
  • Is there any other way to decrease GPT-4o latency (other than provisioned managed throughput) when multiple teams are using GPT-4o?
Azure OpenAI Service

Accepted answer
Divyesh Govaerdhanan 6,400 Reputation points
2025-02-09T23:09:33.5166667+00:00

Hello,

Welcome to Microsoft Q&A.

Key Comparisons:

| Factor | Scenario 1 (Single Resource) | Scenario 2 (Multiple Resources) |
|---|---|---|
| Latency | Increased queue time if multiple teams use the resource heavily | Reduced latency due to isolation |
| Caching | Shared cache across teams | Per-resource cache (depends on usage patterns) |
| Management Overhead | Easier (single resource) | Higher (multiple resources) |
| Quota Management | Hard to enforce per-team quotas | Easier to allocate limits per team |
| Scalability | Limited by a single resource | More scalable with multiple deployments |
| Access Control | More complex (RBAC per deployment) | Simpler (RBAC per resource) |
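For context, here is a minimal sketch of what the client code looks like in each scenario (Python, `openai` v1.x SDK; the endpoint, API version, key handling, and deployment names are hypothetical placeholders). The only thing that differs between the scenarios is the endpoint URL, since the `model` parameter always takes the deployment name:

```python
from openai import AzureOpenAI

# Scenario 1: one shared resource; each team targets its own deployment name.
# Endpoint, key, and deployment names are hypothetical placeholders.
client = AzureOpenAI(
    azure_endpoint="https://shared-aoai-eastus.openai.azure.com",
    api_key="<api-key>",  # use Entra ID or Key Vault in practice
    api_version="2024-10-21",
)

response = client.chat.completions.create(
    model="team-a-gpt-4o",  # the deployment name, not the base model name
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)

# Scenario 2 differs only in azure_endpoint (each team's own resource);
# the request shape is identical, so client code is unaffected by the choice.
```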
1. Does Scenario 2 improve latency compared to Scenario 1? Yes. When multiple teams run high concurrent workloads, Scenario 2 provides better isolation and avoids potential queuing delays caused by one team saturating the shared resource.
2. Does Scenario 2 improve caching (input token caching)? Caching is per resource, so if teams have similar workloads, a shared cache (Scenario 1) might be better; if their workloads are distinct, Scenario 2 can help by keeping an independent cache per team. In both cases, cache hits require requests to share an identical prompt prefix (see the caching sketch after this list).
3. When multiple teams call their GPT-4o models, do both scenarios behave the same? Mostly, because they still share the same Global Standard quota at the subscription level. However, in Scenario 2 one team’s high demand won’t impact the others as much.
4. How can latency be reduced for multiple teams using GPT-4o (other than provisioned throughput)?
   • Use regional deployments: deploying in multiple Azure regions can improve response times for globally distributed teams.
   • Optimize prompt design: reduce input and output token counts to speed up responses.
   • Batch API calls: move workloads that are not latency-sensitive to the Batch API so they stop competing with interactive traffic (see the Batch API sketch below).
   • Increase quota: throttling (429 responses) forces retries, which shows up as added latency.
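On the caching point (2), a minimal sketch of prompt structuring that favors cache hits, assuming the documented behavior that only an identical leading prefix (roughly the first 1,024 input tokens) can be served from cache; `STATIC_SYSTEM_PROMPT` and `build_messages` are hypothetical names:

```python
# Sketch: keep the long, static part of the prompt first so repeated requests
# share an identical prefix, which is what prompt caching matches on.
# STATIC_SYSTEM_PROMPT is a hypothetical placeholder for shared instructions,
# tool definitions, or few-shot examples reused across a team's calls.
STATIC_SYSTEM_PROMPT = "You are Contoso's support assistant. <long, unchanging instructions>"

def build_messages(user_input: str) -> list[dict]:
    return [
        # Static, cacheable prefix goes first and must be byte-identical
        # across requests for the cache to match.
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        # Variable content goes last so it does not break the shared prefix.
        {"role": "user", "content": user_input},
    ]
```

Where the API version supports it, the `usage.prompt_tokens_details.cached_tokens` field on the response reports how many input tokens were served from cache, so each team can verify whether it is actually getting hits.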
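And for the Batch API bullet in point 4, a sketch of the flow using the same `openai` v1.x SDK, assuming a Global Batch deployment is referenced inside `requests.jsonl` (endpoint, key, and file name are placeholders):

```python
from openai import AzureOpenAI

# Endpoint/key are hypothetical placeholders; use Entra ID or Key Vault in practice.
client = AzureOpenAI(
    azure_endpoint="https://shared-aoai-eastus.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-10-21",
)

# 1. Upload a JSONL file where each line is one chat-completions request
#    addressed to a Global Batch deployment.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# 2. Create the batch job; results arrive within the completion window,
#    so latency-sensitive interactive traffic no longer competes with it.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/chat/completions",  # path as in the Azure samples
    completion_window="24h",
)
print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) until done
```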

Please upvote and accept the answer if it helps!

