What is the best way to decrease latency in Azure OpenAI?

K S, Aptha 20 Reputation points
2025-02-09T17:10:48.19+00:00

I have the two scenarios below and want to know whether either offers any advantages.

  1. A single Azure OpenAI resource in East US shared by multiple teams in a resource group. Each team has a GPT-4o deployment with a different name in this single resource, using the Global Standard deployment type and its own capacity requirement.
  2. Multiple Azure OpenAI resources in East US, one per team, in a resource group. Each team has a GPT-4o deployment in its respective resource, using the Global Standard deployment type and its own capacity requirement.

I know that Global Standard capacity is set at the subscription level, so Scenarios 1 and 2 will have the same limits.

What I want to know is:

  • Are there any advantages to having multiple Azure OpenAI resources in a resource group (Scenario 2), especially in terms of latency, compared to Scenario 1?
  • Does Scenario 2 help with caching (input token caching) compared to Scenario 1?
  • When multiple teams call their GPT-4o models, do Scenarios 1 and 2 behave the same?
  • Is there any other way to decrease GPT-4o latency (other than provisioned managed throughput) when multiple teams are using GPT-4o?
Azure OpenAI Service

Accepted answer
Divyesh Govaerdhanan 6,400 Reputation points
2025-02-09T23:09:33.5166667+00:00

Hello,

Welcome to Microsoft Q&A.

Key Comparisons:

| Factor | Scenario 1 (Single Resource) | Scenario 2 (Multiple Resources) |
|---|---|---|
| Latency | Increased queue time if multiple teams use the resource heavily | Reduced latency due to isolation |
| Caching | Shared cache across teams | Per-resource cache (depends on usage patterns) |
| Management Overhead | Easier (single resource) | Higher (multiple resources) |
| Quota Management | Hard to enforce per-team quotas | Easier to allocate limits per team |
| Scalability | Limited by a single resource | More scalable with multiple deployments |
| Access Control | More complex (RBAC per deployment) | Simpler (RBAC per resource) |
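For context, here is a minimal sketch of what the client code looks like in each scenario (Python, `openai` v1.x SDK; the endpoint, API version, key handling, and deployment names are hypothetical placeholders). The only thing that differs between the scenarios is the endpoint URL, since the `model` parameter always takes the deployment name:

```python
from openai import AzureOpenAI

# Scenario 1: one shared resource; each team targets its own deployment name.
# Endpoint, key, and deployment names are hypothetical placeholders.
client = AzureOpenAI(
    azure_endpoint="https://shared-aoai-eastus.openai.azure.com",
    api_key="<api-key>",  # use Entra ID or Key Vault in practice
    api_version="2024-10-21",
)

response = client.chat.completions.create(
    model="team-a-gpt-4o",  # the deployment name, not the base model name
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)

# Scenario 2 differs only in azure_endpoint (each team's own resource);
# the request shape is identical, so client code is unaffected by the choice.
```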
1. Does Scenario 2 improve latency compared to Scenario 1? Yes. When multiple teams run high concurrent workloads, Scenario 2 provides better isolation and avoids potential queuing delays caused by one team saturating the shared resource.
2. Does Scenario 2 improve caching (input token caching)? Caching is per resource, so if teams have similar workloads, a shared cache (Scenario 1) might be better; if their workloads are distinct, Scenario 2 can help by keeping an independent cache per team. In both cases, cache hits require requests to share an identical prompt prefix (see the caching sketch after this list).
3. When multiple teams call their GPT-4o models, do both scenarios behave the same? Mostly, because they still share the same Global Standard quota at the subscription level. However, in Scenario 2 one team’s high demand won’t impact the others as much.
4. How can latency be reduced for multiple teams using GPT-4o (other than provisioned throughput)?
   • Use regional deployments: deploying in multiple Azure regions can improve response times for globally distributed teams.
   • Optimize prompt design: reduce input and output token counts to speed up responses.
   • Batch API calls: move workloads that are not latency-sensitive to the Batch API so they stop competing with interactive traffic (see the Batch API sketch below).
   • Increase quota: throttling (429 responses) forces retries, which shows up as added latency.
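On the caching point (2), a minimal sketch of prompt structuring that favors cache hits, assuming the documented behavior that only an identical leading prefix (roughly the first 1,024 input tokens) can be served from cache; `STATIC_SYSTEM_PROMPT` and `build_messages` are hypothetical names:

```python
# Sketch: keep the long, static part of the prompt first so repeated requests
# share an identical prefix, which is what prompt caching matches on.
# STATIC_SYSTEM_PROMPT is a hypothetical placeholder for shared instructions,
# tool definitions, or few-shot examples reused across a team's calls.
STATIC_SYSTEM_PROMPT = "You are Contoso's support assistant. <long, unchanging instructions>"

def build_messages(user_input: str) -> list[dict]:
    return [
        # Static, cacheable prefix goes first and must be byte-identical
        # across requests for the cache to match.
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        # Variable content goes last so it does not break the shared prefix.
        {"role": "user", "content": user_input},
    ]
```

Where the API version supports it, the `usage.prompt_tokens_details.cached_tokens` field on the response reports how many input tokens were served from cache, so each team can verify whether it is actually getting hits.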
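And for the Batch API bullet in point 4, a sketch of the flow using the same `openai` v1.x SDK, assuming a Global Batch deployment is referenced inside `requests.jsonl` (endpoint, key, and file name are placeholders):

```python
from openai import AzureOpenAI

# Endpoint/key are hypothetical placeholders; use Entra ID or Key Vault in practice.
client = AzureOpenAI(
    azure_endpoint="https://shared-aoai-eastus.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-10-21",
)

# 1. Upload a JSONL file where each line is one chat-completions request
#    addressed to a Global Batch deployment.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# 2. Create the batch job; results arrive within the completion window,
#    so latency-sensitive interactive traffic no longer competes with it.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/chat/completions",  # path as in the Azure samples
    completion_window="24h",
)
print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) until done
```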

Please upvote and accept the answer if it helps!

