Load balancer for LLM models in Azure AI Foundry

Guy Aronson 20 Reputation points
2025-02-24T09:59:36.02+00:00

Is load balancing possible for multiple LLM deployments in Azure AI Foundry? My app's LLMs are overloaded despite quota increases.

A load balancer distributing traffic across the LLM deployments would help.

Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

Accepted answer
  1. Prashanth Veeragoni 4,930 Reputation points Microsoft External Staff Moderator
    2025-02-25T08:19:53.45+00:00

    Hi Guy Aronson,

    Welcome to Microsoft Q&A forum. Thank you for posting your query.

    Yes, load balancing across multiple LLM deployments in Azure AI Foundry is possible, but Azure does not provide an out-of-the-box load balancer specifically for LLM models in AI Foundry. You can, however, implement a custom load-balancing approach using other Azure services. Here's how to handle this issue effectively.

    Possible Reasons for Overloading Despite Quota Increases:

    Uneven distribution of requests: if all requests hit a single deployment, it becomes overloaded while the others sit underutilized.

    Rate limits per instance: even with a quota increase, each individual deployment has its own request rate limits (see the snippet after this list for a quick way to confirm this).

    High latency: some deployments may respond slowly, causing request backlogs.

    Lack of request routing: if your app does not distribute requests effectively, it can overwhelm specific deployments.
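    To confirm that per-deployment throttling is the culprit, check for HTTP 429 responses. Below is a minimal sketch using the openai Python SDK (v1.x); the endpoint, key, and deployment name are placeholders, not values from your environment:

    ```python
    # Quick check for per-deployment throttling (HTTP 429) with the openai
    # Python SDK (v1.x). Endpoint, key, and deployment name are placeholders.
    from openai import AzureOpenAI, RateLimitError

    client = AzureOpenAI(
        azure_endpoint="https://my-foundry-resource.openai.azure.com",  # placeholder
        api_key="<your-api-key>",                                       # placeholder
        api_version="2024-06-01",
    )

    try:
        response = client.chat.completions.create(
            model="gpt-4o-deployment",  # your deployment name
            messages=[{"role": "user", "content": "ping"}],
        )
        print(response.choices[0].message.content)
    except RateLimitError as exc:
        # A 429 here means this single deployment is throttled even though
        # the overall quota was raised -- a signal to spread the load.
        print("Deployment throttled:", exc)
    ```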

    Implementing Load Balancing for LLMs in Azure AI Foundry:

    Since Azure AI Foundry does not natively support LLM load balancing, consider these three approaches:

    Approach 1: Use Azure Front Door (Recommended)

    Approach 2: Use Azure Application Gateway with API Management

    Approach 3: Implement a Custom Load Balancer in Code. Rotate requests across your deployments, monitor endpoint health, and retry against another endpoint if one is unresponsive (a minimal sketch follows this list).
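    As an illustration of Approach 3, here is a minimal round-robin sketch calling the Azure OpenAI chat completions REST API directly. The two endpoints, keys, and the deployment name are hypothetical placeholders:

    ```python
    # Minimal round-robin load balancer over two Azure OpenAI deployments.
    # Endpoints, keys, and the deployment name are hypothetical placeholders.
    import itertools
    import requests

    ENDPOINTS = [
        {"url": "https://eastus-resource.openai.azure.com", "key": "<key-1>"},
        {"url": "https://westus-resource.openai.azure.com", "key": "<key-2>"},
    ]
    DEPLOYMENT = "gpt-4o-deployment"
    API_VERSION = "2024-06-01"

    _rotation = itertools.cycle(range(len(ENDPOINTS)))

    def chat(messages):
        """Send a chat request, rotating endpoints; on a throttled (429),
        failed (5xx), or unreachable endpoint, retry against the next one."""
        for _ in range(len(ENDPOINTS)):
            ep = ENDPOINTS[next(_rotation)]
            url = (f"{ep['url']}/openai/deployments/{DEPLOYMENT}"
                   f"/chat/completions?api-version={API_VERSION}")
            try:
                resp = requests.post(
                    url,
                    headers={"api-key": ep["key"]},
                    json={"messages": messages},
                    timeout=30,
                )
                if resp.status_code == 200:
                    return resp.json()
                # 429 or 5xx: fall through and try the next endpoint
            except requests.RequestException:
                pass  # endpoint unresponsive -- try the next one
        raise RuntimeError("All deployments are throttled or unreachable")

    print(chat([{"role": "user", "content": "Hello"}]))
    ```

    Skipping straight to the next endpoint on a 429 keeps latency low; under sustained load you would also want exponential backoff once all endpoints are throttled.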

    For enterprise-scale deployments, the best approach is Azure Front Door + API Management. If you want a quick, cost-effective solution, try custom code-based load balancing.

    Since Approach 1 is the recommended option, let's walk through its implementation.

    Azure Front Door can act as a global load balancer and distribute requests across multiple LLM deployments.

     How to set up:

     Deploy multiple instances of the same LLM model in different regions.

     Configure Azure Front Door to route traffic between different endpoints.

     Use weighted or priority-based routing (illustrated in the sketch after this list):

     Weighted routing: distributes traffic across deployments in proportion to assigned weights.

     Priority-based routing: redirects traffic to a lower-priority deployment if a higher-priority one fails.
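    Front Door applies these policies for you at the edge, but to make the behavior concrete, here is a small Python sketch of how weighted and priority-based selection interact (backend names, weights, and priorities are made up for illustration):

    ```python
    # Conceptual illustration of Front Door's routing methods: traffic only
    # reaches the lowest (healthy) priority tier, and within that tier it is
    # split in proportion to weight. Names and numbers are made up.
    import random

    BACKENDS = [
        {"name": "eastus",  "weight": 70,  "priority": 1, "healthy": True},
        {"name": "westus",  "weight": 30,  "priority": 1, "healthy": True},
        {"name": "standby", "weight": 100, "priority": 2, "healthy": True},
    ]

    def pick_backend():
        healthy = [b for b in BACKENDS if b["healthy"]]
        if not healthy:
            raise RuntimeError("No healthy backends")
        top = min(b["priority"] for b in healthy)           # priority-based routing
        tier = [b for b in healthy if b["priority"] == top]
        return random.choices(                              # weighted routing
            tier, weights=[b["weight"] for b in tier]
        )[0]

    # ~70% of requests go to eastus and ~30% to westus; standby only
    # receives traffic if both priority-1 backends become unhealthy.
    print(pick_backend()["name"])
    ```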

    Hope this helps. Do let us know if you have any further queries.

    ------------- 

    If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful".

    Thank you.

    1 person found this answer helpful.

1 additional answer

  1. Guy Aronson 20 Reputation points
    2025-02-26T07:07:46.39+00:00

    Hi @Prashanth Veeragoni,

    Yes, the answer was helpful.

    Many thanks!

