Hi @Jerzy Czopek, thanks for posting your question here. This is an interesting scenario.
When setting up load balancing with Azure APIM across multiple instances of the Azure OpenAI service, you only provide the backend IDs as pool members for load balancing. Please refer to the load-balancing implementation sample outlined in the article Using Azure API Management Circuit Breaker and Load balancing with Azure OpenAI Service.
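As a minimal sketch, assuming you have already created a backend entity of type Pool (named `openai-backend-pool` here purely for illustration) whose members are your individual Azure OpenAI backends, the API's inbound policy only needs to reference the pool and APIM distributes the calls across its members:

```xml
<policies>
    <inbound>
        <base />
        <!-- "openai-backend-pool" is a placeholder ID for a backend pool
             whose members are the individual Azure OpenAI backends. -->
        <set-backend-service backend-id="openai-backend-pool" />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>
```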
This approach does not expose the underlying OpenAI model deployed at each endpoint.
Please refer to the document Intelligent Load Balancing with APIM for OpenAI for the different load-balancing options available and their respective approaches.
It is also advised to use the same OpenAI model across all endpoints when setting up load balancing, as the currently available distribution options (round-robin, random, priority, weighted) do not provide a way to configure what you are seeking.
However, if you decide to load balance across multiple versions of OpenAI models, you may consider using the set-backend-service policy to direct an incoming API request to an alternate backend. This approach uses conditional logic to route requests based on parameters such as location, the gateway that was called, or other expressions such as the API version. Once this is in place, you can add a check on the instance endpoint and route incoming calls accordingly. Please refer to the documentation Reference backend using set-backend-service policy to learn more.
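As an illustrative sketch only (the backend IDs `openai-backend-gpt4` and `openai-backend-gpt35` and the version value are hypothetical; substitute the identifiers and condition that match your setup), a policy could inspect the `api-version` query parameter and route to a matching backend:

```xml
<policies>
    <inbound>
        <base />
        <choose>
            <!-- Route requests for a newer API version to the backend hosting
                 the newer model. Backend IDs below are placeholders. -->
            <when condition="@(context.Request.Url.Query.GetValueOrDefault("api-version", "") == "2024-06-01")">
                <set-backend-service backend-id="openai-backend-gpt4" />
            </when>
            <otherwise>
                <!-- Default route for all other requests -->
                <set-backend-service backend-id="openai-backend-gpt35" />
            </otherwise>
        </choose>
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>
```

The same `<choose>` pattern works for any request attribute available in the policy expression context, such as a header or a path segment identifying the deployment.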
Hope this answers your question. Please let us know if you need any additional information.