
Azure OpenAI calls started throwing 500 errors on Apr 28, 2026 at 2:30:13.876 AM IST

Harpinder Singh 0 Reputation points
2026-04-28T05:55:48.2066667+00:00

We are running GPT-series models via Foundry deployments in the eastus2 region. Today at 2:30 AM IST, the service started throwing 500 internal server errors:

openai.InternalServerError: Error code: 500 - {'statusCode': 500, 'message': 'Internal server error', 'activityId': 'e8871bf8-c814-4253-97a6-848c233d17bc'}

openai.InternalServerError: Error code: 500 - {'statusCode': 500, 'message': 'Internal server error', 'activityId': '9912fd82-48e0-484f-94df-817b8b9cbcc9'}

I had to migrate off Azure in the middle of the night. The issue continued at least until 3:15 AM IST, at which point I had successfully switched over.

I would like to understand the reason for the incident, why it lasted so long, and why https://azure.status.microsoft/en-us/status/history/ does not show any incident history.

Azure OpenAI in Foundry Models

2 answers

  1. Sina Salam 28,931 Reputation points Volunteer Moderator
    2026-05-11T17:04:57.95+00:00

    Hello Harpinder Singh,

    Welcome to Microsoft Q&A, and thank you for posting your question here.

    I understand that your Azure OpenAI calls started throwing 500 errors on Apr 28, 2026 at 2:30:13.876 AM IST.

    This looks like a regional Azure OpenAI service disruption. Retry logic alone is insufficient for that kind of failure. The robust long-term solution is to combine multi-region deployments with automatic failover, controlled retry logic, and Service Health monitoring. Reference: https://statusgator.com/services/azure/azure-openai-service
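As a rough illustration of the multi-region failover idea, here is a minimal Python sketch. It is not an official Azure or OpenAI pattern: `RegionUnavailable`, `call_with_failover`, and `fake_send` are hypothetical names introduced for this example, and the `send` callable stands in for whatever code issues the request against a given regional deployment.

```python
class RegionUnavailable(Exception):
    """Hypothetical marker for a region-level failure, e.g. repeated HTTP 500s."""


def call_with_failover(regions, send):
    """Try each region in order and return (region, result) from the first success.

    `regions` is an ordered list of region names, primary first.
    `send(region)` is a caller-supplied function that performs the request
    and raises RegionUnavailable when that region cannot serve it.
    """
    failures = {}
    for region in regions:
        try:
            return region, send(region)
        except RegionUnavailable as exc:
            # Remember why this region failed, then move on to the next one.
            failures[region] = str(exc)
    raise RuntimeError(f"all regions failed: {failures}")


def fake_send(region):
    """Simulated request: eastus2 is down, every other region answers."""
    if region == "eastus2":
        raise RegionUnavailable("Internal server error")
    return f"response from {region}"


if __name__ == "__main__":
    print(call_with_failover(["eastus2", "swedencentral"], fake_send))
```

In a real setup, `send` would wrap a client configured for each regional endpoint and translate the 500 responses shown in the question into `RegionUnavailable`.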

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    If this was helpful, please close out the thread by upvoting and accepting it as an answer.


  2. Jerald Felix 11,555 Reputation points Volunteer Moderator
    2026-04-29T00:08:48.3166667+00:00

    Hello Harpinder Singh,

    Greetings!

    Thanks for raising this question in the Q&A forum.

    The 500 Internal Server Error you encountered on your Azure OpenAI (Foundry) deployments in the eastus2 region was most likely caused by a transient service-side outage or infrastructure disruption on the Azure OpenAI backend. These can happen due to platform-level issues such as compute node failures, model serving disruptions, or regional capacity problems, none of which are caused by anything on your end.

    Here's what you should know and what you can do going forward:

    1. Check Azure Service Health for your subscription: The public status page (azure.status.microsoft) only shows widespread, customer-impacting events. For region- or service-specific incidents that affect a subset of customers, go to the Azure Portal → Search "Service Health" → Check "Health Alerts" and "Health History" for your specific subscription. This often has more granular incident details that the public page doesn't show.
    2. Raise a Support Request for the incident timeline: Since you have the exact timestamp (Apr 28, 2:30 AM IST) and activity IDs from the error messages, raise an Azure Support ticket and include those activity IDs (like e8871bf8-c814-4253-97a6-848c233d17bc). Microsoft's support team can use these to trace exactly what happened on the backend during that window.
    3. Implement retry logic with exponential backoff: For future protection, add automatic retries in your code when a 500 error is received. This handles short transient spikes without requiring manual intervention.
    4. Set up fallback regions proactively: Since you had to migrate in the middle of the night, consider pre-configuring a secondary Azure OpenAI deployment in another region (e.g., eastus or swedencentral) and using a load balancer or API Management layer to fail over automatically if one region goes down.
    5. Enable Azure Alerts: In the Azure Portal, set up Resource Health Alerts on your Azure OpenAI resource so you get notified immediately when a service degradation starts, rather than discovering it through errors.
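The retry advice in step 3 could look roughly like the Python sketch below. It is a sketch under assumptions, not a definitive implementation: `TransientServerError` is a hypothetical stand-in for a 5xx exception such as the `openai.InternalServerError` shown in the question, `call_with_backoff` is a name invented here, and the attempt count and delays are illustrative defaults. The delay uses exponential backoff with random jitter so that many clients do not retry in lockstep.

```python
import random
import time


class TransientServerError(Exception):
    """Hypothetical stand-in for a 5xx error such as openai.InternalServerError."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(TransientServerError,), sleep=time.sleep):
    """Call fn(), retrying on `retry_on` with exponential backoff and jitter.

    The upper bound on the wait doubles each attempt (base_delay, 2x, 4x, ...)
    and is capped at max_delay. The last failure is re-raised unchanged.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
```

Backoff of this kind only covers short transient spikes. For an outage of roughly 45 minutes like the one described here, the retries will be exhausted, which is why the fallback region in step 4 still matters.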

    The fact that the issue lasted ~45 minutes and wasn't reflected on the public status page is unfortunately common for incidents that affect only a subset of customers or a specific deployment cluster. Your best bet for a formal post-incident report is through the support ticket route mentioned in Step 2.

    If this answer helps, kindly accept it, which will help others who have similar questions.

    Best Regards,

    Jerald Felix.

