Azure openai calls started throwing 500 errors since Apr 28, 2026 at 2:30:13.876 am IST

Question

Azure openai calls started throwing 500 errors since Apr 28, 2026 at 2:30:13.876 am IST

Harpinder Singh 0

We are running GPT series of models via foundry deployments. Our models are deployed in eastus2 region. Today at 2:30 am IST, the service started throwing 500 internal server errors

openai.InternalServerError: Error code: 500 - {'statusCode': 500, 'message': 'Internal server error', 'activityId': 'e8871bf8-c814-4253-97a6-848c233d17bc'}

openai.InternalServerError: Error code: 500 - {'statusCode': 500, 'message': 'Internal server error', 'activityId': '9912fd82-48e0-484f-94df-817b8b9cbcc9'}

I had to migrate from Azure in middle of the night. The issue continued at least until 3:15 AM IST, at which point I had successfully switched over.

I would like to understand the reason for the incident, and why it lasted so long, and why does https://azure.status.microsoft/en-us/status/history/ not show any incident history.

Anshika Varshney 10,655 Reputation points Microsoft External Staff Moderator

2026-04-30T05:00:20.3433333+00:00
Hi Harpinder Singh,

Thanks for sharing this issue. Getting 500 errors suddenly without any change can be confusing, but this usually relates to service side behavior rather than your code.

Let me explain in a simple way and share what you can check.

First, what this error means A 500 error means the request reached Azure OpenAI successfully, but something failed internally on the service side.

In most cases, this is not caused by your request or configuration. It usually indicates

temporary service interruption

backend issues in the region

infrastructure or capacity problem

In your case, since it started at a specific time and affected working calls, it is very likely a temporary backend issue in that region.

Second, why the status page may not show it. The public Azure status page only shows major incidents that affect many customers globally.

If the issue is

limited to a region

affecting only some deployments

then it may not appear there.

For more detailed information, Service Health in Azure portal usually shows region level or subscription level alerts.

Third, common causes of 500 errors Based on similar cases, 500 errors usually fall into two categories

Service side issues

temporary outages

model serving disruptions

high load or capacity constraints

Request related triggers

very large prompts or responses

malformed or invalid output from model

content filtering silently blocking some inputs

These are known patterns seen in Azure OpenAI scenarios.

Fourth, things you can do to handle this

Add retry logic This is the most important practice

500 errors are often temporary, so retrying after a short delay usually works

Using exponential retry makes your application more stable for these cases.

Check Service Health in Azure portal

go to Service Health

check health alerts and history

This gives more accurate insight compared to public status pages.

Validate your requests Even though this looks like a service issue, still check

input size is not too large

prompt is not overly long

output format is valid

Sometimes specific inputs can trigger internal failures.

Test with simple request Try a small and simple prompt

if simple requests work but large ones fail, then it may be input related

if nothing works, it is likely service side

Use multi region setup if possible If your application is critical, using another region as fallback helps reduce impact when one region has issues

Fifth, important point When errors start suddenly without any changes and affect multiple requests, it strongly points to a backend or regional issue, not something from your code

This matches your scenario where it started at a specific time and continued for some time.

Official reference for monitoring You can check Service Health:

https://learn.microsoft.com/en-us/azure/service-health/

In short

500 error usually means temporary server issue

often not caused by your configuration

retry logic is the best immediate handling

check Service Health for region level details

This type of issue usually resolves automatically once the backend stabilizes

Hope this helps clarify the situation and gives you a clear way. Do let me know if you have any further queries.

Thankyou!
Anshika Varshney 10,655 Reputation points Microsoft External Staff Moderator

2026-05-01T09:26:21.21+00:00

Hello Harpinder Singh,

We haven’t heard from you on the last response and was just checking back to see if you have a resolution yet. In case if you have any resolution, please do share that same with the community as it can be helpful to others. Otherwise, will respond with more details and we will try to help.

Thankyou!
Anshika Varshney 10,655 Reputation points Microsoft External Staff Moderator

2026-05-04T02:59:35.0733333+00:00

Hello Harpinder Singh,

Please let me know if the issue persists after these checks. If you have any remaining questions or need additional details, I’ll be glad to provide further clarification or guidance. If the above steps resolve your issue, kindly confirm.

Thankyou!

2 answers

Your answer

Anshika Varshney 10,655 Reputation points Microsoft External Staff Moderator

2026-05-01T09:26:21.21+00:00

Hello Harpinder Singh,

We haven’t heard from you on the last response and was just checking back to see if you have a resolution yet. In case if you have any resolution, please do share that same with the community as it can be helpful to others. Otherwise, will respond with more details and we will try to help.

Thankyou!
Anshika Varshney 10,655 Reputation points Microsoft External Staff Moderator

2026-05-04T02:59:35.0733333+00:00

Hello Harpinder Singh,

Please let me know if the issue persists after these checks. If you have any remaining questions or need additional details, I’ll be glad to provide further clarification or guidance. If the above steps resolve your issue, kindly confirm.

Thankyou!

Answer 1

Hello Harpinder Singh,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

I understand that your Azure openai calls started throwing 500 errors since Apr 28, 2026 at 2:30:13.876 am IST.

The issue is a regional Azure OpenAI service disruption, the only correct long-term solution is to implement multi-region deployments with automatic failover, controlled retry logic, and Service Health monitoring retry alone is insufficient. - https://statusgator.com/services/azure/azure-openai-service

I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.

Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful.

Answer 2

Hello Harpinder Singh,

Greetings!

Thanks for raising this question in Q&A forum.

The 500 Internal Server Error you encountered on your Azure OpenAI (Foundry) deployments in the eastus2 region was most likely caused by a transient service-side outage or infrastructure disruption on the Azure OpenAI backend. These can happen due to platform-level issues such as compute node failures, model serving disruptions, or regional capacity problems — none of which are caused by anything on your end.

Here's what you should know and what you can do going forward:

Check Azure Service Health for your subscription: The public status page (azure.status.microsoft) only shows widespread, customer-impacting events. For region- or service-specific incidents that affect a subset of customers, go to the Azure Portal → Search "Service Health" → Check "Health Alerts" and "Health History" for your specific subscription. This often has more granular incident details that the public page doesn't show.
Raise a Support Request for the incident timeline : Since you have the exact timestamp (Apr 28, 2:30 AM IST) and activity IDs from the error messages, raise an Azure Support ticket and include those activity IDs (like e8871bf8-c814-4253-97a6-848c233d17bc). Microsoft's support team can use these to trace exactly what happened on the backend during that window.
Implement retry logic with exponential backoff: For future protection, add automatic retries in your code when a 500 error is received. This handles short transient spikes without requiring manual intervention.
Set up fallback regions proactively: Since you had to migrate in the middle of the night, consider pre-configuring a secondary Azure OpenAI deployment in another region (e.g., eastus or swedencentral) and use a load balancer or API Management layer to failover automatically if one region goes down.
Enable Azure Alerts: In the Azure Portal, set up Resource Health Alerts on your Azure OpenAI resource so you get notified immediately when a service degradation starts, rather than discovering it through errors.

The fact that the issue lasted ~45 minutes and wasn't reflected on the public status page is unfortunately common for incidents that affect only a subset of customers or a specific deployment cluster. Your best bet for a formal post-incident report is through the support ticket route mentioned in Step 2.

If this answer helps you kindly accept the answer which will help others who have similar questions.

Best Regards,

Jerald Felix.

Share via

Azure openai calls started throwing 500 errors since Apr 28, 2026 at 2:30:13.876 am IST

2 answers

Your answer