Agent Service in Sweden Central is out of service

HAL9000 0 Reputation points
2026-01-29T13:27:25.6733333+00:00

All agents hosted in Agent Service v1 of my Foundry resources in Sweden Central have disappeared. Trying to create a new one via the portal UI or via the SDK returns an HTTP 500.

The Agent Service v1 in my Foundry resource in West Europe works correctly.

Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

2 answers

  1. Sina Salam 27,791 Reputation points Volunteer Moderator
    2026-01-30T12:34:30.1966667+00:00

    Hello HAL9000,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that your Agent Service in Sweden Central is out of service.

    To clear up the common points of confusion here: the incident window (Jan 27) does not always match when users see symptoms (Jan 28–29), the status page can lag the actual user experience, and the coupling of downstream services (Azure OpenAI → Foundry Agents) is easy to underestimate.

    In Sweden Central, Foundry Agent Service v1 shows agents missing and creation returns HTTP 500, persisting after the Jan 27 mitigation window. Therefore, the goals are to (a) restore working agents, (b) recover and validate metadata, and (c) prevent recurrence with EU‑resident resilience.

    If you need to restore service immediately for production workloads:

    Fail traffic over to West Europe immediately, removing Sweden Central from the critical path while you repair it; a sketch follows.
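
    A minimal client-side failover sketch, assuming the agents are reachable over plain HTTPS (the endpoint URLs and request path below are hypothetical placeholders, not your real resource names): try Sweden Central first and fall back to West Europe on server errors.

    ```python
    import requests

    # Hypothetical regional endpoints; substitute your real Foundry endpoints.
    ENDPOINTS = [
        "https://my-foundry-swc.services.ai.azure.com",  # Sweden Central (impaired)
        "https://my-foundry-weu.services.ai.azure.com",  # West Europe (healthy)
    ]

    def post_with_failover(path, payload, headers, timeout=10.0):
        """POST to each regional endpoint in order, failing over on 5xx or network errors."""
        last_error = None
        for base in ENDPOINTS:
            try:
                resp = requests.post(f"{base}{path}", json=payload, headers=headers, timeout=timeout)
                if resp.status_code < 500:
                    return resp  # success, or a client error that retrying elsewhere won't fix
                last_error = RuntimeError(f"{base} returned HTTP {resp.status_code}")
            except requests.RequestException as exc:  # DNS failure, timeout, connection reset
                last_error = exc
        raise last_error  # both regions failed
    ```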

    You can also recover the Sweden Central Agent Service v1 state (the “agents disappeared” issue): force a control‑plane refresh and re‑enumerate the agents, re‑attach downstream resources/tools and re‑save each agent configuration, and, if agents still don’t list or load cleanly, re‑create them from your source of truth (see the sketch below).
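
    A minimal re‑enumeration/re‑creation sketch, assuming the azure-ai-projects Python SDK; the endpoint, model deployment name, and the in‑code agent definitions are hypothetical, and method names can differ between preview and GA versions.

    ```python
    from azure.ai.projects import AIProjectClient
    from azure.identity import DefaultAzureCredential

    # Hypothetical project endpoint; replace with your Foundry project's endpoint.
    client = AIProjectClient(
        endpoint="https://my-foundry-swc.services.ai.azure.com/api/projects/my-project",
        credential=DefaultAzureCredential(),
    )

    # Step 1: re-enumerate what the control plane currently reports.
    existing = {agent.name for agent in client.agents.list_agents()}
    print("Agents visible in Sweden Central:", existing or "none")

    # Step 2: re-create anything missing from a source-of-truth definition
    # (a hypothetical in-code list here; in practice, version-controlled config).
    desired_agents = [
        {"name": "support-agent", "model": "gpt-4o", "instructions": "You are a support assistant."},
    ]
    for spec in desired_agents:
        if spec["name"] not in existing:
            agent = client.agents.create_agent(**spec)
            print(f"Re-created {agent.name} ({agent.id})")
    ```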

    In addition, you can make recurrence operationally harmless by running active‑active within the EU (for example, Sweden Central + West Europe, or West Europe + North Europe): keep redundant agents and model deployments in two EU regions and front them with Azure Front Door or Traffic Manager using health probes and weighted/failover routing (a probe sketch follows). This is directly consistent with Azure’s own workaround guidance (route to alternative regions during OpenAI incidents) and industry best practice highlighted after the Jan 27 outage. - https://rssfeed.azure.status.microsoft/en-us/status/feed/
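
    For the health probes, a minimal per‑region probe sketch using only the standard library plus requests (the /healthz path, port, and probed URL are hypothetical): Front Door or Traffic Manager marks the region unhealthy whenever the regional service stops answering.

    ```python
    from http.server import BaseHTTPRequestHandler, HTTPServer
    import requests

    # Hypothetical regional endpoint this probe guards.
    REGIONAL_ENDPOINT = "https://my-foundry-swc.services.ai.azure.com"

    class HealthProbe(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_response(404)
                self.end_headers()
                return
            try:
                # A lightweight call that exercises the regional service.
                resp = requests.get(f"{REGIONAL_ENDPOINT}/", timeout=5)
                healthy = resp.status_code < 500
            except requests.RequestException:
                healthy = False
            self.send_response(200 if healthy else 503)
            self.end_headers()
            self.wfile.write(b"OK" if healthy else b"UNHEALTHY")

    if __name__ == "__main__":
        # Front Door / Traffic Manager probes GET /healthz on this port.
        HTTPServer(("", 8080), HealthProbe).serve_forever()
    ```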

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.


  2. SRILAKSHMI C 14,140 Reputation points Microsoft External Staff Moderator
    2026-01-29T14:00:00.36+00:00

    Hello HAL9000,

    Welcome to Microsoft Q&A and thank you for reaching out.

    As per the team:

    What happened?

    Between 09:22 UTC and 16:12 UTC on 27 January 2026, a platform issue impacted the Azure OpenAI Service in the Sweden Central region. Impacted customers may have seen HTTP 500/503 errors, failed inference requests, and issues with model deployment metadata. This issue also affected downstream AI services dependent on Azure OpenAI in this region.

    What do we know so far?

    Our initial investigation indicates that the issue may be related to elevated error handling within one of our production model dependencies. This temporarily affected request processing, specifically authorization of incoming requests, causing intermittent service degradation that impacted request success rates. We mitigated the issue by stabilizing traffic flow, making adjustments to improve request handling and resilience, validating system health, and monitoring recovery to ensure normal operation was restored.

    How did we respond?

    • 09:22 UTC on 27 January 2026 – The issue was detected through service monitoring, which is also when customers began to observe intermittent availability issues.
    • 12:36 UTC on 27 January 2026 – Initiated mitigation to restart the IRM service on the Sweden Central clusters.
    • 12:46 UTC on 27 January 2026 – Identified that the Sweden Central cluster was seeing pods crashing with out-of-memory errors.
    • 13:02 UTC on 27 January 2026 – Initiated a mitigation workflow by scaling out nodes in the cluster to improve request handling and resilience.
    • 15:30 UTC on 27 January 2026 – Started increasing the memory available to the pods to alleviate memory load on the cluster.
    • 15:53 UTC on 27 January 2026 – Completed the memory increase in the pods.
    • 16:12 UTC on 27 January 2026 – Service(s) restored, and customer impact mitigated.

    What happens next?

    Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers. To get notified if a PIR is published, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts

    • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs
    • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
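
    If you prefer to configure the Service Health alert mentioned above programmatically rather than through the portal, here is a minimal sketch assuming the azure-mgmt-monitor package (the subscription ID, resource group, and action group are hypothetical placeholders, and the exact request-body shape can vary across SDK versions):

    ```python
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.monitor import MonitorManagementClient

    # Hypothetical identifiers; replace with your own.
    subscription_id = "00000000-0000-0000-0000-000000000000"
    action_group_id = (
        f"/subscriptions/{subscription_id}/resourceGroups/rg-monitoring"
        "/providers/microsoft.insights/actionGroups/ops-team"
    )

    client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

    # Activity-log alert that fires on any Service Health event in the subscription.
    client.activity_log_alerts.create_or_update(
        "rg-monitoring",            # resource group
        "service-health-alert",     # alert rule name
        {
            "location": "Global",
            "scopes": [f"/subscriptions/{subscription_id}"],
            "condition": {"all_of": [{"field": "category", "equals": "ServiceHealth"}]},
            "actions": {"action_groups": [{"action_group_id": action_group_id}]},
            "enabled": True,
        },
    )
    ```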

    Thank you!

