@Anonymous Thanks for reporting this.
It looks like this issue could have been caused by the Azure outage on June 9th.
Preliminary Post Incident Review (PIR) - Azure Portal - Errors accessing the Azure portal
This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation, to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details/learnings.
What happened?
Between 15:10 UTC and 17:10 UTC on 9 June 2023, customers may have experienced error notifications when trying to access the Azure portal (portal.azure.com). Customers may also have experienced issues accessing other services built on the Azure portal, like the Microsoft Entra Admin Center (entra.microsoft.com) and Microsoft Intune (intune.microsoft.com).
What went wrong and why?
Our internal telemetry reported an anomaly with increased request rates, and the Azure portal displaying a “service unavailable” message in multiple geographies. Traffic analysis showed an anomalous spike in HTTP requests being issued against Azure portal origins, bypassing existing automatic preventive recovery measures and triggering the service unavailable response. We will share more details when our investigation is complete.
How did we respond?
We were alerted by our internal monitoring of the issue impacting availability of the Azure portal. Engineering teams across Azure portal and networking were engaged within 15 minutes and actively investigated the issue. The following actions were taken to mitigate the incident:
- Firewall rules were adjusted to block the traffic.
- Traffic throttling rules were adjusted to throttle the requests.
- Additional Azure portal server instances were added to handle increased load.
- Any unhealthy Azure portal instances were rebooted.
After applying the above mitigation steps, Azure portal availability continued to improve. Our internal monitoring started reporting a healthy state back to baseline at 17:10 UTC for all Azure portal endpoints.
How are we making incidents like this less likely or less impactful?
- Making the Azure portal more efficient so scale up runs more quickly (Completed).
- Increasing Azure portal scale to cope with high demand loads more efficiently (Completed).
- Blocking invalid requests and server responses more aggressively (Completed).
- Using proactive logic in adjusting traffic blocking and throttling rules (In Progress).
- Improving our internal Azure portal monitoring to detect such indicators more quickly and efficiently (In Progress).
- Making the Azure portal startup process faster (In Progress).
How can customers make incidents like this less impactful?
- Azure Command-Line Interface (https://learn.microsoft.com/en-us/cli/azure/) and PowerShell (https://learn.microsoft.com/powershell/azure/get-started-azureps) were not impacted during this incident, and customers and partners could still use them to manage their Azure environments.
- Azure Resource Manager (ARM) REST APIs, as well as tools that are built using API endpoints and scripting, could be used to manage resources when the Azure portal was unavailable: https://learn.microsoft.com/rest/api/resources/
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts.
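As a rough illustration of the ARM REST API option mentioned above, here is a minimal Python sketch that lists the resource groups in a subscription without going through the portal. It is not taken from the PIR. The function names are hypothetical, the `api-version` value is an assumption that should be checked against the REST reference linked above, and the bearer token is assumed to have been obtained separately (for example with `az account get-access-token --resource https://management.azure.com/`).

```python
import json
import urllib.parse
import urllib.request

ARM_ENDPOINT = "https://management.azure.com"
# Assumed API version; verify the current value in the ARM REST reference.
API_VERSION = "2021-04-01"


def build_list_resource_groups_request(subscription_id, access_token):
    """Build (but do not send) a GET request that lists resource groups."""
    query = urllib.parse.urlencode({"api-version": API_VERSION})
    url = (
        f"{ARM_ENDPOINT}/subscriptions/"
        f"{urllib.parse.quote(subscription_id)}/resourcegroups?{query}"
    )
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


def list_resource_group_names(subscription_id, access_token):
    """Send the request and return the resource group names it reports.

    This sketch reads only the first page of results and ignores any
    pagination link in the response.
    """
    request = build_list_resource_groups_request(subscription_id, access_token)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [group["name"] for group in payload.get("value", [])]
```

Calling `list_resource_group_names("<subscription-id>", "<token>")` with real values would issue the request directly against the ARM endpoint, so it should keep working even when the portal front end is returning errors, provided ARM itself is healthy.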
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/QNPD-NC8