Hi bfirkovskiy,
Please confirm that your notification hubs are working okay now. Yesterday I tested mine in East US and it seems to be working fine. Below is the latest update as of this writing:
What happened?
Between 06:52 UTC and 21:45 UTC on 03 June 2025, a platform issue affected the underlying service instances for Notification Hubs in the East US region. Customers experienced errors when sending notifications to recipients hosted in this region. Notifications or registrations attempted during the impact window will not be recovered, as the compute layer responsible for storing them was unavailable.
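Since anything attempted during the window will not be recovered, you may want to work out which of your own sends fell inside it and resend those. Here is a minimal sketch, assuming your application keeps its own log of send attempts with timezone-aware timestamps. The `(notification_id, sent_at)` log format and the function names are hypothetical and not part of any Notification Hubs API. Note that the window is the full incident duration, so treat it as an upper bound rather than your exact impact.

```python
from datetime import datetime, timezone

# Impact window as stated in the incident update (UTC).
IMPACT_START = datetime(2025, 6, 3, 6, 52, tzinfo=timezone.utc)
IMPACT_END = datetime(2025, 6, 3, 21, 45, tzinfo=timezone.utc)


def in_impact_window(sent_at: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls inside the window."""
    if sent_at.tzinfo is None:
        raise ValueError("sent_at must be timezone-aware")
    return IMPACT_START <= sent_at.astimezone(timezone.utc) <= IMPACT_END


def notifications_to_resend(send_log):
    """Filter hypothetical (notification_id, sent_at) pairs to those in the window."""
    return [nid for nid, sent_at in send_log if in_impact_window(sent_at)]


# Hypothetical application-side send log, for illustration only.
sample_log = [
    ("n-001", datetime(2025, 6, 3, 6, 30, tzinfo=timezone.utc)),
    ("n-002", datetime(2025, 6, 3, 12, 0, tzinfo=timezone.utc)),
    ("n-003", datetime(2025, 6, 3, 22, 0, tzinfo=timezone.utc)),
]
print(notifications_to_resend(sample_log))  # ['n-002']
```

If your timestamps are stored in local time, convert them to timezone-aware values first; the check above rejects naive datetimes so a missing offset does not silently shift the window.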
What do we know so far?
We identified that the service instances responsible for processing requests had become unhealthy. This unavailability of the compute layer prevented customer notifications from being processed correctly, resulting in the service disruption described above.
How did we respond?
- 06:00 UTC on 03 June 2025 – We received an alert via internal service telemetry indicating Notification Hub availability degradation in the East US region.
- 06:52 UTC on 03 June 2025 – Customer impact began.
- 07:00 UTC on 03 June 2025 – We identified that an active cluster was unable to serve traffic in the East US region.
- 08:30 UTC on 03 June 2025 – Our team's investigation found that the issue was with service instances on unhealthy cluster nodes. We engaged other internal teams to work on mitigation efforts to get the cluster running again.
- 13:00 UTC on 03 June 2025 – After further investigation, a parallel mitigation workstream was begun to build a new cluster. Work also continued to recover the previous cluster.
- 20:41 UTC on 03 June 2025 – The parallel cluster was deployed and transitioned to an active status to handle Notifications and traffic in East US. Customers should have experienced increasing success from this point. Teams monitored closely to validate complete mitigation.
- 21:45 UTC on 03 June 2025 – After a period of observation and confirmation from customers, we confirmed that Notification Hubs service requests had returned to pre-incident levels and customer impact was mitigated.
What happens next?
- Our team will be completing an internal retrospective to understand the incident in more detail. We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours, to share more details on what happened and how we responded. After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings.
- To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
- For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs
- The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources. For guidance on implementing monitoring to understand granular impact, see https://aka.ms/AzPIR/Monitoring
- Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness
Please click Accept Answer if the above was helpful.
Thanks.
-TP