Microsoft Azure Incident Readiness - Unified

When an Azure incident is declared, we communicate updates to impacted subscriptions or tenants via the Service Issues blade in Azure Service Health (within the Azure portal).

Before an incident

We recommend the following steps to be prepared and help protect your organization:

Get notified and stay updated for incidents affecting your Azure services

  1. Get familiar with Azure Service Health in the Azure portal – your ‘go to’ place in case of issues.

  2. Configure Service Health alerts to notify you about any issues – by email, SMS, webhook, etc. – at the subscription level, by service(s), and/or by region(s).

    • Service issues notification type will alert your organization that your services are impacted by service incidents.

    • Security advisory notification type will alert your organization that your services are impacted by either a security incident or privacy incident.

    Here are foundational alert configuration recommendations:

    • For Service issues, Planned Maintenance & Health Advisory types:

      • Your critical workloads – set up alerts for the subscription(s) & service(s) that power your critical workload(s).
      • Set up alerts for foundational services in the Azure stack:
        • "Network Infrastructure" service – foundational layer in the Azure stack that all types of workloads & applications, from IaaS to SaaS, rely on.
        • "Microsoft Azure portal" service – foundational service used to manage Azure resources. Its versatility makes it a ‘catch-all’ service: impact summaries for a variety of scenarios are communicated under this service.
    • For Security Advisories type:

      • All Azure subscriptions and services – bad actors typically target less-used resources, so it’s important that this type of alert covers all Azure resources.

    Additionally, the Azure Monitor Baseline Alerts solution provides comprehensive guidance and code for implementing a baseline of platform alerts, as well as Service Health alerts, via policies and initiatives in Azure environments, with options for automated or manual deployment.

  3. Ensure the following roles have the right contact information, and review them regularly so they stay current. For more information, please review Stay informed about Azure security issues - Azure Service Health | Microsoft Learn.

    • Subscription Administrator and Subscription Owner – contacts that will be used to receive notifications (via Azure Portal and/or email, depending on the communication requirements) for security issues with subscription-level impact.

    • Tenant Global Admin and Technical contact – contacts that will be used to receive notifications (via Azure Portal and/or email, depending on the communication requirements) for security issues with tenant-level impact.

    • Security admin – can review and make changes to the security policy, apply recommendations, and view and dismiss alerts.

  4. Consider using Health Alerts or Scheduled Events so that your people and systems are informed about resource-specific issues and upcoming maintenance events.
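As an illustration of step 4, the sketch below shows one way a system running on an Azure virtual machine might read Scheduled Events. It assumes the Instance Metadata Service endpoint and the 2020-07-01 API version, which are not stated in this guide, and the sample payload is made up for demonstration. The function names are hypothetical. Treat it as a starting point under those assumptions rather than a definitive implementation.

```python
import json
import urllib.request

# Assumed endpoint: the Instance Metadata Service, reachable only from inside an Azure VM.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def fetch_scheduled_events(timeout=5):
    """Fetch the Scheduled Events document (works only when run inside an Azure VM)."""
    request = urllib.request.Request(
        SCHEDULED_EVENTS_URL, headers={"Metadata": "true"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_events(document):
    """Reduce a Scheduled Events document to the fields an on-call team usually needs."""
    summaries = []
    for event in document.get("Events", []):
        summaries.append({
            "id": event.get("EventId"),
            "type": event.get("EventType"),
            "status": event.get("EventStatus"),
            "not_before": event.get("NotBefore"),
            "resources": event.get("Resources", []),
        })
    return summaries


# Made-up sample payload, used here so the sketch runs outside Azure.
sample_document = {
    "DocumentIncarnation": 1,
    "Events": [
        {
            "EventId": "example-event-id",
            "EventType": "Reboot",
            "ResourceType": "VirtualMachine",
            "Resources": ["example-vm"],
            "EventStatus": "Scheduled",
            "NotBefore": "Mon, 19 Sep 2016 18:29:47 GMT",
        }
    ],
}

for item in summarize_events(sample_document):
    print(item)
```

Inside a virtual machine, summarize_events(fetch_scheduled_events()) could be polled on a timer and the result forwarded to whatever paging or chat tool your team already uses.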

To understand Azure’s communication principles, please review the Advancing the outage experience—automation, communication, and transparency | Azure Blog and Updates | Microsoft Azure.
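The foundational alert recommendations in step 2 above can be sketched as data. The example below builds the body of an activity log alert rule that filters on Service Health events. The field paths, the incident type values, and the helper name build_service_health_alert are assumptions based on the commonly used ARM template shape for these alerts, and the subscription and action group IDs are placeholders. Verify the structure against current Azure documentation before deploying anything based on it.

```python
import json

# Assumed mapping from the notification types in this guide to activity log values.
INCIDENT_TYPES = {
    "service_issue": "Incident",
    "planned_maintenance": "Maintenance",
    "health_advisory": "Informational",
    "security_advisory": "Security",
}


def build_service_health_alert(name, subscription_id, action_group_id,
                               event_types, services=None, regions=None):
    """Return an ARM-style resource body for a Service Health alert (hypothetical helper)."""
    conditions = [
        {"field": "category", "equals": "ServiceHealth"},
        {"anyOf": [
            {"field": "properties.incidentType", "equals": INCIDENT_TYPES[t]}
            for t in event_types
        ]},
    ]
    if services:  # omit to cover all services
        conditions.append({
            "field": "properties.impactedServices[*].ServiceName",
            "containsAny": list(services),
        })
    if regions:  # omit to cover all regions
        conditions.append({
            "field": "properties.impactedServices[*].ImpactedRegions[*].RegionName",
            "containsAny": list(regions),
        })
    return {
        "type": "Microsoft.Insights/activityLogAlerts",
        "name": name,
        "location": "Global",
        "properties": {
            "enabled": True,
            "scopes": [f"/subscriptions/{subscription_id}"],
            "condition": {"allOf": conditions},
            "actions": {"actionGroups": [{"actionGroupId": action_group_id}]},
        },
    }


# Placeholder identifiers; replace with your own values.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
ACTION_GROUP_ID = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/example-rg"
    "/providers/microsoft.insights/actionGroups/example-ag"
)

# Foundational services, for Service issues, Planned Maintenance & Health Advisory types.
foundational = build_service_health_alert(
    "foundational-services", SUBSCRIPTION_ID, ACTION_GROUP_ID,
    event_types=["service_issue", "planned_maintenance", "health_advisory"],
    services=["Network Infrastructure", "Microsoft Azure portal"],
)

# Security advisories: no service or region filter, so every resource is covered.
security = build_service_health_alert(
    "security-advisories", SUBSCRIPTION_ID, ACTION_GROUP_ID,
    event_types=["security_advisory"],
)

print(json.dumps(foundational, indent=2))
```

Leaving the service and region filters off the security rule mirrors the recommendation that Security Advisories alerts cover all subscriptions and services.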

Increase your security and resiliency posture to potentially avoid or minimize impact of incidents

  1. Review and implement the Operational Security Best Practices for protecting your data, applications, and other assets, especially these:

    • Enforce Multi-Factor Authentication to alleviate concerns about exposure.

    • Implement alerts for High Risk users. Configure conditional access to ensure you are notified when there is a “risky user” in your environment.

    • Control the movement of subscriptions from and into directories. For governance purposes, global administrators can allow or disallow directory users from moving subscriptions between directories. This ensures that your organization has full visibility into the subscriptions used under its directories and prevents subscriptions from being moved to an unknown directory.

  2. Optimize critical workload reliability, security & more using the Azure Well-Architected Framework (WAF) and Review. Please also consider these actions to complement the work in the WAF.

    • Leverage the Reliability workbook, which is integrated into the Azure portal under the Azure Advisor blade, to review the reliability posture of your applications, assess risks and plan improvements.

    • Expand workload deployments across regions for business continuity and disaster recovery (BCDR). Use the published full list of Azure region pairs.

    • Expand workload deployments within a region across Availability Zones.

    • Consider VM isolation for business-critical workloads (see Isolation for VMs in Azure - Azure Virtual Machines | Microsoft Learn).

    • Consider Maintenance Configurations to control and manage updates for many Azure virtual machines.

    • Use Azure Chaos Studio to evaluate the resiliency of your Azure apps. Subject your Azure apps to controlled faults, real or simulated, to observe application resiliency and response to disruptions such as network latency, storage outage, expiring secrets, and datacenter outage.

    • Utilize the Service Retirement Workbook, which is integrated into the Azure portal under the Azure Advisor blade, as your single centralized resource-level view of service retirements. It helps you assess impact, evaluate options, and plan for migration from retiring services and features.

Please follow Azure's Advancing Reliability Blog to stay up to date with Azure's continuous resiliency efforts.

During an incident

When your key subscriptions are impacted by an incident, it is important that you know where and how to find the relevant communications surrounding this incident:

  1. Review Azure Service Health alerts in the Azure portal for the latest updates from our engineers.

    • It is important to note that specific role contacts mentioned in the ‘before an incident’ section (i.e. subscription administrator / owner, technical / privacy contact, tenant admin) may also get email notifications for security or privacy incidents.

  2. If there are issues accessing the portal, check the public Azure status page azure.status.microsoft as a backup.

  3. If there are ever issues with the Status page, check for any updates via @AzureSupport on "X" (formerly Twitter).

Why use Service Health instead of the public Status page?

Many customers check our publicly-accessible status pages (like azure.status.microsoft) at the first signs of potential issues, to see if there are known issues with our cloud services. These pages only show widespread issues that meet certain criteria, not smaller incidents that impact fewer customers.

Azure Service Health (within the Azure portal) knows which subscriptions and tenants you manage, so it shows a much more accurate view of any known issues impacting your services. It also lets you configure alerts, so that you can be notified automatically.

When is it useful to open a support case?

If the service incident is already being communicated via Service Health, all the latest information will be provided there, and there is no need to open a support request. If you believe you’re impacted by a service incident but do not see the issue represented on the Service Health page, please open a support request.

If there are questions not covered by security issue materials received, please open a support request referencing the tracking ID.
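The guidance above amounts to a small decision procedure, which the sketch below writes out so it can be embedded in a runbook or chat bot. The function name, parameters, and the tracking ID shown are hypothetical and exist only for illustration.

```python
def support_request_advice(in_service_health, believe_impacted,
                           open_security_questions=False, tracking_id=None):
    """Return a recommendation that mirrors the support-case guidance in this section."""
    if open_security_questions:
        # Questions not covered by the security issue materials: always open a request.
        reference = f" referencing tracking ID {tracking_id}" if tracking_id else ""
        return f"Open a support request{reference}."
    if in_service_health:
        # Latest information is already provided via Service Health.
        return "Follow Service Health updates; no support request is needed."
    if believe_impacted:
        # Impact suspected but not represented in Service Health.
        return "Open a support request."
    return "No action needed."


# Hypothetical tracking ID, for illustration only.
print(support_request_advice(True, True))
print(support_request_advice(False, True))
print(support_request_advice(True, True, open_security_questions=True,
                             tracking_id="ABCD-123"))
```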

After an incident

  1. Read the Post Incident Review (PIR) from the Health history pane of Azure Service Health (or via customer-configured Service Health alerts) to understand what we learned.

  2. For major incidents that met our public Status page criteria, join an Azure Incident Retrospective livestream to get any questions answered, or watch the recording.

  3. If you think you may be eligible for an SLA credit, create a new support request with a problem type of "Refund Request" – and include the incident Tracking ID.