Resiliency and continuity overview

How does Microsoft ensure business continuity if a disaster or other threat to service availability occurs?

Microsoft's Enterprise Resilience and Crisis Management (ERCM) team oversees business continuity management and disaster recovery activities across Microsoft services and cloud offerings. Representatives from Microsoft business units coordinate with the ERCM team to develop business continuity plans and validate compliance with business continuity requirements.

The Business Continuity Management (BCM) lifecycle is at the core of our BCM methodology. This three-phase process is designed to be adaptable so it can be implemented by a wide variety of business models across Microsoft. It begins with an Assessment phase to identify critical processes and objectives that should be included in the business continuity program. The Assessment phase also requires a Business Impact Analysis (BIA). The Planning phase focuses on developing and implementing resilience and recovery strategies and documenting them in official business continuity plans. Finally, Capability Validation tests business continuity plans and their implementations to verify effectiveness and identify potential improvements.

Business continuity strategies for Microsoft online services rely on hardware, network, and datacenter redundancy. Data replication between datacenters provides high availability and reliability during a catastrophic incident. It also increases resilience to more routine incidents, such as isolated hardware failure or data corruption.

How does Microsoft test business continuity and disaster recovery plans?

Microsoft's Enterprise Resilience and Crisis Management (ERCM) policy stipulates that all Microsoft business continuity and disaster recovery plans must be tested, updated, and reviewed on an annual basis. Microsoft online services test their business continuity plans at least annually per ERCM policies. After-action reports are created and reviewed to validate test results and to inform plan updates in response to any problems discovered during testing.

To validate resilience and recovery strategies against a wide range of potential incidents, the ERCM Program defines multiple categories of test scenarios affecting people, locations, and technology. The level of validation required for each service is based on the service's criticality, with more critical services receiving more rigorous validation. Each Microsoft online service team tests their business continuity plan according to ERCM guidelines to measure the plan's effectiveness and the service team's readiness to execute the plan.

Per ERCM guidelines, annual reviews of business continuity plans and capability validation must take place within 12 months of the last review. Capability validation must include review of supporting documentation, such as the BIA, to ensure it remains accurate. Microsoft makes capability validation results for select Microsoft online services available to our customers through quarterly reports.
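The 12-month review window described above can be expressed as a simple date check. The following is a minimal sketch, not Microsoft's tooling: the function names `add_months`, `review_due_date`, and `review_is_overdue` are hypothetical, and reading "within 12 months" as "on or before the same calendar day one year later" is an assumption made for illustration.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the month's last day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    # Clamp the day for shorter months (for example, Feb 29 + 12 months -> Feb 28).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def review_due_date(last_review: date, window_months: int = 12) -> date:
    """Latest date by which the next review must happen (assumed interpretation)."""
    return add_months(last_review, window_months)


def review_is_overdue(last_review: date, today: date, window_months: int = 12) -> bool:
    """True when `today` falls after the end of the review window."""
    return today > review_due_date(last_review, window_months)


if __name__ == "__main__":
    last = date(2023, 4, 24)  # sample date for illustration only
    print("Next review due by:", review_due_date(last))
    print("Overdue on 2024-05-01:", review_is_overdue(last, date(2024, 5, 1)))
```

A check like this could run against a plan inventory to flag plans whose review or capability validation is approaching the end of its window.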

How do Microsoft online services ensure system capacity meets demand?

Capacity planning helps service teams allocate the resources necessary to support Microsoft online service availability. Regular capacity planning is required as part of Microsoft's ERCM program. Service teams review capacity data during quarterly reviews and during emergency situations that warrant additional capacity review.

The raw data for capacity planning is maintained by each service team and includes metrics like system processing, memory, and hardware capacity. Scheduled reviews use a model of the system's current capacity and test it against projected needs in emergency situations. If the model indicates gaps in capacity, proposed changes to system capacity are submitted to service team leadership for review. Approved changes are incorporated into a new model before implementation by service team engineers.
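The review loop described above (model the current capacity, test it against projected emergency demand, flag gaps for leadership review) can be sketched as follows. This is an illustrative sketch only: the metric names, the `headroom` safety margin, and the sample figures are hypothetical and are not taken from Microsoft's capacity models.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CapacityGap:
    """One metric where modeled capacity falls short of projected need."""
    metric: str
    available: float
    required: float

    @property
    def shortfall(self) -> float:
        return self.required - self.available


def find_capacity_gaps(
    current: Dict[str, float],
    projected: Dict[str, float],
    headroom: float = 0.0,
) -> List[CapacityGap]:
    """Compare current capacity with projected demand, metric by metric.

    `headroom` is a hypothetical safety margin: 0.1 means the model
    requires 10% more capacity than the projected demand.
    """
    gaps = []
    for metric, demand in projected.items():
        required = demand * (1.0 + headroom)
        # A metric missing from the current model counts as zero capacity.
        available = current.get(metric, 0.0)
        if available < required:
            gaps.append(CapacityGap(metric, available, required))
    return gaps


if __name__ == "__main__":
    # Sample numbers for illustration only.
    current = {"cpu_cores": 1000, "memory_gb": 4000, "storage_tb": 500}
    projected = {"cpu_cores": 900, "memory_gb": 4200, "storage_tb": 450}
    for gap in find_capacity_gaps(current, projected, headroom=0.1):
        print(f"{gap.metric}: short by {gap.shortfall:.0f}")
```

In this sketch, the returned list of gaps plays the role of the proposed changes that would go to service team leadership; an empty list means the model shows no shortfall for the projected scenario.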

How do Microsoft online services maintain service availability during routine system failures?

Microsoft online services achieve service resilience through redundant architecture, data replication, and automated integrity checking. Redundant architecture involves deploying multiple instances of a service on geographically and physically separate hardware, providing increased fault tolerance for Microsoft online services. Data replication ensures there are always multiple copies of customer data in different fault zones, allowing critical customer data to be recovered if it is corrupted, lost, or even accidentally deleted by the customer. Automated integrity checking increases data availability by automatically restoring data impacted by many kinds of physical or logical corruption.
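To make the interplay of replication and integrity checking concrete, here is a minimal sketch under stated assumptions. It does not describe how Microsoft online services implement these features: the `Replica` class, the zone labels, the `repair_replicas` function, and the use of SHA-256 checksums to detect corruption are all hypothetical choices for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List


def checksum(data: bytes) -> str:
    """SHA-256 digest used here as the (assumed) integrity fingerprint."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Replica:
    zone: str    # hypothetical fault-zone label
    data: bytes  # the copy of the data held in that zone


def repair_replicas(replicas: List[Replica], expected_checksum: str) -> List[str]:
    """Overwrite corrupted replicas with the data from a healthy one.

    Returns the zones that were repaired. Raises RuntimeError when no
    replica matches the expected checksum, because there is nothing
    trustworthy to restore from.
    """
    healthy = [r for r in replicas if checksum(r.data) == expected_checksum]
    if not healthy:
        raise RuntimeError("no healthy replica available to restore from")
    source = healthy[0]
    repaired = []
    for replica in replicas:
        if checksum(replica.data) != expected_checksum:
            replica.data = source.data
            repaired.append(replica.zone)
    return repaired


if __name__ == "__main__":
    original = b"customer record"
    replicas = [
        Replica("zone-a", original),
        Replica("zone-b", b"customer rec\x00rd"),  # simulated corruption
        Replica("zone-c", original),
    ]
    print("Repaired zones:", repair_replicas(replicas, checksum(original)))
```

The sketch shows why both mechanisms matter together: the checksum only detects corruption, and recovery is possible only because an intact copy exists in another zone.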

Microsoft's online services are regularly audited for compliance with external regulations and certifications. Refer to the following tables for validation of controls related to resiliency and continuity.

Azure and Dynamics 365

| External audits | Section | Latest report date |
|---|---|---|
| ISO 27001/27002 Statement of Applicability | A.17.1: Information security continuity; A.17.2: Redundancies | April 24, 2023 |
| ISO 22301 | All controls | April 24, 2023 |
| | BC-1: Business continuity plans; BC-3: Business continuity and disaster recovery procedures; BC-4: BCDR testing; BC-7: Datacenter business continuity plans; BC-8: Datacenter business continuity testing; BC-9: Datacenter resiliency assessment; DS-5: Backup key service components; DS-6: Redundancy of critical components; DS-7: Automatic replication of customer data; DS-8: Backup schedule; DS-9: Backup restoration procedures; DS-11: Offsite backups; DS-14: Automatic restoration of customer services | August 24, 2023 |

Microsoft 365

| External audits | Section | Latest report date |
|---|---|---|
| FedRAMP (Office 365) | CP-2: Contingency plan; CP-3: Contingency training; CP-4: Contingency plan testing; CP-6: Alternate storage site; CP-7: Alternate processing site; CP-9: Information system backup; CP-10: Information system recovery and reconstitution | July 31, 2023 |
| ISO 27001/27002 Statement of Applicability | A.17.1: Information security continuity; A.17.2: Redundancies | March 2023 |
| ISO 22301 | All controls | March 2023 |
| | CA-49: Backup policies; CA-50: Business continuity; CA-51: Data replication | January 3, 2023 |
| SOC 3 | CUEC-09: EXO email restoration | January 3, 2023 |