Strengthening operational resilience and reducing concentration risk in financial services

2025-04-07

This article builds upon our earlier blog which provides practical guidance on how to strengthen operational resilience and manage concentration risk in financial services institutions (FSIs).

To ensure consistency with existing and upcoming regulations, we incorporated into our approach the main regulatory guidance for FSIs, including ongoing consultations on the topic of operational resilience.

These include:

Our approach is consistent with the listed publications and will be updated over time. Our objective is to help customers grow and innovate in a responsible and compliant way, in line with existing and future FSI regulations.

Overview

When it comes to the use of cloud technology in FSIs, operational resilience and concentration risk are interlinked because various regulations address concentration risk by strengthening operational resilience. Both are concerned with a wide set of measures including the identification and monitoring of critical third-party relationships, requirements to strengthen risk governance and management, and guidance around business continuity and exit planning.

We aim to help FSIs strengthen their operational resilience so that they can manage concentration risk at the enterprise level in a way that is consistent with regulatory guidelines. It's important that this topic is addressed holistically and in consideration of all third-party relationships. This includes the use of cloud services as well as dependencies on on-premises software products, which must be maintained and kept secure. In addition, subcontracting in relation to critical functions is commonly addressed in regulatory guidance.

In the June 22, 2023 FSB Consultation, a critical service was defined as a service whose failure or disruption could significantly impair a financial institution’s viability, critical operations, or its ability to meet key legal and regulatory obligations. It's primarily these services that are in focus when it comes to managing concentration risk.

Concentration must be distinguished from concentration risk. Concentration of third-party dependencies is commonplace today across many critical FSI services, and it may not be feasible to eliminate it. For instance, if you consider the broad usage of financial information networks (for example, Bloomberg or Thomson Reuters) and software publishers such as IBM and Oracle, it becomes clear that complete substitution would be challenging. One can devise an exit plan for a specific third-party service, but for large financial institutions it is difficult to discontinue all use of a third-party solution. Therefore, the focus should be on reducing concentration risk so that it stays within an organization’s risk tolerances.

While some of these examples are already systemically important, they often aren't perceived as carrying elevated levels of concentration risk because they're distributed across multiple, geographically dispersed locations. As a result, most failures have only a limited (nonglobal) impact. They also offer the highest levels of resilience through state-of-the-art design such as secure and resilient operations, use of automation, and implementation of Zero Trust. These designs have been confirmed by multiple independent third parties through various certification processes. This collective set of measures has brought down the overall risk profile even though, at a firm level, concentration is still in place.

FSIs must focus on strengthening operational resilience, and they must do so in a way that is consistent with regulations and guidelines. It’s important to note that regulations don't force FSIs to deploy hybrid or multicloud solutions but instead take a principled, risk-based, and technology-neutral approach to addressing concentration risk. These risks may also exist in an on-premises environment.

Beyond operational aspects, regulations also highlight nonoperational risks in relation to outsourcing, in particular financial insolvency risks and resolution approaches. Nonoperational risks can't be managed through technological measures but instead are addressed with legal measures during the contracting phase and by invoking the exit strategy.

A six-step approach for managing operational risk in financial services

There are several elements that must be considered before making critical choices on how to manage operational risks, which is why we introduced a six-step approach to addressing concentration risk and strengthening operational resilience:

[Figure: Managing operational resilience in financial services]

Step 1 – Update cloud risk governance

Cloud technology in financial services regulation today is mainly governed by third-party outsourcing guidelines that apply specifically in this context, with different sets of regulation applying to the on-premises ICT environment. More recent consultations, as referenced in the introduction, take a holistic approach when it comes to strengthening operational resilience. For instance, draft regulations such as DORA and the DORA ICT RMF won't apply only to on-premises environments but will also consider the use of ICT third-party service providers as an integral part of their framework.

Additionally, as described in the June 22, 2023 FSB Consultation, the primary focus is on critical services. We also see this in other regulations such as DORA. The focus on critical services is necessary because otherwise the scope of services to be assessed becomes too broad as technology services become increasingly interconnected.

Therefore, we recommend that firms revisit internal risk governance frameworks and holistically incorporate these new operational resilience measures with a focus on critical services, while taking into account the various guidelines on third-party or cloud outsourcing to ensure compliance.

Some appropriate questions to ask in this context include:

Have clear organizational risk tolerances been defined?
Which services are business critical?
Which plausible, realistic threat scenarios may impact these?
What is my firm’s overall risk appetite?

Risk governance frameworks must also be updated annually to ensure continued compliance in a rapidly evolving regulatory landscape.

The regulations and guidelines from the introduction align to these questions as a starting point when it comes to strengthening operational resilience. To account for these requirements across jurisdictions, Microsoft has created an extensive set of Financial Services Compliance Checklists to help customers self-assess against regulations in various countries/regions when it comes to the use of cloud technology. These checklists feature regulatory mappings and point to specific information relevant when assessing the use of Microsoft cloud technology.

In addition, the Compliance Program for Microsoft Cloud has been designed to help risk and compliance functions across all three lines of defense comply with these regulations and address overall risks related to the use of cloud. The program offers both proactive and reactive features, including a premium support channel for risk stakeholders.

Step 2 – Identify concentration

The higher the level of concentration of services with a single third-party provider, the higher the potential of adverse impact if something goes wrong, which is referred to as concentration risk. This risk is why organizations need to have a clear understanding of all dependencies between business processes, ICT platforms, software, and third-party relationships.

Mapping these dependencies is typically performed as part of the business impact analysis (BIA). Once all third-party relationships have been identified and mapped to critical use cases, it becomes possible to identify the level of concentration of critical services with a single third-party provider. This view enables firms to identify where they must focus when managing concentration risk.

For Microsoft Cloud services, it's possible to identify potential concentrations for critical workloads by reviewing subscription and tenant IDs in the Azure portal. Firms may also view and filter Azure resource information that can help provide a detailed understanding of which services are being used within an organization. This information is exportable and can be correlated against internal BIA information to help identify critical dependencies in real-time. For Microsoft 365, firms can also generate reports in the Admin Center both on purchased licenses and on activity.
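The correlation between exported resource information and BIA data can be illustrated with a small sketch. It assumes a hypothetical BIA extract in which each row names a business service, whether it is critical, and the third-party provider it depends on; the row layout, the provider names, and the 50% threshold used in the usage note are illustrative assumptions, not part of any Microsoft export format or regulatory requirement.

```python
from collections import defaultdict

# Hypothetical BIA extract: (business service, is_critical, third-party provider).
# All names and values are illustrative, not taken from a real inventory.
BIA_ROWS = [
    ("payments", True, "provider-a"),
    ("core-banking", True, "provider-a"),
    ("market-data", True, "provider-b"),
    ("intranet", False, "provider-a"),
    ("fraud-detection", True, "provider-a"),
]


def critical_concentration(rows):
    """Return each provider's share of critical services as a fraction of 1."""
    counts = defaultdict(int)
    for _service, is_critical, provider in rows:
        if is_critical:
            counts[provider] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {provider: n / total for provider, n in counts.items()}


def providers_over_threshold(rows, threshold):
    """List providers whose share of critical services exceeds the threshold."""
    shares = critical_concentration(rows)
    return sorted(p for p, share in shares.items() if share > threshold)


if __name__ == "__main__":
    for provider, share in sorted(critical_concentration(BIA_ROWS).items()):
        print(f"{provider}: {share:.0%} of critical services")
```

With the sample rows, calling providers_over_threshold(BIA_ROWS, 0.5) flags only provider-a. A count of services is a deliberately crude measure; a firm would likely weight services by business impact according to its own risk tolerances.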

Step 3 – Assess alternatives

Once a firm understands its dependencies and how they relate to critical use cases, the next step is to address associated concentration risk. Start by identifying a short list of feasible alternatives that can be investigated further in depth. Questions you may want to address include:

Which practical alternatives exist across on-premises, hybrid, multi-sourcing, and full cloud?
What are the drawbacks and benefits of each?
How do their risk profiles compare with each other? How resilient is each alternative?
Which ones best fit the organization’s risk appetite and cloud strategy?
When planning an exit, is a full vendor exit possible and desirable? What alternatives can be considered?

Concentration risk is an aggregate term pointing to the higher impact an adverse event would have on one or more critical services. When assessing such risks, one must evaluate each of the underlying threat scenarios, which in turn leads to a nuanced view that includes both benefits and drawbacks associated with concentration of services. Threat factors to consider include data center disasters, hardware failures, network outages, cyber-attacks, faulty changes and upgrades, and human errors. For each of these, appropriate mitigating measures should be carefully considered. Firms must also consider mitigation costs, complexity, and availability of in-house skills when choosing the preferred solution for addressing concentration risk.

It may be possible and, in some cases, even desirable to maintain concentration so firms can maximally strengthen resilience. Rather than trying to remove the third-party dependency entirely, firms should focus on strengthening operational resilience by addressing the underlying threat scenarios associated with concentration risk (for example, a regional data center disaster event). These scenarios can often be addressed more easily and with fewer drawbacks by (i) reducing the probability that the threat event occurs and (ii) limiting its impact by reducing concentration:

Reducing probability is achieved by strengthening resilience in the solution design. A robust set of risk management procedures can enhance operational resilience despite concentration of critical functions with a single third-party provider. Measures may include running on state-of-the-art infrastructure, running a Zero Trust security model, patching systems with the latest updates, ensuring business continuity measures have been set up and proven to work, and deploying modular and open-source technologies (for example, containers). Each of these contributes towards obtaining a maximally resilient environment.
Limiting impact by reducing concentration at lower levels is achieved by designing your services to operate across multiple availability zones in an active/active configuration, by ensuring sufficient redundancies and recovery mechanisms are in place (for instance, backups), and by leveraging geo-redundant designs. Because of their distributed nature, such configurations not only result in higher resilience and better SLAs, but also help to mitigate threats such as the loss of a single data center or even an entire region. The impact of threats can be reduced, thereby also reducing concentration risk, in some cases even going beyond what is feasible in on-premises or hybrid scenarios.
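The two levers above can be made concrete with a simple expected-loss calculation. This is a minimal sketch that assumes risk is scored as probability multiplied by impact, which is a common simplification rather than a method prescribed by the regulations discussed here; the ThreatScenario class, the mitigate helper, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreatScenario:
    """A single threat scenario with illustrative, firm-estimated figures."""
    name: str
    annual_probability: float  # assumed estimate between 0 and 1
    impact: float              # assumed loss in arbitrary units

    def expected_loss(self) -> float:
        """Score the scenario as probability multiplied by impact."""
        return self.annual_probability * self.impact


def mitigate(scenario, probability_factor=1.0, impact_factor=1.0):
    """Return a copy of the scenario with probability and impact scaled."""
    return ThreatScenario(
        name=scenario.name,
        annual_probability=scenario.annual_probability * probability_factor,
        impact=scenario.impact * impact_factor,
    )


if __name__ == "__main__":
    baseline = ThreatScenario("regional data center disaster", 0.02, 1000.0)
    # (i) a more resilient design halves the probability;
    # (ii) a multi-zone, geo-redundant deployment cuts the impact to a tenth.
    treated = mitigate(baseline, probability_factor=0.5, impact_factor=0.1)
    print(f"baseline: {baseline.expected_loss():.2f}")
    print(f"treated:  {treated.expected_loss():.2f}")
```

In this sketch the provider is the same before and after mitigation, so concentration is unchanged while the scored risk falls, which is the distinction the two measures rely on.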

In conclusion, if a full cloud topology offers higher resilience compared to alternatives, concentration risk will also be effectively reduced although concentration itself is not diminished. This outcome can be both acceptable and even desirable.

We recommend incorporating this operating model into a corporate cloud policy, as this provides guidance to business and ICT teams on what the preferred topology is for an organization and which elements to consider as part of their solution design. Another reason to do so is that strategic alliances are often made with one or more cloud providers for the delivery of ICT third-party services, and the associated risks are often managed at a higher level.

Step 4 – Design for resilience

At this stage, your ICT technology, security, and operations teams start to dive deep into the solution design by building the conclusions and requirements from the previous steps into individual use cases.

ICT teams must ensure that they configure and design applications to be secure and resilient by default. This means ensuring that solutions are reliable, secure, and free from single points of failure, and that they leverage availability zones where appropriate to meet the recovery time objectives (RTOs), recovery point objectives (RPOs), and service levels (SLAs) required by the business. Implementing backups where necessary and updating business continuity and exit plans are also part of the process.

When trying to maximize the end-to-end SLAs with Microsoft cloud technology, consider reviewing our SLAs for Microsoft Online Services. You'll find that SLAs vary by service and are dependent upon customer design choices, with higher SLAs for deployments across multiple availability zones. By choosing the right design, customers may achieve a 99.999% monthly availability SLA in the cloud.
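How design choices affect end-to-end availability can be approximated with standard reliability arithmetic. The sketch below assumes component failures are independent and uses illustrative availability figures; it is not how Microsoft calculates or commits to its SLAs, which are defined in the published SLA terms for each service. The function names are hypothetical.

```python
import math

# A 30-day month is assumed for the downtime conversion.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


def redundant_availability(availabilities):
    """Availability of redundant replicas where at least one must be up.

    Assumes replica failures are independent, which real deployments
    only approximate.
    """
    return 1 - math.prod(1 - a for a in availabilities)


def monthly_downtime_minutes(availability):
    """Downtime per 30-day month implied by an availability fraction."""
    return (1 - availability) * MINUTES_PER_30_DAY_MONTH


if __name__ == "__main__":
    chain = serial_availability([0.999, 0.999])
    replicated = redundant_availability([0.99, 0.99])
    print(f"two components in series: {chain:.6f}")
    print(f"two redundant replicas:   {replicated:.6f}")
    print(f"downtime at 99.999%:      {monthly_downtime_minutes(0.99999):.3f} min")
```

The arithmetic shows why chaining dependencies lowers the end-to-end figure while redundancy across zones raises it, and why a 99.999% target leaves well under one minute of downtime per month.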

Ensuring a strong and secure-by-default design is no easy task and there's a risk that ICT teams leave weaknesses or make errors during their implementations. This challenge is why we offer guidance on best practices when deploying on the Microsoft cloud.

SaaS services such as Microsoft 365 and Dynamics 365 have been designed from the ground up to maximize resilience and minimize disruption of these cloud services. We have built-in redundancies for services such as Exchange, SharePoint, OneDrive, Teams, and Microsoft Entra, and we designed the service so that it can operate with multiple layers of abstraction between the hardware and data center layers. Details on how this works are available in our service assurance documentation on Microsoft Compliance.

We also offer guidance on how to deploy ransomware protection for your Microsoft 365 tenant by leveraging the built-in versioning and restore capabilities of Microsoft 365. When configured correctly, this may be sufficient for your organization, but many large enterprises seek the ability to recover and restore from any incident, including ransomware attacks, in a more granular way and across extended time spans. This is where Microsoft 365 Backup comes in: it helps you recover quickly, with frequent recovery points and fast bulk recovery times. It's available both as a Microsoft service and through partners who may offer additional features to address your needs. You can find recognized partner solutions built on the Microsoft 365 Backup Storage platform here. You can also read about how to make a backup solution procurement decision in this whitepaper. Solutions that rely solely on an exported copy of your Microsoft 365 data will struggle to provide sufficient recovery time performance to help you regain business operations after a resilience-impacting event. Please consider the performance of the solution you procure carefully so that you are best positioned to meet DORA requirements.

Things can become complex when a solution is built on Azure (IaaS). Microsoft therefore created the Microsoft Azure Well-Architected Framework to provide additional guidance on how to design for reliability and security. Microsoft also provides resources to aid firms in implementing high-resilience scenarios, including an overview of Azure reliability, an overview of Azure security, and an introduction to the concepts of Azure availability zones and regions. Other aspects to consider include establishing operational excellence and, to a lesser extent, optimizing cost and performance.

Considerable attention must also go to overall security, especially in financial services. We recommend evaluating our Security in the Microsoft Cloud Adoption Framework for Azure, which provides core principles and practical guidance on how to address today’s cybersecurity threats most effectively on our cloud. It also includes links to reference architectures and security baselines for different use cases.

Finally, ICT teams should deploy cloud resources in a way that meets all regulatory and internal policy requirements. Azure Governance helps firms implement these regulatory and internal control requirements by enforcing such policies on Azure cloud resources. In addition, firms may review our introduction to Azure hybrid and multicloud, which helps support hybrid and multicloud deployments.

Step 5 – Test business continuity plan

The process of business continuity planning and testing is well established in regulated FSIs. The focus in this step is on evaluating the impact that a disruption of the third-party outsourcing (use of Microsoft cloud) may have on this process. A good example of testing is addressed in Article 26, "Testing of the ICT business continuity plans", of the DORA draft RTS on ICT risk management tools, methods, processes, and policies.

There's an aspect of shared responsibility to this, and our Microsoft resiliency and continuity overview helps customers prepare for disaster scenarios. It offers specific guidance for each Microsoft cloud service, including how Microsoft deals with business continuity for each service and guidance on how customers can leverage Microsoft cloud services to prepare for disaster events.

Financial firms should test business continuity regularly when using Microsoft Cloud, with responsibilities following the shared responsibility model. For SaaS solutions such as Microsoft 365 and for some Azure services, Microsoft is responsible for performing these tests at regular intervals. For more insight into how specific Microsoft cloud services have performed, customers may review the quarterly Business Continuity and Disaster Recovery Plan Validation Report that Microsoft publishes on its Service Trust Portal under the Business Continuity and Disaster Recovery section.

Step 6 – Prepare exit plans

Some threat scenarios can't be managed with business continuity plans or technical resiliency measures, such as the risk of bankruptcy or resolution of the third-party provider. An exit plan has the benefit of dealing with such catastrophic scenarios and should be seen as complementary to having tested business continuity plans.

This is why every organization should have an overall exit strategy and individual exit plans for its critical use cases, a requirement that is grounded in several regulations and upcoming guidelines.

Section 3.7 of the June 2023 FSB Consultation offers helpful guidance pointing to elements of an exit strategy and exit plans, such as (i) contractually agreeing on transition periods to minimize the risk of disruption; (ii) ensuring that logical and physical assets, including data and applications, are returned in a cost-effective and timely manner; and (iii) having contractual provisions relating to the ownership, maintenance, preservation, and long-term availability of records as appropriate.

DORA Article 28 (8) also requires that firms develop exit strategies for ICT services supporting critical or important functions. Similarly, the UK PRA’s Supervisory Statement SS2/21 on Outsourcing and third party risk management of March 2021 has an extensive description of exit planning requirements in section 10 and also introduces the concept of stressed exits. Financial firms in these jurisdictions must evaluate these guidelines and put in place adequate contractual arrangements that can support their execution, for instance by requiring mandatory transition periods during which ICT third-party service providers continue providing relevant services (for example, DORA Article 30.3.f references the contractual establishment of a mandatory adequate transition period).

A technical paper written by the European Banking Federation in June 2020 describes some of the ways exit plan testing can be achieved for financial institutions in line with EBA Guidelines.