Business continuity and disaster recovery options for FSLogix

Note

All diagrams are examples based on Azure Virtual Desktop and are applicable to other virtual desktop platforms.

An effective business continuity and disaster recovery (BCDR) plan focuses on the processes and resources an organization needs to keep operating during a catastrophe or other significant outage. Roaming user profiles aren't commonly described as a business or mission-critical component of a BCDR strategy. In a virtual desktop environment, a user is unaware they have a roaming profile. The profile is roamed to provide users with a consistent experience regardless of the virtual machine they sign in to. Business or mission-critical data shouldn't be stored in a user's profile if at all possible. Using OneDrive, SharePoint, or another solution is an effective means of protecting data during a BCDR event without relying on the data roaming with the user as part of their profile. The right level of profile protection is best determined in a recovery-time objective (RTO) and recovery-point objective (RPO) exercise, where cost, benefit, and risk can be weighed based on organizational and business goals.

Option 1: No profile recovery

While this option doesn't seem like a BCDR design, it's focused on ensuring business and mission-critical data isn't in the user's profile. During a disaster, users would create new profiles in a new location, on a new storage provider, or both. This option is the most cost-effective in terms of infrastructure, but it carries a penalty in user experience because users start with a new profile.

Figure 1: No Profile Recovery | FSLogix standard containers (VHDLocations)

The diagram shows a multi-region Host Pool using Azure Virtual Desktop. The primary and failover regions each have a dedicated Azure Files share using zone-redundant storage (ZRS), which provides high availability within the region. The failover region has Session Hosts that are stopped or deallocated. In a disaster, the failover region becomes the primary region, and users sign in to those Session Hosts and create new profiles on the Azure Files share in that region.
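The per-region configuration described above can be sketched as follows. This is a minimal illustration, not a deployment script: the storage account names and share paths are hypothetical, and the sketch only assembles the `Enabled` and `VHDLocations` values that would be written to the FSLogix Profiles registry key on each Session Host.

```python
# Hypothetical Azure Files share paths (assumptions); replace with your own.
VHD_LOCATIONS_BY_REGION = {
    "primary": r"\\primarystorage.file.core.windows.net\profiles",
    "failover": r"\\failoverstorage.file.core.windows.net\profiles",
}


def profile_settings(region: str) -> dict:
    """Return the FSLogix Profiles values for Session Hosts in one region.

    Each region points only at its own share, so users who sign in to the
    failover region after a disaster get a new, empty profile there.
    """
    return {
        "Enabled": 1,
        "VHDLocations": [VHD_LOCATIONS_BY_REGION[region]],
    }


primary_hosts = profile_settings("primary")
failover_hosts = profile_settings("failover")
```

Because neither region references the other's share, no profile data is replicated between them, which is what keeps this option inexpensive.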

Option 2: Cloud Cache (primary / failover)

A failover design is a common strategy to ensure the availability and reliability of your infrastructure in case of a disaster or a failure. Cloud Cache lets you use FSLogix with this type of failover design. With Cloud Cache, you can configure your devices to use two (2) storage providers that store your profile data in different locations. Cloud Cache synchronizes your profile data to each of the two storage providers asynchronously, so each provider receives your latest changes, though a provider can briefly lag behind because the writes are asynchronous. Some of your devices are in the primary location and the other devices are in the failover location. Cloud Cache prioritizes the first storage provider listed, which should be the one closest to your device, and uses the other storage provider as a backup. For example, if your primary device is in West US and your failover device is in East US, you can configure Cloud Cache as follows:

  • The primary device uses a storage provider in West US as the first option and a storage provider in East US as the second option.
  • The failover device uses a storage provider in East US as the first option and a storage provider in West US as the second option.
  • If the primary device or the closest storage provider fails, you can switch to the failover device or the backup storage provider and continue your work without losing your profile data.
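The provider ordering in the list above can be sketched as follows. The share paths are hypothetical, and the provider string layout (`type=smb,name=...,connectionString=...`) is an assumption based on the commonly documented `CCDLocations` format, so verify it against the current FSLogix configuration reference before use.

```python
# Hypothetical Azure Files shares (assumptions), one per region.
WEST_US_SHARE = r"\\westusstorage.file.core.windows.net\profiles"
EAST_US_SHARE = r"\\eastusstorage.file.core.windows.net\profiles"


def smb_provider(unc_path: str) -> str:
    """Format one Cloud Cache SMB provider entry (assumed string layout)."""
    return f'type=smb,name="FILES SMB PROVIDER",connectionString={unc_path}'


def ccd_locations(closest_share: str, other_share: str) -> list:
    """Build CCDLocations with the closest provider first, the backup second."""
    return [smb_provider(closest_share), smb_provider(other_share)]


# Devices in West US list the West US share first; East US devices reverse it.
west_us_devices = ccd_locations(WEST_US_SHARE, EAST_US_SHARE)
east_us_devices = ccd_locations(EAST_US_SHARE, WEST_US_SHARE)
```

Both regions reference the same two providers; only the order differs, which is what lets either side read the replicated profile after a failover.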

However, there are some drawbacks of using a failover design with Cloud Cache. First, you have to pay extra for storing your profile data in two (2) locations. Second, you have to manually initiate the failover process, which may require the approval of the business stakeholders. Third, you may experience some latency or inconsistency in your profile data due to the asynchronous synchronization to the two storage providers.

Tip

  • Before allowing users to fail back to profiles in the primary location, be sure all users have signed out successfully from the failover location to ensure the primary location has an up-to-date replica of the user's profile data.
  • Cloud Cache is an I/O intensive system and can easily cause network and/or storage bottlenecks in the restored location.
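The sign-out check in the first tip can be expressed as a simple gate. This is a sketch under stated assumptions: the `Session` record and the way sessions are collected are hypothetical, and in practice the session list would come from your virtual desktop platform's own tooling.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical record of one user session on a Session Host."""
    user: str
    region: str
    active: bool


def users_blocking_failback(sessions, failover_region):
    """Return users who still have an active session in the failover region."""
    return sorted(
        {s.user for s in sessions if s.region == failover_region and s.active}
    )


sessions = [
    Session("alice", "East US", active=False),
    Session("bob", "East US", active=True),
    Session("carol", "West US", active=True),
]

# Fail back only when this list is empty.
blocking = users_blocking_failback(sessions, failover_region="East US")
```

In this example only `bob` blocks the failback, because his session in the failover region is still active.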

Figure 2: Cloud Cache (primary / failover) | FSLogix Cloud Cache (CCDLocations)

In the diagram, we have a multi-region Host Pool utilizing Azure Virtual Desktop. Both the primary and failover regions are part of this setup. They each have a dedicated Azure Files share using zone-redundant storage (ZRS), ensuring high availability within the region. The failover region contains Session Hosts, which are either stopped or deallocated. In the event of a disaster, the failover region becomes the primary region. Users will sign in to these Session Hosts and load their replicated profile from the failover region.

However, it’s essential to consider the following:

  • BCDR (Business Continuity and Disaster Recovery) events are rarely graceful. Depending on the circumstances, user profile data may not be guaranteed to be intact.
  • Users signing in to Session Hosts in the failover region could experience data loss or, in worse cases, container corruption.

Given this situation, it’s crucial to use storage platforms like OneDrive or SharePoint for critical data. These platforms provide additional redundancy and protection against data loss. Remember, planning for disaster recovery is essential, and having the right storage strategy can mitigate risks and ensure business continuity.

Option 3: Cloud Cache (active / active)

Active/active designs are common in infrastructure and can also be applied to an FSLogix profile solution. With this option, Cloud Cache is set up with two storage providers that are updated asynchronously to reflect all changes made to the local cache. The storage provider closest to the active location is listed first, while the farthest provider is listed second. In the other location, the order is reversed. This option incurs additional costs for storing profile data in two locations and requires a manual decision by business stakeholders before initiating a failover.

Tip

  • When the failed region is operational again, it may take significant time for the profile data to fully replicate.
  • Cloud Cache is an I/O intensive system and can easily cause network and/or storage bottlenecks in the restored location.

Figure 3: Cloud Cache (active / active) | FSLogix Cloud Cache (CCDLocations)

The diagram shows two (2) AVD Host Pools, each with Session Hosts residing in a specific Azure region. Users assigned to the West US region access only the virtual machines in West US, and users assigned to the East US region access only the virtual machines in East US. During a disaster, the surviving region must have enough capacity to support all the users. Additionally, users from the failed region need access granted to the virtual machines in the surviving region.
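The capacity requirement can be checked with simple arithmetic. The sketch below is illustrative only: the user counts, the users-per-host density, and the host count are made-up numbers, and real sizing depends on your workload.

```python
import math


def hosts_needed(total_users: int, users_per_host: int) -> int:
    """Number of Session Hosts required to seat every user."""
    return math.ceil(total_users / users_per_host)


def surviving_region_shortfall(
    west_users: int,
    east_users: int,
    users_per_host: int,
    hosts_in_surviving_region: int,
) -> int:
    """Extra hosts the surviving region needs to carry users from both regions."""
    required = hosts_needed(west_users + east_users, users_per_host)
    return max(0, required - hosts_in_surviving_region)


# Illustrative numbers: 400 West US users, 300 East US users, 10 users per
# host, and 50 hosts already deployed in the surviving region.
shortfall = surviving_region_shortfall(400, 300, 10, 50)
```

With these numbers the surviving region needs 70 hosts to seat all 700 users, so it is 20 hosts short and would have to scale out before it could absorb the failed region.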

BCDR events are never graceful, and depending on the circumstances of the event, user profile data isn't guaranteed to be intact. Users who sign in to Session Hosts in the surviving region could experience data loss or, at worst, container corruption. This situation amplifies the need to use storage platforms like OneDrive or SharePoint for critical user data.