Fault domain awareness
Failover Clustering enables multiple servers to work together to provide high availability – or put another way, to provide node fault tolerance. But today's businesses demand ever-greater availability from their infrastructure. To achieve cloud-like uptime, even highly unlikely occurrences such as chassis failures, rack outages, or natural disasters must be protected against. That's why Failover Clustering in Windows Server 2016 introduced chassis, rack, and site fault tolerance as well.
Fault domains and fault tolerance
Fault domains and fault tolerance are closely related concepts. A fault domain is a set of hardware components that share a single point of failure. To be fault tolerant to a certain level, you need multiple fault domains at that level. For example, to be rack fault tolerant, your servers and your data must be distributed across multiple racks.
This short video presents an overview of fault domains in Windows Server 2016.
Fault domain awareness in Windows Server 2019
Fault domain awareness is available in Windows Server 2019 but it's disabled by default and must be enabled through the Windows Registry.
To enable fault domain awareness in Windows Server 2019, go to the Windows Registry and set the (Get-Cluster).AutoAssignNodeSite registry key to 1.
(Get-Cluster).AutoAssignNodeSite=1
To disable fault domain awareness in Windows 2019, go to the Windows Registry and set the (Get-Cluster).AutoAssignNodeSite registry key to 0.
(Get-Cluster).AutoAssignNodeSite=0
Benefits
Storage Spaces, including Storage Spaces Direct, uses fault domains to maximize data safety. Resiliency in Storage Spaces is conceptually like distributed, software-defined RAID. Multiple copies of all data are kept in sync, and if hardware fails and one copy is lost, others are recopied to restore resiliency. To achieve the best possible resiliency, copies should be kept in separate fault domains.
The Health Service uses fault domains to provide more helpful alerts. Each fault domain can be associated with location metadata, which will automatically be included in any subsequent alerts. These descriptors can assist operations or maintenance personnel and reduce errors by disambiguating hardware.
Stretch clustering uses fault domains for storage affinity. Stretch clustering allows faraway servers to join a common cluster. For the best performance, applications or virtual machines should be run on servers that are nearby to those providing their storage. Fault domain awareness enables this storage affinity.
Levels of fault domains
There are four canonical levels of fault domains - site, rack, chassis, and node. Nodes are discovered automatically; each additional level is optional. For example, if your deployment doesn't use blade servers, the chassis level may not make sense for you.
Usage
You can use PowerShell or XML markup to specify fault domains. Both approaches are equivalent and provide full functionality.
Important
Specify fault domains before enabling Storage Spaces Direct, if possible. This enables the automatic configuration to prepare the pool, tiers, and settings like resiliency and column count, for chassis or rack fault tolerance. Once the pool and volumes have been created, data will not retroactively move in response to changes to the fault domain topology. To move nodes between chassis or racks after enabling Storage Spaces Direct, you should first evict the node and its drives from the pool using Remove-ClusterNode -CleanUpDisks
.
Defining fault domains with PowerShell
Windows Server 2016 introduces the following cmdlets to work with fault domains:
Get-ClusterFaultDomain
Set-ClusterFaultDomain
New-ClusterFaultDomain
Remove-ClusterFaultDomain
This short video demonstrates the usage of cluster fault domain PowerShell commands.
Use Get-ClusterFaultDomain
to see the current fault domain topology. This lists all nodes in the cluster, plus any chassis, racks, or sites you have created. You can filter using parameters like -Type or -Name, but these are not required.
Get-ClusterFaultDomain
Get-ClusterFaultDomain -Type Rack
Get-ClusterFaultDomain -Name "server01.contoso.com"
Use New-ClusterFaultDomain
to create new chassis, racks, or sites. The -Type
and -Name
parameters are required. The possible values for -Type
are Chassis
, Rack
, and Site
. The -Name
can be any string. (For Node
type fault domains, the name must be the actual node name, as set automatically).
New-ClusterFaultDomain -Type Chassis -Name "Chassis 007"
New-ClusterFaultDomain -Type Rack -Name "Rack A"
New-ClusterFaultDomain -Type Site -Name "Shanghai"
Important
Windows Server cannot and does not verify that any fault domains you create correspond to anything in the real, physical world. (This may sound obvious, but it's important to understand.) If, in the physical world, your nodes are all in one rack, then creating two -Type Rack
fault domains in software does not magically provide rack fault tolerance. You are responsible for ensuring the topology you create using these cmdlets matches the actual arrangement of your hardware.
Use Set-ClusterFaultDomain
to move one fault domain into another. The terms "parent" and "child" are commonly used to describe this nesting relationship. The -Name
and -Parent
parameters are required. In -Name
, provide the name of the fault domain that is moving; in -Parent
, provide the name of the destination. To move multiple fault domains at once, list their names.
Set-ClusterFaultDomain -Name "server01.contoso.com" -Parent "Rack A"
Set-ClusterFaultDomain -Name "Rack A", "Rack B", "Rack C", "Rack D" -Parent "Shanghai"
Important
When fault domains move, their children move with them. In the above example, if Rack A is the parent of server01.contoso.com, the latter does not separately need to be moved to the Shanghai site – it is already there by virtue of its parent being there, just like in the physical world.
You can see parent-child relationships in the output of Get-ClusterFaultDomain
, in the ParentName
and ChildrenNames
columns.
You can also use Set-ClusterFaultDomain
to modify certain other properties of fault domains. For example, you can provide optional -Location
or -Description
metadata for any fault domain. If provided, this information will be included in hardware alerting from the Health Service. You can also rename fault domains using the -NewName
parameter. Do not rename Node
type fault domains.
Set-ClusterFaultDomain -Name "Rack A" -Location "Building 34, Room 4010"
Set-ClusterFaultDomain -Type Node -Description "Contoso XYZ Server"
Set-ClusterFaultDomain -Name "Shanghai" -NewName "China Region"
Use Remove-ClusterFaultDomain
to remove chassis, racks, or sites you have created. The -Name
parameter is required. You cannot remove a fault domain that contains children – first, either remove the children, or move them outside using Set-ClusterFaultDomain
. To move a fault domain outside of all other fault domains, set its -Parent
to the empty string (""). You cannot remove Node
type fault domains. To remove multiple fault domains at once, list their names.
Set-ClusterFaultDomain -Name "server01.contoso.com" -Parent ""
Remove-ClusterFaultDomain -Name "Rack A"
Defining fault domains with XML markup
Fault domains can be specified using an XML-inspired syntax. We recommend using your favorite text editor, such as Visual Studio Code (available for free here) or Notepad to create an XML document that you can save and reuse.
This short video demonstrates the usage of XML to specify fault domains in failover clustering.
In PowerShell, run the following cmdlet: Get-ClusterFaultDomainXML
. This returns the current fault domain specification for the cluster, as XML. This reflects every discovered <Node>
, wrapped in opening and closing <Topology>
tags.
Run the following to save this output to a file.
Get-ClusterFaultDomainXML | Out-File <Path>
Open the file, and add <Site>
, <Rack>
, and <Chassis>
tags to specify how these nodes are distributed across sites, racks, and chassis. Every tag must be identified by a unique Name. For nodes, you must keep the node's name as populated by default.
Important
While all additional tags are optional, they must adhere to the transitive Site > Rack > Chassis > Node hierarchy, and must be properly closed.
In addition to name, freeform Location="..."
and Description="..."
descriptors can be added to any tag.
Example: Two sites, one rack each
<Topology>
<Site Name="SEA" Location="Contoso HQ, 123 Example St, Room 4010, Seattle">
<Rack Name="A01" Location="Aisle A, Rack 01">
<Node Name="Server01" Location="Rack Unit 33" />
<Node Name="Server02" Location="Rack Unit 35" />
<Node Name="Server03" Location="Rack Unit 37" />
</Rack>
</Site>
<Site Name="NYC" Location="Regional Datacenter, 456 Example Ave, New York City">
<Rack Name="B07" Location="Aisle B, Rack 07">
<Node Name="Server04" Location="Rack Unit 20" />
<Node Name="Server05" Location="Rack Unit 22" />
<Node Name="Server06" Location="Rack Unit 24" />
</Rack>
</Site>
</Topology>
Example: two chassis blade servers
<Topology>
<Rack Name="A01" Location="Contoso HQ, Room 4010, Aisle A, Rack 01">
<Chassis Name="Chassis01" Location="Rack Unit 2 (Upper)" >
<Node Name="Server01" Location="Left" />
<Node Name="Server02" Location="Right" />
</Chassis>
<Chassis Name="Chassis02" Location="Rack Unit 6 (Lower)" >
<Node Name="Server03" Location="Left" />
<Node Name="Server04" Location="Right" />
</Chassis>
</Rack>
</Topology>
To set the new fault domain specification, save your XML and run the following in PowerShell.
$xml = Get-Content <Path> | Out-String
Set-ClusterFaultDomainXML -XML $xml
This guide presents just two examples, but the <Site>
, <Rack>
, <Chassis>
, and <Node>
tags can be mixed and matched in several ways to reflect the physical topology of your deployment, whatever that may be. We hope these examples illustrate the flexibility of these tags and the value of freeform location descriptors to disambiguate them.
Optional: Location and description metadata
You can provide optional Location or Description metadata for any fault domain. If provided, this information will be included in hardware alerting from the Health Service.
This short video demonstrates the value of adding location descriptors to fault domains.