
Clustering on your Azure Stack Edge Pro GPU device

APPLIES TO: Azure Stack Edge Pro - GPU; Azure Stack Edge Pro 2

This article provides a brief overview of clustering on your Azure Stack Edge device.

About failover clustering

Azure Stack Edge can be set up as a single standalone device or as a two-node cluster. A two-node cluster consists of two independent Azure Stack Edge devices that are connected by physical cables and by software. When clustered, these nodes work together as a Windows failover cluster to provide high availability for the applications and services that run on the cluster.

If one of the clustered nodes fails, the other node begins to provide service (the process is known as failover). The clustered roles are also proactively monitored to make sure that they’re working properly. If they aren’t working, they’re restarted or moved to the second node.

Azure Stack Edge uses Windows Server Failover Clustering for its two-node cluster. For more information, see Failover clustering in Windows Server.

Cluster quorum and witness

Your Azure Stack Edge cluster must maintain a quorum to remain online in the event of a failure. If a node fails, a majority of the remaining voting members must agree for the cluster to stay online. A majority is only possible when the total number of votes is odd, which is why a two-node cluster needs an additional vote. For more information on cluster quorum, see Understand quorum.

For an Azure Stack Edge cluster with two nodes, if a node fails, then a cluster witness provides the third vote so that the cluster stays online (since the cluster is left with two out of three votes - a majority). A cluster witness is required on your Azure Stack Edge cluster. You can set up the witness in the cloud or in a local fileshare using the local UI of your device.
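The vote arithmetic described above can be sketched as a simple majority check. This is a minimal illustration of the quorum logic, not an Azure Stack Edge API:

```python
def cluster_has_quorum(nodes_up: int, total_nodes: int, witness_up: bool) -> bool:
    """Return True if the surviving votes form a majority.

    Each node contributes one vote; the witness contributes one more.
    Illustrative only -- not device code.
    """
    total_votes = total_nodes + 1          # witness always counts toward the total
    surviving_votes = nodes_up + (1 if witness_up else 0)
    return surviving_votes > total_votes // 2

# Two-node cluster with a witness: losing one node keeps quorum (2 of 3 votes),
# but losing a node and the witness does not (1 of 3 votes).
```

For a two-node cluster, `cluster_has_quorum(1, 2, witness_up=True)` is `True`, which is exactly why the witness is required.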

Infrastructure cluster

The infrastructure cluster on your device provides persistent storage and is shown in the following diagram:

Infrastructure cluster of Azure Stack Edge

  • The infrastructure cluster consists of the two independent nodes running Windows Server operating system with a Hyper-V layer. The nodes contain physical disks for storage and network interfaces that are connected back-to-back or with switches.

  • The disks across the two nodes are used to create a logical storage pool. Storage Spaces Direct on this pool provides mirroring and parity for the cluster.

  • You can deploy your application workloads on top of the infrastructure cluster.

    • Non-containerized workloads such as VMs can be directly deployed on top of the infrastructure cluster.

      VMs workloads deployed on infrastructure cluster of Azure Stack Edge

    • Containerized workloads use Kubernetes for workload deployment and management. A Kubernetes cluster that consists of a master VM and two worker VMs (one for each node) is deployed on top of the infrastructure cluster.

    The Kubernetes cluster allows for application orchestration whereas the infrastructure cluster provides persistent storage.

Supported network topologies

Based on the use case and workloads, you can select how the two Azure Stack Edge device nodes will be connected. Network topologies will differ depending on whether you use an Azure Stack Edge Pro GPU device or an Azure Stack Edge Pro 2 device.

At a high level, supported network topologies for each of the device types are described here.

On your Azure Stack Edge Pro GPU device node:

  • Port 2 is used for management traffic.
  • Port 3 and Port 4 are used for storage and cluster traffic. This includes the storage mirroring traffic and the Azure Stack Edge cluster heartbeat traffic that's required to keep the cluster online.

The following network topologies are available:

Available network topologies

  • Option 1 - Switchless - Use this option when you don't have high speed switches available in the environment for storage and cluster traffic.

    In this option, Port 3 and Port 4 are connected back-to-back without a switch. These ports are dedicated to storage and Azure Stack Edge cluster traffic and aren't available for workload traffic. Optionally you can also provide IP addresses for these ports.

  • Option 2 - Use switches and NIC teaming - Use this option when you have high speed switches available for use with your device nodes for storage and cluster traffic.

    Port 3 and Port 4 on each node are connected through an external switch. On each node, Port 3 and Port 4 are teamed, and a virtual switch with two virtual NICs is created on top of the team, providing port-level redundancy for storage and cluster traffic. These ports can also be used for workload traffic.

  • Option 3 - Use switches without NIC teaming - Use this option when you need an extra dedicated port for workload traffic and port-level redundancy isn’t required for storage and cluster traffic.

    Port 3 on each node is connected through an external switch. Because there's no teaming, the cluster may go offline if Port 3 fails. Separate virtual switches are created on Port 3 and Port 4.

For more information, see how to Choose a network topology for your device node.
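The choice between the three options above reduces to two questions: do you have high speed switches, and do you need port-level redundancy for storage and cluster traffic? The following sketch mirrors that decision logic from the text; the function and its inputs are illustrative assumptions, not a device API:

```python
def choose_topology(have_high_speed_switches: bool,
                    need_port_redundancy: bool) -> str:
    """Pick a network topology for a two-node Azure Stack Edge cluster.

    Illustrative only -- encodes the decision logic described in the
    article, not any actual device interface.
    """
    if not have_high_speed_switches:
        # Port 3 and Port 4 connected back-to-back, dedicated to
        # storage and cluster traffic.
        return "Option 1 - Switchless"
    if need_port_redundancy:
        # Ports 3 and 4 teamed behind an external switch; the teamed
        # ports can also carry workload traffic.
        return "Option 2 - Use switches and NIC teaming"
    # Port 3 carries storage and cluster traffic; Port 4 becomes a
    # dedicated workload port, at the cost of port-level redundancy.
    return "Option 3 - Use switches without NIC teaming"
```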

Cluster deployment

Before you configure clustering on your device, you must cable the devices according to the supported network topology that you intend to configure. To deploy a two-node infrastructure cluster on your Azure Stack Edge devices, follow these high-level steps:

Figure showing the steps in the deployment of a two-node Azure Stack Edge

  1. Order two independent Azure Stack Edge devices. For more information, see Order an Azure Stack Edge device.
  2. Cable each node independently as you would for a single node device. Based on the workloads that you intend to deploy, cross connect the network interfaces on these devices via cables, and with or without switches. For detailed instructions, see Cable your two-node cluster device.
  3. Start cluster creation on the first node. Choose the network topology that conforms to the cabling across the two nodes. The chosen topology would dictate the storage and clustering traffic between the nodes. See detailed steps in Configure network and web proxy on your device.
  4. Prepare the second node. Configure the network on the second node the same way you configured it on the first node. Ensure that the settings for each named port match across the two nodes. Get the authentication token on this node.
  5. Use the authentication token from the prepared node and join this node to the first node to form a cluster.
  6. Set up a cloud witness using an Azure Storage account or a local witness on an SMB fileshare.
  7. Assign a virtual IP to provide an endpoint for Azure Consistent Services or when using NFS.
  8. Assign compute or management intents to the virtual switches created on the network interfaces. You may also configure Kubernetes node IPs and Kubernetes service IPs here for the network interface enabled for compute.
  9. Optionally configure web proxy, set up device settings, configure certificates and then finally, activate the device.

For more information, see the two-node device deployment tutorials starting with Get deployment configuration checklist.

Clustering workloads

On your two-node cluster, you can deploy non-containerized workloads or containerized workloads.

  • Non-containerized workloads such as VMs: The two-node cluster will ensure high availability of the virtual machines that are deployed on the device cluster. Live migration of VMs isn’t supported.

  • Containerized workloads such as Kubernetes or IoT Edge: The Kubernetes cluster deployed on top of the device cluster consists of one Kubernetes master VM and two Kubernetes worker VMs, with one worker VM pinned to each Azure Stack Edge node. Failover results in the failover of the Kubernetes master VM (if needed) and Kubernetes-based rebalancing of pods on the surviving worker VM.

    For more information, see Kubernetes on a clustered Azure Stack Edge device.
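The rebalancing behavior described above can be sketched as follows. One worker VM is pinned per device node; when a node fails, the pods from its worker are rescheduled onto the surviving worker. Node and pod names here are hypothetical, and the real rescheduling is done by Kubernetes, not by code like this:

```python
def fail_node(failed: str, placement: dict[str, list[str]]) -> dict[str, list[str]]:
    """Rebalance pods from the failed node's worker onto the survivor.

    `placement` maps each device node to the pods on its pinned worker VM.
    Illustrative sketch of the behavior only.
    """
    survivors = [node for node in placement if node != failed]
    assert len(survivors) == 1, "two-node cluster: exactly one surviving node"
    survivor = survivors[0]
    # All pods end up on the surviving node's worker VM.
    return {survivor: placement[survivor] + placement[failed]}

# Example (hypothetical names): if node-1 fails, its pods move to node-2.
placement = {"node-1": ["pod-a", "pod-b"], "node-2": ["pod-c"]}
after_failover = fail_node("node-1", placement)
```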

Cluster management

You can manage the Azure Stack Edge cluster via the PowerShell interface of the device, or through the local UI. Some typical management tasks are:

Cluster updates

A two-node clustered device upgrade will first apply the device updates followed by the Kubernetes cluster updates. Rolling updates to device nodes ensure minimal downtime of workloads.

When you apply these updates via the Azure portal, you only have to start the process on one node and both the nodes are updated. For step-by-step instructions, see Apply updates to your two-node Azure Stack Edge device.

Billing

If you deploy an Azure Stack Edge two-node cluster, each node is billed separately. For more information, see Pricing page for Azure Stack Edge.

Next steps