How to minimize the failover time in a Microsoft Hyper-V Failover Cluster

Somnathndy 46 Reputation points
2025-12-01T15:29:20.54+00:00

Hi,

I have installed a Microsoft Hyper-V Failover Cluster on Windows Server 2022 Standard. The cluster contains 2x DL380 Gen11 nodes and 1x HPE MSA2060 FC SAN, directly connected. I created the cluster after a successful validation. The only warning was about a difference in Microsoft update levels, even though I had installed updates through Server Manager until it said "you are fully updated". The cluster has 5 different networks, including a heartbeat network that is directly connected through a cross cable. Active Directory (also on Windows Server 2022) runs separately on a third server, and the nodes have joined the domain. I created 2 VMs on each node and tested failover by powering off node 2: the VM on node 2 took 5 minutes to fail over from node 2 to node 1. I do not think this is normal. Please help.

Thanks in advance.

Regards

Somnath Nandy


  1. Q&A Assist
    2025-12-01T15:29:36.45+00:00
    AI-generated content. This question contains an answer created with AI from Q&A Assist.

    To minimize the failover time in your Microsoft Hyper-V Failover Cluster, consider the following best practices:

    1. Network Configuration: Ensure that your cluster networks are optimized for low latency. Use dedicated networks for cluster heartbeats and live migrations to reduce congestion. This can help improve communication between nodes during failover events.
    2. Heartbeat Settings: Adjust the heartbeat settings for your cluster. The default settings may be too sensitive or not sensitive enough for your environment. You can tune the Delay and Threshold settings to better suit your network conditions. For example, increasing the heartbeat delay can help reduce false positives in node failures.
    3. Storage Performance: Ensure that your storage subsystem (HPE MSA2060 FC SAN) is performing optimally. High I/O performance is crucial for reducing failover times. Consider using shared storage configurations that allow multiple nodes to access the same storage concurrently, which can improve failover performance.
    4. VM Configuration: Review the configuration of your virtual machines. Ensure that they are set up for high availability and that their resources (CPU, memory, etc.) are adequately provisioned. This can help reduce the time it takes for a VM to restart on another node.
    5. Monitoring and Alerts: Implement monitoring tools to keep an eye on the health of your cluster. This can help you identify issues before they lead to longer failover times.
    6. Regular Testing: Regularly test your failover processes to identify any bottlenecks or issues that could be causing delays. This will also help you ensure that your configurations are optimal.

    By following these practices, you should be able to reduce the failover time in your Hyper-V Failover Cluster significantly.
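
    For the "Regular Testing" point, one way to put a number on each test is to poll the clustered VM role while the other node is being powered off. This is only a rough sketch: `TestVM01` is a hypothetical role name, and the stopwatch starts when the script starts, not at the exact moment of the power-off, so start it on the surviving node immediately before pulling power.

    ```powershell
    # Sketch only: run on the surviving node just before powering off the other node.
    # 'TestVM01' is a hypothetical clustered VM role name; replace it with your own.
    Import-Module FailoverClusters

    $vmGroupName   = 'TestVM01'
    $originalOwner = (Get-ClusterGroup -Name $vmGroupName).OwnerNode.Name
    $timer         = [System.Diagnostics.Stopwatch]::StartNew()

    # Poll once per second until the role is online on a different node.
    do {
        Start-Sleep -Seconds 1
        $group = Get-ClusterGroup -Name $vmGroupName
        '{0,6:N0}s  owner={1}  state={2}' -f $timer.Elapsed.TotalSeconds, $group.OwnerNode.Name, $group.State
    } until ($group.State -eq 'Online' -and $group.OwnerNode.Name -ne $originalOwner)

    "Role '$vmGroupName' came online on $($group.OwnerNode.Name) after $([int]$timer.Elapsed.TotalSeconds) seconds"
    ```

    Repeating the same measurement after each configuration change makes it easier to see which change actually moved the failover time.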



  2. VPHAN 10,000 Reputation points Independent Advisor
    2025-12-01T17:04:51.6966667+00:00

    Good morning Somnathndy,

    Your immediate focus should be on the failure detection timeline. A delay of this magnitude typically means the cluster service is taking too long to declare the other node down, which is governed by the heartbeat delay and threshold settings. Check the current configuration using PowerShell: Get-Cluster | fl SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold. The default values (a 1-second delay and, on Windows Server 2016 and later, 10 missed heartbeats on the same subnet) are usually appropriate for a dedicated, isolated heartbeat network. However, if the cluster is not correctly configured to use your cross-cable connection as the primary heartbeat network, it may be relying on a slower, congested production network for node health. Verify this by running Get-ClusterNetwork and ensuring that your heartbeat network has the lowest metric, that "Allow cluster network communication" is enabled for it, and that "Allow clients to connect through this network" is set to "No".
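
    The two checks above could look roughly like this. The network name `Heartbeat` is an assumption; use whatever name Get-ClusterNetwork reports for the cross-cable link, and treat the last line as optional.

    ```powershell
    Import-Module FailoverClusters

    # Heartbeat timing: delay is in milliseconds, threshold is a count of missed heartbeats.
    Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold

    # Cluster networks, lowest metric first.
    # Role: 0 = not used by the cluster, 1 = cluster only, 3 = cluster and client.
    Get-ClusterNetwork | Sort-Object Metric | Format-Table Name, Role, Metric, AutoMetric, State, Address

    # Restrict the heartbeat network to cluster traffic only.
    # 'Heartbeat' is an assumed network name; use the name shown by Get-ClusterNetwork.
    (Get-ClusterNetwork -Name 'Heartbeat').Role = 1
    ```

    Role 1 corresponds to "Allow cluster network communication" without client access in Failover Cluster Manager.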

    Next, examine the storage failover path. The MSA2060 must be configured with proper multipath I/O (MPIO), an appropriate load-balance policy (such as Least Blocks), and fast path failover. On each node, run mpclaim -s -d to view the MPIO disks and ensure at least two active paths are present per LUN. A delay can occur if the surviving node struggles to take ownership of the CSV disks. Review the FailoverClustering and CSVFS event logs in Event Viewer on both nodes around the time of the test, and look for warnings or errors related to storage timeouts, disk arbitration, or resource group failures.
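
    As a rough sketch of those storage and log checks, run on each node shortly after a failover test. The 30-minute window and the `C:\Temp` destination folder are assumptions; the folder must already exist.

    ```powershell
    # Show MPIO disks and their load-balance policies, then the paths of one disk.
    mpclaim -s -d
    mpclaim -s -d 1      # 1 is the MPIO disk number taken from the first command

    # Failover Clustering events from the last 30 minutes in the System log.
    Get-WinEvent -FilterHashtable @{
        LogName      = 'System'
        ProviderName = 'Microsoft-Windows-FailoverClustering'
        StartTime    = (Get-Date).AddMinutes(-30)
    } -ErrorAction SilentlyContinue |
        Sort-Object TimeCreated |
        Format-Table TimeCreated, Id, LevelDisplayName, Message -Wrap

    # Generate the detailed cluster log for the last 30 minutes on every node.
    # C:\Temp is an assumed destination folder that must already exist.
    Get-ClusterLog -Destination 'C:\Temp' -TimeSpan 30 -UseLocalTime
    ```

    Sorting the events by time makes it easier to see the gap between the node being removed from membership and the VM resources coming online.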

    Finally, scrutinize the VM configuration itself. The five minutes could be a sum of the node failure detection time plus the VM save/start operation. Check if the VMs are set to Save state during live migration/failover, as restoring a saved state is slower than a quick failover. Ensure Hyper-V Integration Services are updated to the latest version on all VMs and that you are not using Dynamic Memory for these clustered VMs, as its rebalancing can add overhead. The update warning during validation, while often benign, can sometimes mask a driver or firmware incompatibility. Ensure both nodes have identical HBA drivers, firmware, and Windows updates, as mismatches can cause subtle performance penalties during takeover.
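
    A minimal sketch of these VM and patch-level checks follows. It assumes the Hyper-V and FailoverClusters modules are available, that the cluster has exactly two nodes, and that Get-HotFix can reach the other node remotely.

    ```powershell
    # Run on each Hyper-V node: stop action and Dynamic Memory setting per VM.
    Get-VM | Format-Table Name, State, AutomaticStopAction, DynamicMemoryEnabled

    # Integration services state for every VM on this node.
    Get-VM | Get-VMIntegrationService |
        Format-Table VMName, Name, Enabled, PrimaryStatusDescription

    # Compare installed updates between the two cluster nodes.
    # No output means both nodes report the same set of hotfix IDs.
    $node1, $node2 = (Get-ClusterNode).Name
    Compare-Object (Get-HotFix -ComputerName $node1).HotFixID (Get-HotFix -ComputerName $node2).HotFixID
    ```

    In the Compare-Object output, `<=` marks an update found only on the first node and `=>` one found only on the second.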

    I hope you've found something useful here. If it helps you get more insight into the issue, please consider accepting the answer. Should you have more questions, feel free to leave a message. Have a nice day!

    VPHAN


  3. Somnathndy 46 Reputation points
    2025-12-03T02:14:17.7533333+00:00

    Hello VPHAN,

    Thank you for your kind reply and follow-up. I have tried your advice as follows.

    The default network delay and threshold values were:

    CrossSubnetDelay : 1000, CrossSubnetThreshold : 20, PlumbAllCrossSubnetRoutes : 0 (I do not understand this setting), SameSubnetDelay : 1000, SameSubnetThreshold : 10

    I have now changed them to:

    CrossSubnetDelay : 400, CrossSubnetThreshold : 10, PlumbAllCrossSubnetRoutes : 0, SameSubnetDelay : 250, SameSubnetThreshold : 5

    I have checked the MPIO disks, and the results are:

    MPIO Disk1: 02 Paths, Round Robin with Subset, Implicit Only
    Controlling DSM: Microsoft DSM
    SN: 600C0FF00074BCEC5BF62B6901000000
    Supported Load Balance Policies: FOO RRWS LQD WP LB

    Path ID          State              SCSI Address      Weight
    ---------------------------------------------------------------------------
    0000000077020000 Active/Unoptimized 002|000|000|002   0
      TPG_State : Active/Unoptimized, TPG_Id: 1, : 5

    0000000077010000 Active/Optimized   001|000|000|002   0
    * TPG_State : Active/Optimized  , TPG_Id: 0, : 1

    Disk 0 shows the same result; I have not pasted it here, but I compared the two.

    I have also disabled the VM checkpoints.

    The failover time is currently 3 minutes 53 seconds when I power off the node. However, when I shut down the server through the Start menu, failover takes 5 seconds (measured before the changes above). Live migration of a VM completes within 5 seconds, and if I stop the cluster service through Cluster Manager, failover also completes within 5 seconds.

    When I power off the node, failure detection takes place within 10 seconds.

    I think that after a sudden power-off, the Windows cluster completes some work of its own before failing over, and that is what takes the time.

    Please advise.

    Thanks in advance.

    Regards

    Somnath


  4. VPHAN 10,000 Reputation points Independent Advisor
    2025-12-03T04:35:26.4833333+00:00

    Hi Somnathndy,

    You have pinpointed the exact behavior that differentiates a "Graceful" failover (Shutdown/Stop Service) from a "Dirty" failover (Power Loss).

    The fact that your failover time is hovering around 3 minutes and 53 seconds (very close to 4 minutes) is the key. You are not facing a hardware or network issue anymore; you are running into a default feature of Windows Server 2016/2019/2022 called VM Compute Resiliency.

    When you perform a graceful shutdown, the Cluster knows the node is going away and immediately moves the roles. However, when a node disappears (Power Off), the Cluster enters a "Transient Failure" mode. By default, Windows waits 240 seconds (4 minutes) for the node to come back online before it actually restarts the VMs on the surviving node. This is designed to prevent "flapping" (VMs moving back and forth unnecessarily).

    Your failure detection happens in 10 seconds (as you noted), but the cluster then places the node in the "Isolated" state and its VMs in the "Unmonitored" state, and waits out the timer.
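
    You can watch this happen from the surviving node during a power-off test. A small sketch, assuming the FailoverClusters module is loaded; rerun it every few seconds while the timer is counting down:

    ```powershell
    # Node membership state as seen by the surviving node.
    Get-ClusterNode | Format-Table Name, State

    # Clustered VM roles with their current owner and state.
    Get-ClusterGroup |
        Where-Object GroupType -eq 'VirtualMachine' |
        Format-Table Name, OwnerNode, State
    ```

    The VM roles should stay on the powered-off node until the resiliency timer expires, and only then move to the surviving node.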

    To get the fast failover you want during a power cut, you need to reduce this timer from the default 240 seconds to something much lower (e.g., 10 or 20 seconds).

    1. Check the current settings: Open PowerShell as Administrator on one of the nodes and run:

    Get-Cluster | fl ResiliencyLevel, ResiliencyDefaultPeriod

    2. Reduce the timer: Run the following command to lower the wait time to 20 seconds. This tells the cluster: "If a node vanishes, wait 20 seconds. If it's not back, move the VMs immediately."

    (Get-Cluster).ResiliencyDefaultPeriod = 20

    Note: You generally want this slightly higher than your boot time if you expect random reboots, but for a true HA failover test, 20-30 seconds is appropriate.
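
    To my knowledge, the cluster-wide timer is exposed on the cluster object as ResiliencyDefaultPeriod, while ResiliencyPeriod is the per-VM override on each cluster group. The sketch below shows both, plus the related quarantine settings; `TestVM01` is a hypothetical role name.

    ```powershell
    Import-Module FailoverClusters

    # Cluster-wide compute resiliency and quarantine settings (periods are in seconds).
    Get-Cluster | Format-List ResiliencyLevel, ResiliencyDefaultPeriod, QuarantineThreshold, QuarantineDuration

    # Per-VM view: list the resiliency period of each clustered VM role.
    Get-ClusterGroup |
        Where-Object GroupType -eq 'VirtualMachine' |
        Format-Table Name, ResiliencyPeriod

    # Override the period for a single VM role only.
    # 'TestVM01' is a hypothetical role name.
    (Get-ClusterGroup -Name 'TestVM01').ResiliencyPeriod = 20
    ```

    A per-VM override can be useful if only a few workloads need fast restart after a power loss and the rest can tolerate the default wait.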

    A Note on Your Network Tuning

    I noticed you set your SameSubnetDelay to 250 ms. Caution: This is extremely aggressive. Combined with SameSubnetThreshold = 5, a node is declared down after about 1.25 seconds of missed heartbeats (250 ms × 5). While this detects failure fast, it leaves you vulnerable to "false failovers": if your network has an interruption of a second or two (broadcast storm, spanning tree convergence), the cluster can remove a healthy node from membership.

    Recommendation: Set SameSubnetDelay back to 500 or 1000 (default).

    Your 3m 53s delay was not caused by the heartbeat; it was the resiliency period. Tightening the heartbeat to 250 ms saves only a few seconds of detection time but adds significant instability risk.
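
    As a rough rule, the time before a silent node is declared down is the heartbeat delay multiplied by the threshold. The sketch below computes that from the live settings and then restores the default values reported earlier in this thread (1000/10 for the same subnet, 1000/20 across subnets).

    ```powershell
    Import-Module FailoverClusters
    $cluster = Get-Cluster

    # Approximate time before a silent node is declared down, in seconds.
    $sameSubnetSeconds  = ($cluster.SameSubnetDelay  * $cluster.SameSubnetThreshold)  / 1000
    $crossSubnetSeconds = ($cluster.CrossSubnetDelay * $cluster.CrossSubnetThreshold) / 1000
    "Same subnet : $sameSubnetSeconds s"
    "Cross subnet: $crossSubnetSeconds s"

    # Return to the defaults reported earlier in this thread.
    $cluster.SameSubnetDelay      = 1000
    $cluster.SameSubnetThreshold  = 10
    $cluster.CrossSubnetDelay     = 1000
    $cluster.CrossSubnetThreshold = 20
    ```

    With the tuned values (250 ms and 5) the same-subnet window works out to 1.25 seconds; with the defaults it is 10 seconds.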

    VPHAN


  5. VPHAN 10,000 Reputation points Independent Advisor
    2025-12-08T04:09:17.9933333+00:00

    Hi Somnathndy,

    I have reviewed your previous response and changed my approach accordingly. Please try it and let me know whether it works. Have a nice day!

    VP
