Hi Everyone,
I am facing quite a headache of an issue. I have spent many hours searching Google but still cannot fix it. It is happening in our production Hyper-V environment, so the impact has been quite high until now.
I have 2 Hyper-V clusters (2 nodes each), both running Windows Server 2022 Enterprise. The issue is happening on both clusters with the same symptoms.
Let me describe the detailed settings of each cluster:
- Node: 2 nodes
- Network on each node:
  - 2 NICs for Production network: cluster and client use allowed
  - 2 NICs for Backup network (no gateway): no cluster or client use
  - 2 NICs for heartbeat & Live Migration: cluster use only
- Cluster storage:
  - 1 disk witness
  - 4 Cluster Shared Volumes (CSVs)
- Virtual machines:
  - All VMs store their OS disk on a CSV
  - Some VMs use a virtual HBA to connect to external storage
  - 54 VMs in each cluster
The clusters seemed to work fine after they were created, while they held only a few VMs. But after we added many VMs (migrated from other, older clusters), the problems appeared.
**Problem 1:**
- Starting about 4 months ago, a cluster node sometimes cannot access the other node in the cluster.
Example: node 02b cannot access node 02a via WRM. The only fix so far is to reboot both nodes, but the problem happens again after about one month.
- No firewall is blocking traffic between the nodes.
**Problem 2:**
- CSVs randomly enter a "paused" state (about once per month). All VMs on the paused CSV hang and cannot be live or quick migrated to the other node. The only temporary workaround is to stop the cluster service on either node and keep the remaining node running; everything then works fine (without HA). If we start the cluster service on both nodes, the issue happens again and again. (It looks like a "split brain" has occurred, but I am still not sure.)
These are some clusterLog lines:
`[System] 00003ad4.00003e10::2023/12/15-10:17:13.493 WARN Cluster Shared Volume 'Volume3' ('Cluster Disk 2') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished. This error is usually caused by an infrastructure failure. For example, losing connectivity to storage or the node owning the Cluster Shared Volume being removed from active cluster membership.`
`Line 5752: [System] 00003ad4.00003e10::2023/12/15-10:17:13.505 WARN Cluster Shared Volume 'Volume2' ('Cluster Disk 3') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.`
` Line 23330: [System] 00003b3c.0000436c::2024/01/25-08:08:20.510 WARN Cluster Shared Volume 'Volume4' ('Cluster Disk 5') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.`
` Line 23331: [System] 00003b3c.0000436c::2024/01/25-08:08:23.865 WARN Cluster Shared Volume 'Volume1' ('Cluster Disk 4') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.`
` Line 23332: [System] 00003b3c.0000436c::2024/01/25-08:14:22.629 WARN Cluster Shared Volume 'Volume4' ('Cluster Disk 5') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.`
` Line 23339: [System] 00003b3c.0000436c::2024/01/25-08:18:52.256 WARN Cluster Shared Volume 'Volume1' ('Cluster Disk 4') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.`
` Line 23340: [System] 00003b3c.0000436c::2024/01/25-08:20:33.189 WARN Cluster Shared Volume 'Volume1' ('Cluster Disk 4') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.`
` Line 23351: [System] 00003b3c.0000436c::2024/01/25-08:26:01.025 WARN Cluster Shared Volume 'Volume4' ('Cluster Disk 5') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.`
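To make the timeline easier to read, the pause events can be counted per day and per volume. Below is a minimal Python sketch, assuming every entry follows the same layout as the excerpts above. The names `PAUSE_RE`, `parse_pause_events`, and `pauses_per_day` are made up for illustration, and the sample lines are shortened copies of three of the lines above.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpts above:
#   <timestamp> WARN Cluster Shared Volume '<volume>' ('<disk>')
#   has entered a paused state because of '<status>'
PAUSE_RE = re.compile(
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+WARN\s+"
    r"Cluster Shared Volume '(?P<volume>[^']+)' \('(?P<disk>[^']+)'\) "
    r"has entered a paused state because of '(?P<status>[^']+)'"
)


def parse_pause_events(lines):
    """Return one dict (ts, volume, disk, status) per matching log line."""
    events = []
    for line in lines:
        match = PAUSE_RE.search(line)  # search() skips any "Line NNNN:" prefix
        if match:
            events.append(match.groupdict())
    return events


def pauses_per_day(events):
    """Count pause events per (date, volume) pair."""
    return Counter((e["ts"].split("-")[0], e["volume"]) for e in events)


# Shortened copies of three log lines from this post.
sample = [
    "[System] 00003ad4.00003e10::2023/12/15-10:17:13.493 WARN Cluster Shared Volume 'Volume3' ('Cluster Disk 2') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'.",
    "Line 23330: [System] 00003b3c.0000436c::2024/01/25-08:08:20.510 WARN Cluster Shared Volume 'Volume4' ('Cluster Disk 5') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'.",
    "Line 23332: [System] 00003b3c.0000436c::2024/01/25-08:14:22.629 WARN Cluster Shared Volume 'Volume4' ('Cluster Disk 5') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'.",
]

events = parse_pause_events(sample)
counts = pauses_per_day(events)
for (day, volume), count in sorted(counts.items()):
    print(day, volume, count)
```

To run it against the full log, the `sample` list could be replaced with the lines of the exported cluster log file; the file path and encoding depend on how the log was exported.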
**Problem 3:**
The DNS name issue reappears randomly after it has already been fixed. DNS verification shows no problem:
- nslookup forward/reverse
- Ping via DNS name
`Cluster Network name: 'Cluster Name'`
`Error code: 'DNS bad key.`
`Guidance:`
`Ensure that the network adapters associated with dependent IP address resources are configured with access to at least one DNS server.`
`[System] 00004fb4.000042a8::2023/12/08-10:21:06.625 ERR Cluster network name resource failed to modify the DNS registration.`
**Some other things that could be considered:**
- Acronis backup software backs up the cluster hosts and VMs (daily incremental, weekly full)
- Bitdefender antivirus software
- No network firewall is blocking traffic between the nodes
- The firewall on the cluster nodes is turned off completely
Cluster log uploaded:
[https://1drv.ms/u/s!AgFJBZyCAor4jYkd859skYMHXyTbXg?e=hyujti](https://1drv.ms/u/s!AgFJBZyCAor4jYkd859skYMHXyTbXg?e=hyujti)
Please share your suggestions and ideas for fixing this issue, based on your experience and knowledge. Many thanks.
Br,
Hieu