ReFS filesystem not responding periodically

extratype 1 Reputation point
2022-01-21T09:34:17.11+00:00

Environment:

  • Windows 11
  • Storage Spaces, Simple (no parity), Thin provisioning
  • Virtual disk larger than 100 TB
  • Tens of TB used
  • ReFS 3.7 formatted in Windows 11

The filesystem stops responding for about 1 minute roughly every 25 minutes, according to the Microsoft-Windows-ReFS/Operational event log.

Message:
An IO took more than 30000 ms to complete.

Event ID: 147
Process Id: (any process accessing the volume)
Process name:
File name: (any file)
File offset:
IO Type: (Read, Write, Open, etc.)
IO Size: (varies)
Latency: (30~64 seconds)
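
To check whether the stalls really recur on a fixed period, the Event ID 147 timestamps can be grouped into episodes and the spacing between episodes measured. The following is a minimal Python sketch of that idea, not a tested diagnostic tool. It assumes the timestamps have already been exported from the event log by some other means; the function names, the 5-minute grouping threshold, and the sample timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta


def group_stalls(timestamps, gap=timedelta(minutes=5)):
    """Group event timestamps into stall episodes.

    Events no more than `gap` after the previous event are treated as
    part of the same episode. Returns a list of (first, last) tuples.
    """
    episodes = []
    for ts in sorted(timestamps):
        if episodes and ts - episodes[-1][1] <= gap:
            # Extend the current episode to include this event.
            episodes[-1] = (episodes[-1][0], ts)
        else:
            # Start a new episode.
            episodes.append((ts, ts))
    return episodes


def stall_intervals(episodes):
    """Return the time between the starts of consecutive episodes."""
    return [b[0] - a[0] for a, b in zip(episodes, episodes[1:])]


# Hypothetical timestamps for illustration only, not taken from the real log.
sample = [
    datetime(2022, 1, 21, 9, 0, 5),
    datetime(2022, 1, 21, 9, 0, 40),
    datetime(2022, 1, 21, 9, 25, 10),
    datetime(2022, 1, 21, 9, 25, 30),
    datetime(2022, 1, 21, 9, 50, 2),
]

episodes = group_stalls(sample)
intervals = stall_intervals(episodes)
for start, end in episodes:
    print(f"stall from {start} to {end}")
print("intervals:", [str(i) for i in intervals])
```

If the intervals cluster tightly around one value, that points to a periodic background task rather than load-dependent stalls.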

While the filesystem is hanging, one thread in the System process uses ~100% of a core.
I captured its stack trace with Process Hacker:

ntoskrnl.exe!KiDeliverApc+0x1b6
ntoskrnl.exe!KiApcInterrupt+0x328
ReFS.SYS!CmsRotatingSkipList<_RANGE,SmsAllocationRegionEx,OrderByStartOfRange,RegionLockPolicies>::InlineRebalance+0x98
ReFS.SYS!CmsRotatingSkipList<_RANGE,SmsAllocationRegionEx,OrderByStartOfRange,RegionLockPolicies>::Add+0x1d0
ReFS.SYS!CmsAllocator::SplitRangeOnlyRegion+0x386
ReFS.SYS!CmsAllocator::PinBitmapRegion+0x3176c
ReFS.SYS!CmsAllocator::TryProtectRangeIfDurablyFree+0x9b
ReFS.SYS!CmsThinProvisioning::UnmapWorkItemMethod+0x2e4
ReFS.SYS!<lambda_9a1cd484752ca8e9f6e914453bf80744>::<lambda_invoker_cdecl>+0x15
ReFS.SYS!MspWorkerRoutine+0x46
ntoskrnl.exe!ExpWorkerThread+0x14f
ntoskrnl.exe!PspSystemThreadStartup+0x55
ntoskrnl.exe!KiStartSystemThread+0x34
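
Read from the bottom up, the trace shows a system worker thread running ReFS.SYS!CmsThinProvisioning::UnmapWorkItemMethod and ending in a skip-list rebalance inside the allocator. A small Python sketch like the one below can split frames of the form module!symbol+0xoffset and list only the ReFS.SYS frames in call order. The regular expression and helper names are illustrative assumptions, and the frames are an excerpt copied from the trace above.

```python
import re

# Matches "module!symbol+0xoffset". The greedy symbol group lets template
# arguments and "::" pass through and binds the offset to the last "+0x".
FRAME_RE = re.compile(
    r"^(?P<module>[^!]+)!(?P<symbol>.+)\+0x(?P<offset>[0-9a-fA-F]+)$"
)


def parse_frame(line):
    """Split one stack frame into (module, symbol, offset)."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized frame: {line!r}")
    return m["module"], m["symbol"], int(m["offset"], 16)


# Excerpt of the trace above, innermost frame first.
trace = """\
ntoskrnl.exe!KiApcInterrupt+0x328
ReFS.SYS!CmsRotatingSkipList<_RANGE,SmsAllocationRegionEx,OrderByStartOfRange,RegionLockPolicies>::InlineRebalance+0x98
ReFS.SYS!CmsAllocator::SplitRangeOnlyRegion+0x386
ReFS.SYS!CmsAllocator::TryProtectRangeIfDurablyFree+0x9b
ReFS.SYS!CmsThinProvisioning::UnmapWorkItemMethod+0x2e4
ntoskrnl.exe!ExpWorkerThread+0x14f
"""

frames = [parse_frame(line) for line in trace.splitlines() if line.strip()]

# Keep only ReFS.SYS frames and reverse them so the outermost call is first.
refs_calls = [sym for mod, sym, _ in reversed(frames) if mod.lower() == "refs.sys"]
for depth, sym in enumerate(refs_calls):
    print("  " * depth + sym)
```

The symbol names suggest the busy thread is doing unmap (trim) work for the thin-provisioned volume, but that is an inference from the names alone, not something the trace proves.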

2 answers

  1. Chris 651 Reputation points
    2022-01-21T09:42:46.79+00:00

  2. Limitless Technology 39,356 Reputation points
    2022-01-24T15:50:33.69+00:00

    Hi there,

    DPM uses loopback-mounted VHDs. These appear as normal disks to the OS, so they are displayed in Windows Explorer, Disk Management, and other GUI tools. These tools periodically poll the disks to make sure that they are functioning correctly.

    This causes IOs to be sent down the loopback stack to the ReFS volume. If the ReFS volume is busy, these IOs will have to wait. Therefore, when ReFS performs a long-duration operation, such as flushing or a large block-clone call, these IOs will have to wait longer.

    Here is a link that may help:

    ReFS volume using DPM becomes unresponsive on Windows Server 2016
    https://learn.microsoft.com/en-us/troubleshoot/windows-server/backup-and-storage/refs-volume-dpm-unresponsive
