ReFS filesystem not responding periodically

extratype 1 Reputation point
2022-01-21T09:34:17.11+00:00

Environment:

  • Windows 11
  • Storage Spaces, Simple (no parity), Thin provisioning
  • Virtual disk larger than 100 TB
  • Tens of TB used
  • ReFS 3.7 formatted in Windows 11

The filesystem stops responding for about 1 minute roughly every 25 minutes, according to the Microsoft-Windows-ReFS/Operational event log.

Message:
An IO took more than 30000 ms to complete.

Event ID: 147
Process Id: (any process accessing the volume)
Process name:
File name: (any file)
File offset:
IO Type: (Read, Write, Open, etc.)
IO Size: (varies)
Latency: (30~64 seconds)
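
To check the ~25 minute period more precisely, the event timestamps could be pulled out of an XML export of the log. The following is a minimal sketch, not a verified tool: it assumes the Event ID 147 entries were exported as XML (for example with `wevtutil qe Microsoft-Windows-ReFS/Operational /f:xml`), and that events logged within 120 seconds of each other belong to the same hang. The function names and the 120 second threshold are hypothetical choices made for illustration.

```python
import re
from datetime import datetime, timedelta

# Captures the date and whole seconds of each SystemTime attribute;
# fractional seconds and the trailing "Z" are deliberately ignored.
_TIME_RE = re.compile(r"SystemTime=['\"](\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def stall_starts(xml_text, cluster_seconds=120):
    """Return the first timestamp of each burst of slow-IO events.

    Events closer together than cluster_seconds (an assumed threshold)
    are treated as part of the same hang.
    """
    times = sorted(
        datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
        for stamp in _TIME_RE.findall(xml_text)
    )
    starts = []
    previous = None
    for t in times:
        if previous is None or t - previous > timedelta(seconds=cluster_seconds):
            starts.append(t)
        previous = t
    return starts


def gaps_minutes(starts):
    """Return the minutes between consecutive hang start times."""
    return [(b - a).total_seconds() / 60 for a, b in zip(starts, starts[1:])]
```

Passing the exported XML text to `stall_starts` and the result to `gaps_minutes` gives the spacing between hangs in minutes, which can then be compared with the ~25 minute estimate above.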

While the filesystem hangs, one thread in the System process uses ~100% of a CPU core.
I took a stack trace using Process Hacker:

ntoskrnl.exe!KiDeliverApc+0x1b6
ntoskrnl.exe!KiApcInterrupt+0x328
ReFS.SYS!CmsRotatingSkipList<_RANGE,SmsAllocationRegionEx,OrderByStartOfRange,RegionLockPolicies>::InlineRebalance+0x98
ReFS.SYS!CmsRotatingSkipList<_RANGE,SmsAllocationRegionEx,OrderByStartOfRange,RegionLockPolicies>::Add+0x1d0
ReFS.SYS!CmsAllocator::SplitRangeOnlyRegion+0x386
ReFS.SYS!CmsAllocator::PinBitmapRegion+0x3176c
ReFS.SYS!CmsAllocator::TryProtectRangeIfDurablyFree+0x9b
ReFS.SYS!CmsThinProvisioning::UnmapWorkItemMethod+0x2e4
ReFS.SYS!<lambda_9a1cd484752ca8e9f6e914453bf80744>::<lambda_invoker_cdecl>+0x15
ReFS.SYS!MspWorkerRoutine+0x46
ntoskrnl.exe!ExpWorkerThread+0x14f
ntoskrnl.exe!PspSystemThreadStartup+0x55
ntoskrnl.exe!KiStartSystemThread+0x34

2 answers

  1. Limitless Technology 40,101 Reputation points
    2022-01-24T15:50:33.69+00:00

    Hi there,

    DPM uses loopback-mounted VHDs. These appear as normal disks to the OS, so they are displayed in Windows Explorer, Disk Management (diskmgmt.msc), and other GUI tools. These tools periodically poll the disks to make sure that they are functioning correctly.

    The polling sends IOs down the loopback stack to the ReFS volume. If the ReFS volume is busy, these IOs have to wait. When ReFS performs a long-running operation, such as a flush or a large block-clone call, the wait becomes correspondingly longer.

    Here is a link that may help:

    ReFS volume using DPM becomes unresponsive on Windows Server 2016
    https://learn.microsoft.com/en-us/troubleshoot/windows-server/backup-and-storage/refs-volume-dpm-unresponsive


  2. Chris 656 Reputation points
    2022-01-21T09:42:46.79+00:00

