Troubleshooting Pool Leaks Part 3 – Debugging
In our previous articles we discussed identifying a pool leak with perfmon, and narrowing the source of the leak with poolmon. These tools are often preferred because they are easy to use, provide verbose information, and can be run on a system without forcing downtime. However, it is not always possible to get perfmon and poolmon data. If a system is experiencing poor performance you may have a business need to get the system up and running as quickly as possible without allowing time to troubleshoot. It is also possible to completely exhaust memory through a pool leak, leaving the system in a state where tools such as perfmon and poolmon will not work. In these scenarios it may be possible to troubleshoot the poor performance by forcing a bugcheck, gathering a memory dump, and performing a post mortem analysis.
Although a dump is not the ideal data to troubleshoot a leak, it can be done. I say less than ideal because a dump is a snapshot of the system memory, and does not provide the historical data which perfmon would provide. The lack of historical data makes it difficult to differentiate between high memory usage due to load and high memory usage due to a leak. It is up to you, as the troubleshooter, to determine if the dump is sufficient evidence of a leak. Sometimes identifying the tag and the driver will help you identify a known issue that causes a leak, or your knowledge of the driver architecture may allow you to determine if the memory usage is normal or not. In some scenarios you may decide to start monitoring with perfmon and collect additional data for a future occurrence.
The first step to debug a pool leak using a dump is to load the dump in windbg, set the symbol path, and reload symbols.
1: kd> .symfix c:\symbols
1: kd> .reload
Loading Kernel Symbols
...............................................................
................................................................
.....
The !vm command will show memory utilization, the 1 flag will limit the verbosity of this command. For the scenario of a pool leak, the significant values are “NonPagedPool Usage:” and “PagedPool Usage:”. If the debugger identifies a value that is out of the normal range it will flag it, and we can see here that the debugger has flagged excessive nonpaged pool usage. This is similar to the information we obtained in Part 1 using perfmon, but unlike perfmon we do not have trend data to indicate if this is temporary high pool usage due to load or if this is a leak.
1: kd> !vm 1
*** Virtual Memory Usage ***
Physical Memory: 403854 ( 1615416 Kb)
Page File: \??\C:\pagefile.sys
Current: 1048576 Kb Free Space: 1015644 Kb
Minimum: 1048576 Kb Maximum: 4194304 Kb
Available Pages: 106778 ( 427112 Kb)
ResAvail Pages: 225678 ( 902712 Kb)
Locked IO Pages: 0 ( 0 Kb)
Free System PTEs: 33533355 ( 134133420 Kb)
Modified Pages: 4844 ( 19376 Kb)
Modified PF Pages: 4838 ( 19352 Kb)
NonPagedPool Usage: 155371 ( 621484 Kb)
NonPagedPool Max: 191078 ( 764312 Kb)
********** Excessive NonPaged Pool Usage *****
PagedPool 0 Usage: 27618 ( 110472 Kb)
PagedPool 1 Usage: 3848 ( 15392 Kb)
PagedPool 2 Usage: 299 ( 1196 Kb)
PagedPool 3 Usage: 283 ( 1132 Kb)
PagedPool 4 Usage: 344 ( 1376 Kb)
PagedPool Usage: 32392 ( 129568 Kb)
PagedPool Maximum: 33554432 ( 134217728 Kb)
Session Commit: 7764 ( 31056 Kb)
Shared Commit: 6371 ( 25484 Kb)
Special Pool: 0 ( 0 Kb)
Shared Process: 5471 ( 21884 Kb)
PagedPool Commit: 32394 ( 129576 Kb)
Driver Commit: 2458 ( 9832 Kb)
Committed pages: 326464 ( 1305856 Kb)
Commit limit: 665998 ( 2663992 Kb)
The debugger can parse the pool tagging database and present similar information as poolmon. The !poolused command will do this, the /t5 option will limit output to the top 5 consumers, the 2 flag will sort by nonpaged pool usage (use the 4 flag if your leak is in paged pool).
1: kd> !poolused /t5 2
..
Sorting by NonPaged Pool Consumed
NonPaged Paged
Tag Allocs Used Allocs Used
Leak 601 615424000 0 0 UNKNOWN pooltag 'Leak', please update pooltag.txt
Pool 6 1717840 0 0 Pool tables, etc.
nVsC 664 1531552 0 0 UNKNOWN pooltag 'nVsC', please update pooltag.txt
netv 4369 1172224 1 144 UNKNOWN pooltag 'netv', please update pooltag.txt
Thre 607 774048 0 0 Thread objects , Binary: nt!ps
TOTAL 43424 634209952 63565 126487760
The above output shows that the tag “Leak” is associated with almost all of the nonpaged pool usage. This is the same information we obtained in Part 2 using poolmon.
Now we must identify what drivers use the pool tag “Leak”. Because we have a snapshot of the system memory we can search the dump for this tag. We can match each address to a module using the command lm a.
1: kd> !for_each_module s -a @#Base @#End "Leak"
fffff880`044b63aa 4c 65 61 6b 3b c1 0f 42-c1 41 8d 49 fd 8b d0 ff Leak;..B.A.I....
fffff880`044b6621 4c 65 61 6b 3b c1 0f 42-c1 33 c9 8b d0 ff 15 cc Leak;..B.3......
1: kd> lm a fffff880`044b63aa
start end module name
fffff880`044b5000 fffff880`044bc000 myfault (no symbols)
1: kd> lm a fffff880`044b6621
start end module name
fffff880`044b5000 fffff880`044bc000 myfault (no symbols)
The tag and driver name can be used to search the internet for known problems. If a known issue is found a driver update may be available, and installing this update may prevent a future memory leak.
If there are no updates available for the driver, or if this is your driver and you need to identify the cause of the leak, don’t panic. In future articles we will show techniques for getting call stacks of pool allocations, these call stacks be used to identify under what conditions the driver leaks memory.
Comments
Anonymous
August 31, 2012
There may be also scenarios where you don't have system prepared for manual memory dump capture and you need to collect the dump file. In cases like this, LiveKd from Sysinternals is a great tool of choice. The latest "mirror dump" (-m) parameter for Vista and above is simply awesome as it solves some drawbacks of using this tool (data consistency etc.). Regarding !vm in windbg and the "excessive nonpaged pool usage" - there is one known bug in kdexts.dll (support.microsoft.com/.../2509968) which causes the extension to report some non-realistic numbers. This may lead to false positives for users with little debugging/memory management experience.Anonymous
April 27, 2015
Great article - to the point. Helped me a lot in debugging a buggy pool allocation.