Troubleshooting Pool Leaks Part 1 – Perfmon
Over the years the NTDebugging Blog has published several articles about pool memory and pool leaks. However, we haven’t taken a comprehensive approach to understanding and troubleshooting pool memory usage. This upcoming series of articles is going to tackle pool leaks from the basics to advanced troubleshooting techniques. Most of the examples will use the Windows Sysinternals tool NotMyFault to generate a leak so our readers will be able to reproduce the described behavior and repeat the troubleshooting steps.
We need to start by understanding what pool is and how it is used. Pool is virtual memory that is used by drivers in much the same way user mode applications use heap. A driver developer calls ExAllocatePoolWithTag to get a block of memory that can be used in much the same way a user mode programmer would use memory returned by HeapAlloc or malloc. The memory manager, which is responsible for managing pool, is able to efficiently handle small allocations by taking a page of virtual memory (typically 4KB) and breaking it up into smaller blocks. The memory manager is also able to allocate pool in blocks larger than a page. There are two types of pool a developer can request from ExAllocatePoolWithTag: paged pool and nonpaged pool. As the names suggest, one type of pool memory can be paged out and the other cannot. Paged pool is used for most allocations; nonpaged pool is used for memory that will be written or read at an IRQL of DISPATCH_LEVEL or above.
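To make the analogy concrete, here is a minimal sketch in kernel-mode C of how a driver might pair an allocation with its free. It is not taken from any real driver; the WORK_CONTEXT structure, the routine names, and the "Exmp" tag are invented for illustration.

#include <ntddk.h>

#define EXAMPLE_TAG 'pmxE'   // hypothetical tag; displays as "Exmp" in pool-tracking tools

typedef struct _WORK_CONTEXT {
    LIST_ENTRY ListEntry;
    ULONG RequestId;
} WORK_CONTEXT, *PWORK_CONTEXT;

NTSTATUS AllocateWorkContext(ULONG RequestId, PWORK_CONTEXT *Context)
{
    // NonPagedPool is requested because this structure will be touched at DISPATCH_LEVEL.
    PWORK_CONTEXT ctx = (PWORK_CONTEXT)ExAllocatePoolWithTag(NonPagedPool, sizeof(WORK_CONTEXT), EXAMPLE_TAG);
    if (ctx == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    RtlZeroMemory(ctx, sizeof(WORK_CONTEXT));
    ctx->RequestId = RequestId;
    *Context = ctx;
    return STATUS_SUCCESS;
}

VOID FreeWorkContext(PWORK_CONTEXT Context)
{
    // Every successful allocation must eventually be matched by a free with the same tag.
    ExFreePoolWithTag(Context, EXAMPLE_TAG);
}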
Pool leaks happen when a driver calls ExAllocatePoolWithTag but never calls the corresponding ExFreePool or ExFreePoolWithTag routine. A leak is different from high memory utilization, which may happen in normal conditions as load increases. For example, the srv.sys driver creates work items for incoming requests, and when there is a large amount of SMB traffic to a server the pool usage from srv.sys may increase to handle this traffic. Typically, what differentiates a leak from load-related memory usage is that a leak never decreases, whereas usage that is driven by load should decrease when the load is reduced. Monitoring is required to differentiate between these two scenarios, and Performance Monitor (aka perfmon) is typically the most effective tool to begin such an investigation.
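For contrast, the following hypothetical sketch shows the leak pattern described above, loosely in the spirit of what NotMyFault does when asked to leak pool: the allocation succeeds, the pointer goes out of scope, and ExFreePoolWithTag is never called. The routine name and the "Leak" tag are invented for this example.

#include <ntddk.h>

#define LEAK_TAG 'kaeL'   // hypothetical tag; displays as "Leak" in pool-tracking tools

VOID LeakOnePoolBlock(VOID)
{
    // One page of nonpaged pool is allocated and used...
    PVOID buffer = ExAllocatePoolWithTag(NonPagedPool, 4096, LEAK_TAG);
    if (buffer != NULL) {
        RtlFillMemory(buffer, 4096, 0xAB);
    }
    // ...but ExFreePoolWithTag(buffer, LEAK_TAG) is never called, so every
    // call to this routine permanently consumes another 4KB of nonpaged pool.
}

If a routine like this runs per request or on a timer, nonpaged pool usage grows steadily and never comes back down, which is exactly the perfmon pattern discussed below.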
The symptom of a pool leak is often poor system performance when the system runs out of pool, or on 64-bit systems the pool may begin to consume a large amount of the available memory. This symptom makes perfmon an ideal tool to begin troubleshooting as it can be used to identify a wide variety of potential causes of poor performance. Perfmon is most useful when it is started before a system enters a state of poor performance so that trend data can be analyzed leading up to the problem.
You can use the below commands from an elevated command prompt to collect perfmon data from such a scenario.
First create the data collector. This command collects data from a variety of counters at a 5 minute interval and is designed to be run for several hours prior to and during the time a system experiences poor performance (shorter intervals can be used for leaks that happen faster than several hours). We often recommend collecting these counters to perform general performance troubleshooting, because we usually don’t know that there is a memory leak until after this data is collected and analyzed.
Logman.exe create counter PerfLog-Long -o "c:\perflogs\%computername%_PerfLog-Long.blg" -f bincirc -v mmddhhmm -max 300 -c "\LogicalDisk(*)\*" "\Memory\*" "\Cache\*" "\Network Interface(*)\*" "\Paging File(*)\*" "\PhysicalDisk(*)\*" "\Processor(*)\*" "\Processor Information(*)\*" "\Process(*)\*" "\Redirector\*" "\Server\*" "\System\*" "\Server Work Queues(*)\*" "\Terminal Services\*" -si 00:05:00
Then start collecting data:
Logman.exe start PerfLog-Long
When the performance problem is being experienced, stop collecting data:
Logman.exe stop PerfLog-Long
After you have collected the data, open the .blg file in the Performance Monitor MMC snap-in. Browse to the Memory object, and add the counters “Pool Nonpaged Bytes” and “Pool Paged Bytes”. This should leave you with a view similar to the below screenshot.
The steadily increasing line in the above screenshot, without a substantial decrease, is an indicator that nonpaged pool memory is being leaked. If we look at the maximum value of the counter we see that nonpaged pool has consumed 540MB. The significance of this value depends on the amount of RAM in the system. In this instance the system has 1GB of RAM, so nonpaged pool is consuming 54% of the available memory. We can now conclude that the cause of the performance problem is a nonpaged pool memory leak, which is consuming a large amount of RAM and preventing other components from using this RAM.
Next we need to start investigating which driver has allocated the most pool. We will begin that in part 2.
Comments
Anonymous
July 31, 2012
Good article, when is part 2 coming out? [Thanks. Probably later this month.]
Anonymous
July 31, 2012
Looking forward to the entire series!
Anonymous
August 01, 2012
"Have you seen our article on tracking down MmSt usage?" I don't think I have tried that. Thanks for the pointer!Anonymous
August 02, 2012
@LeoDavidson Here is some additional stuff which you may find helpful: blogs.technet.com/.../getting-to-know-the-mmst-pool-tag.aspx Speaking about pool leak scenarios, you have a big advantage here because you can reproduce the error anytime, as far as I understood. This dramatically speeds up your problem resolution. Good luck!
Anonymous
September 02, 2012
@Leo use xperf to see what causes the MmSt usage. Run xperf -on BASE+Pool -stackwalk PoolAlloc+PoolFree -buffersize 1024 -MaxFile 512 -FileMode Circular && timeout -1 && xperf -d trace_pool_usage.etl and open trace_pool_usage.etl in xperfview, look at the pool usage graphs, load symbols, and look in the summary table to see which call stack is responsible for the MmSt usage.
Anonymous
January 17, 2013
Note: One needs to place square brackets around the interval at the end of the command, i.e. [00:05:00], on Server 2003. [Good catch, thanks.]