PerfView Hard Core CPU Investigations (Using CPU Counters) on Windows 10

One of the bread-and-butter investigations that people use the PerfView profiling tool for is a CPU investigation.   Indeed when you type 'PerfView collect' it collects the information needed for a CPU sampling information needed for such an investigation.   This CPU information consists of a stack taken every millisecond on each processor on the machine.   This gives you a very good idea where you are spending your CPU time.

This is all 'standard' stuff.   It turns out that the CPU chips have the ability do do sampling based on other things besides time (1 msec intervals).   The most useful of these are

  • After N instructions were executed (Instructions Retired)
  • After N cache misses (CacheMisses, LLCCachMiss, LLCCacheReference (LLC = Last Level Cache.  The 'biggest' cache before you get to DRAM memory)
  • After N branch mispredictions (the processor is always guessing (based on previous experience) which way a branch goes, and this is how often it gets it wrong).

First why would you care?  Don't we care about time anyway?    And the answer is yes, we mostly care about time, but sometimes it is hard to understand why the time is being spent the way it is, and this extra information helps you figure out why.   This is where CPU counters come in.

These 'CPU Counters' are most useful when the important part of your program is CPU bound, and you have picked the best algorithms already, so now it is about having the CPU execute these instructions as fast as possible.    Basically you are looking for 'slow' instructions to see if you can speed them up.   Instructions can be slow because

  • It guessed wrong on which way a branch would push execution.
  • It needs data that is not in the first level cache (a cache miss).

These CPU counters can add valuable insight in figuring out why instructions are slow.  Thus we still care about time, but this extra information can be helpful.

For a while now PerfView could collect these CPU counters.   They are accessed via the 'Ctrs' button in the 'Advanced Options' section of the PerfView collection dialog box.    However until recently this was not very useful because of two limitations

  1. These CPU counters were not available in a Virtual Machine (thus if you were running in the cloud you could not get them) and
  2. If you had windows 'HyperVisor' on (that is your machine could host another virtual machine).  CPU Counters were also unavailable.

Well, as of Windows 10 RS5 release, these restrictions have been lived.  Thus it is easier get at them.    Hurray!

Accessing them works as it always did.   Basically the 'Cpu Ctrs' text box is a list of 'CounterName:Number' pairs.   Thus putting InstructionRetired:1000000 asks the machine to log a stack every time 1 million instructions have been executed.      The helpful 'Ctrs' button opens a drop-down menu that shows all the counters that are available on your machine, and a good count for that particular counter (you want to pick a number so that you get about 1000 of these stacks a second.   Sometimes you have to try a value, see what happens for your scenario and then change it so the result is in that range.   Too high and the overhead is too high.   Too low and you don't get the detail you want.  Typically 1000/sec is a good compromise.

Looking at the data:

Once you added things to the 'Cpu Ctrs' textbox, then when you view the resulting data you should have new events called PMCSample (PMC = Precise Machine Counter).   You view them in the 'Advanced Group -> Any Stacks' view.  This view shows events of all types so typically you want to filter to just the PMCSample events (right click -> Include Item).  This shows you were that counter metric is being 'spent' (e.g. what methods are executing many instructions, or causing many Cache misses).

Multiple Counters:

You CAN have up to 4 counters logging simultaneously, but I would start off collecting them individually.  The reason is that all the 4 counters log the same PMCSample event.  There is a 'Profile Source' field on the event (it is a small integer value) that uniquely identifies the particular counter (e.g. 26 == InstructionsRetired),  and PerfView DOES show this number, but PerfView is currently not particularly friend about telling you the mapping from the number to the name.   (you have to look in PerfViews log (lower right corner) to find the mapping from the name to the number).   If you only turn on one, you don't have this problem.

Good Uses for the Counters

Believe it or not the 'InstructionsRetired'  is one of the more useful CPU counters.   The reason is that it is CPU bound code can be slow because either

  1.  Many instructions are being executed OR
  2. The instructions are slow (Because of cache misses or whatever).

The first issue is very straightforward (there is a very clear relationship between the code the number of instructions it executes), but the second is subtle (it depends typically on how 'local' your data is).   Because the the solutions for these two are very different, you want to know which one it is.  InstructionsRetired can do this.

For example suppose  you have a 10% TIME regression on some CPU bound benchmark.   Well you could collect InstructionsRetired data before and after, and see if you have 10% more instructions being executed, or whether you have about the same number of instructions (they are just slower).  This tells you whether you algorithm is simply running more code, or whether it is likely touching memory in a way that it is inefficient.

This is ESPECIALLY helpful for MULTIPROCESSOR algorithms.    It is very easy for many processors acting on the same data to 'interfere' with each other (slowing things down).  Before you can know that that is the problem, you have to rule out the 'easy' case that something is just running more instructions.

Once you determine that indeed you are not executing more instructions but they are slower, the stacks show you which methods.   You can then collect other CPU counters like DCache miss rates, and BranchMisprediction data (again before and after) to see what has changed.   This lets you figure out how best to improve the situation.

So there you have it.   If you have a hard core CPU problem to optimize, having CPU counters is very helpful, and now with the improvements in Windows 10 RS5, it is easy to collect the data.