From the November 2001 issue of MSDN Magazine.

MSDN Magazine

NET CLR Profiling Services: Track Your Managed Components to Boost Application Performance

Anastasios Kasiolas

This article assumes you're familiar with COM, C++, and the .NET Framework

Level of Difficulty     1   2   3 

Download the code for this article: NetProf.exe (41KB)

SUMMARY The Microsoft .NET platform provides you with a rich set of services for building profilers and application monitors for applications targeting the Common Language Runtime (CLR). These services expose runtime events that occur during the execution of a .NET application. They can be used to obtain information about managed code being executed under the runtime.
      This article describes the .NET CLR Profiling Services and shows how to use the services to build a simple profiler that will provide hot spot information for any .NET application. The sample profiler can easily be modified to suit other profiling and monitoring needs.

The Microsoft® .NET Framework provides a wealth of services for building debuggers, symbol browsers, application monitors, and profilers for applications targeting the Common Language Runtime (CLR). A profiler can help you write better, more efficient, and faster .NET code. It can also help tune existing applications and focus your efforts on the sensitive areas that show problematic performance. In this article I'll describe the design of a CLR profiler, reveal some common pitfalls, and offer implementation hints. The sample code, which can be downloaded from the link at the top of this article, is a CLR profiler that identifies hot spots in a .NET application. The code is simple to read and expandable. I encourage you to start using the CLR Profiling API, experimenting with it, and using it for your everyday tasks.
      I will assume that you are familiar with the .NET platform and understand the difference between managed and unmanaged code. If not, you should take a look at Jeff Richter's recently published MSDN Magazine articles for an excellent introduction to the .NET platform (see the .NET Article Index ). This article doesn't assume that you are already familiar with the .NET CLR Profiling Services. However, I recommend that you read the CLR Profiling documentation that is included in the .NET Framework SDK.
      Please note that the information contained in this article is based on prerelease software and is subject to change.

.NET CLR Profiling Services

      Suppose you have an application that contains managed (.NET) and unmanaged components. To acquire a complete and accurate picture of what takes place during the execution of the application, you need to gather information for both the managed and the unmanaged pieces. The traditional ways to instrument unmanaged code and collect profiling information are not easily applied to the .NET components, and they would require you to have some knowledge of the internal workings of the CLR. Fortunately, the .NET CLR Profiling Services provide support for this part of the puzzle and make it fairly easy to find out a lot of what is going on with your .NET application. The .NET CLR Profiling Services consist of two COM interfaces:ICorProfilerCallback and ICorProfilerInfo.
      The ICorProfilerCallback interface provides a mechanism for the CLR to notify a profiler when interesting events occur. If you are going to use the CLR Profiling Services to implement a profiler or application monitor for .NET programs, you must provide an implementation for the ICorProfilerCallBack interface. The set of notifications provided by the callback interface, rather the CLR, are listed in Figure 1.
      In contrast, the ICorProfilerInfo interface is implemented by the CLR and is used to provide supplemental information. This supplemental information includes, but is not limited to, names of application domains, assemblies, managed modules and classes, size of objects and IDs, and handles to managed threads. A much more detailed description of both interfaces can be found in the profiling specification.
      To illustrate how the two interfaces work together, suppose that a profiler is active and a notification for a managed to unmanaged code transition is received via the callback ICorProfilerCallback::ManagedToUnmanagedTransition. In addition, suppose you want to find out on which managed thread the transition is occurring. The parameters of the callback don't provide that information. This is where the ICorProfilerInfo interface becomes useful (see Figure 2).
      Notice that the profiler directly calls ICorProfilerInfo::GetCurrentThreadID using a pointer to an instance of the ICorProfilerInfo interface (m_pProfilerInfo). Note that a CLR profiler can get an instance of ICorProfilerInfo only from the ICorProfilerCallback::Initialize callback.

Enabling the .NET CLR Profiling Services

      Profiling under the CLR is enabled on a per-process basis. For that purpose, it uses two environment variables: Cor_Enable_Profiling and Cor_Profiler. Cor_Enable_Profiling explicitly tells the CLR that every process should be profiled and that a profiler needs to be instantiated when the process starts up. Cor_Profiler specifies which one of the registered profilers will be used. Here's an example of how to use these variables:

  set Cor_Enable_Profiling=0x1
set Cor_Profiler={FA0C6E77-AFF4-4c74-95F9-6216AFDE992C}


      It is possible to have more than one profiler registered at the same time. Every profiler has to be previously registered and must provide a unique GUID. You can have multiple profilers registered, but the one used will be uniquely specified by its GUID in the Cor_Profiler environment variable. Also notice that it is possible to have different profilers enabled in different environments simultaneously. For more details on how to do this, see the Help section in recent releases of the .NET Framework SDK under "Enabling Profiling and Debugging."
      Let's see what is actually happening behind the scenes. The profiler must be implemented as an inproc COM server, which is basically a DLL mapped into the same address space as the process that is being profiled. When Cor_Enable_Profilng is set, the CLR creates an instance of the profiler by calling CoCreateInstance for the profiler specified by Cor_Profiler. This is why the profiler DLL needs to be previously registered.
      Subsequently, the CLR sends the very first notification to the newly created server, which is nothing but an ICorProfilerCallback::Initialize callback. This is the point where the profiler has to specify which events it wants to monitor and get an ICorProfilerInfo interface pointer that it can use during the program's execution to acquire additional information, if needed. Limiting the number of events I want the profiler to monitor means I can build a simpler, specialized profiler and at the same time reduce the amount of CPU time that the CLR spends in sending notifications that I am not really interested in. These fine points will become clearer in the following sections.
      At this point, you should have a clearer idea about what the .NET CLR Profiling Services are, how CLR profiling can be enabled, and what is actually taking place when a managed application being profiled starts executing. In the next section I'll present some interesting, real-life scenarios where CLR profiling would be a valuable asset in a developer's arsenal.

Interesting CLR Profiling Scenarios

      There are numerous things that a CLR profiler can provide information about, as described in Figure 1. Depending on the type of the application and which aspect of the CLR operation you are interested in, a tool based on the .NET CLR Profiling Services offers the necessary insight into the CLR's operation and execution. Let's look at some scenarios that could be implemented using the CLR Profiling API.
      .NET CLR Profiling Services can maintain a shadow stack of a managed application. In this case, the profiler needs to receive the enter-leave function events and transition events from the CLR and then organize them on a per-thread basis. This information can be used at the end of the execution for different purposes. For example, the profiler can produce a call graph or print information about managed exceptions or code transitions. The key to remember is that the profiler has to ensure thread safety when accessing its data members and its internal data structures, since the profiling API does not offer any guarantees.
      You can use .NET CLR Profiling Services to change the Microsoft Intermediate Language (MSIL) of a managed function. One of the most powerful capabilities of the profiling API is the ability to modify the IL code of a managed method before the just-in-time (JIT) compiler compiles it. The profiler accesses the MSIL of a method, alters it, and then asks the JIT compiler to consume the new code. This scenario can be used for instrumentation of a managed application if the profiler inserts the appropriate hooks into every method.
      The inserted code can be something as simple as increasing a counter or making a call to a native or even a managed function. In fact, this is the only point where a profiler is allowed to make calls into managed code. In case you make a function call, you should pay attention to the calling convention because if there is stack corruption the JIT compiler can crash. The profiler can modify the MSIL before or after the compilation and ask the CLR to recompile the specific method. A direct call into managed code from any other profiler callback, or when using any other mechanism, could result in a hang or crash of the execution engine, and therefore should be avoided.
      One of the most interesting mechanisms of the CLR is the automatic disposal of dead object references, known as garbage collection (GC). .NET CLR Profiling Services can monitor garbage collection and object allocation. The profiler receives a notification not only when an object gets allocated, but also when a GC takes place. An interesting scenario would be to track how many objects get allocated per class and which objects are surviving every GC. In this way, the profiler is able to build an object graph during the program's execution.
      Managed exception handling is one of the most complicated mechanisms of the CLR, and .NET CLR Profiling Services can also monitor managed exceptions. Exceptions during the execution of a program are always time-consuming, and a profiler should be able to report them when they occur. The profiling API offers an extensive set of callbacks that depict in detail the search, unwind, and finally phases of the exception handling cycle. An interesting profiling scenario would monitor managed exceptions and provide additional information such as the Win32 thread that threw the exception, the function that was on the top of the stack, the arguments of that function, and the locals that were in scope at the point where the exception occurred. Additional information could include the functions involved in the search, unwind, and finally phases. In order to implement this scenario, you should use the in-process debugging API and the unmanaged symbol store API. For more details on their usage, you should check the .NET Framework SDK documentation.
      .NET CLR Profiling Services can monitor context boundary crossing scenarios. Remoting calls are time-consuming and sometimes it is very important to know when they are taking place while running an application. Important parameters to monitor in remoting include:

  • Round-trip time of a request (synchronous or asynchronous)
  • Request service time versus load (synchronous or asynchronous)
  • Idle time for client (while waiting for reply)

      .NET CLR Profiling Services can also be used to identify hot spots in your application—namely, where most of the execution time is spent. As mentioned before, improving application performance is a key reason to use profiling. The first step of that process is to identify which part of the application is performing poorly and investigate how things can be done more efficiently. An interesting profiling scenario would be to keep track of the execution time spent in every function and persist that information for later analysis. This task is not trivial since you need to consider the overhead caused by the profiling process, the time spent in the callees of a method, the time that the CLR has suspended a thread, and so on. I'll discuss this scenario in the following sections of this article.

Implementing a CLR Profiler

      Now let's go through a step-by-step tutorial on how to design a CLR profiler. Although I will use the example of the "Hotspots Profiler Tracker" that I've just described, there are some common design steps you need to consider regardless of the nature and functionality of the profiling tool you want to design. The design of a profiler could be broken into two steps. First, implement the COM server component, and then implement the ICorProfilerCallback interface.
      As I walk you through these design steps, I'll point out the pitfalls you may run across when building a profiler from scratch. All the source code for my sample profiler is available at the link at the top of this article. With some minor modifications you can use my example to build your own special-purpose profiler.

Implementing the COM Server Component

      The implementation of the ClassFactory for the COM server is pretty simple. My server is always in-process and I do not need to implement any logic for the LockServer method. The declaration for my class factory is shown in Figure 3, and the implementation can be found in the classfactory.hpp file available in the download.
      Next I implemented DllMain and exported DllGetClassObject, DllRegisterServer, and DllUnregisterServer. The implementation of DllMain is shown in Figure 4. Notice here that I am using special handling when I receive a DLL_PROCESS_DETACH event. I have saved a pointer to the ICorProfilerCallback interface, which is set to NULL when the profiler receives an ICorProfilerCallback::Shutdown event. There are certain cases when the CLR is terminating and shutting down in a peculiar way and the profiler never receives the shutdown event. This situation can result in major leaking of the profiler resources that have been allocated during the execution of the managed application.
      To avoid that pitfall, I detect the boundary conditions and send a pseudo-shutdown event to the profiler. In this way, the profiler can release the memory that was allocated and also flush its data to the output stream. The above mentioned condition can occur when an unmanaged client is making calls to a managed server. The CLR has no knowledge of the state in the unmanaged program and this causes an abrupt shutdown at the end of the execution. Another scenario of abnormal shutdown is when an application throws an unhandled exception that causes the CLR to terminate. In the implementation of DllMain, if the shutdown event arrives as expected, the pointer g_pCallbackObject will be NULL so it skips my handler. The full implementation of the above functions can be found in dllmain.hpp.
      The components I've discussed can be reused no matter what your profiler needs to monitor. The only thing you need to do is assign a new GUID to your profiler by redefining the following:

  // profiler GUID definition
#define PROFILER_GUID "{FA0C6E77-AFF4-4c74-95F9-6216AFDE992C}"
extern const GUID __declspec( selectany ) CLSID_PROFILER =
{ 0xfa0c6e77, 0xaff4, 0x4c74, { 0x95, 0xf9, 0x62, 0x16, 0xaf,
0xde, 0x99, 0x2c } };


Implementing ICorProfilerCallback

      As I mentioned, each profiler has to fully implement the ICorProfilerCallback interface even if it is not using all the events. Therefore, I can further divide this step in two parts: the implementation of the callbacks that I don't need, and the implementation of the callbacks that contain the logic for the tool.
      The first step is easy. All you need to do is provide a default handler for all the possible events that the profiler can receive. A good technique is to keep your ICorProfilerCallback implementation relatively simple and use a helper class as a placeholder for your profiler logic. I have followed this technique here, so you will find all the logic associated with the profiler in ProfilerInfo.cpp. Remember that your ICorProfilerCallback has to implement the pseudo-shutdown method; you should also define and initialize a global pointer to your ICorProfilerCallback. The implementation details are found in ProfilerCallback.cpp and ProfilerCallback.h.
      The second step is the most interesting part of the design phase, and it includes the actual implementation of the tool. First, I decided how to accomplish my objective and which callbacks I needed to monitor. The fewer callbacks I monitor, the less overhead there is for the CLR, so an important part of the design is to choose the optimum set of callbacks. In my case, the objective of the tool is to report execution times for every thread and for every managed function that was executed on that thread. During the execution of a managed application, I chose to monitor and calculate running time, profiler time, and suspended time per thread, while watching inclusive time, callee time, profiler time, and suspended time for each function executed on a thread.
      The profiler time represents the amount of time that was added to the execution of the application because of the profiler. To find the pure execution time, you have to subtract it from the running time. For every function, the profiler and suspended times are calculated when the function is on the top of the call stack for the specific thread. In brief, the tool needs to be able to keep track of the existing threads, the status of the threads (suspended or running), and the current call-stack on a per-thread basis.
      The first objective can be accomplished easily by monitoring thread-related events. The second one is less self-explanatory and it requires monitoring suspend and resume events. Finally, the third one is the most difficult to achieve. In order to successfully keep a call stack you need to disable inlining. As a result, you will get an accurate result of the time you are spending in each function without any "noise" from the inline functions that may be contained in the specific method.
      The next step is to enable enter-leave function events. The enabling of these events takes place in the ICorProfilerCallback::Initialize callback. The profiler API allows you to specify the function pointers that will be invoked every time a method is entered or left. The enter-leave handlers are simply function pointers for increased speed and simplicity. Implementing the hooks for the enter-leave and tail-leave events is shown in Figure 5.
      The actual functions used by the function hooks have to be static. Therefore, in the Initialize callback I invoked ICorProfilerInfo::SetEnterLeaveFunctionHooks in order to set up my enter-leave function notifications properly, as shown in the following code:

  hr = m_pProfilerInfo->SetEnterLeaveFunctionHooks (  
(FunctionLeave *)&LeaveNaked,
(FunctionTailcall *)&TailcallNaked );


      Unfortunately, I am not finished yet. That was the self-evident part. The CLR notifies the profiler when a managed function starts and ends its execution. What about the time spent executing unmanaged code? The enter-leave notifications don't provide any information about unmanaged code. In order to properly update the call stack, I have to correctly identify enter and leave events for unmanaged functions. My resources for that task are the transition callbacks that allow simulation of the desired events. A function-enter event for a native function is equivalent to a managed-to-unmanaged transition with reason COR_PRF_ TRANSITION_CALL. Similarly, a function-leave event is equivalent to an unmanaged-to-managed transition with reason COR_PRF_TRANSITION_RETURN. Therefore, the profiler that wants to accurately monitor the call stack and include information for the unmanaged functions that get executed will have to listen for transition events ICorProfilerCallback::ManagedToUnmanagedTransition and ICorProfilerCallback::UnmanagedToManagedTransition.
      At this point, I am handling any application that will potentially make calls to unmanaged code, but I still have one issue to deal with. How will the enter-leave callbacks behave when an exception occurs? This is a hidden pitfall of the CLR Profiler API. When an exception occurs, during the search and unwind phase the CLR notifies the profiler about which function is currently accessed. A call to ICorProfilerCallback::FunctionUnwindLeave replaces the function-leave or tail-leave callback with the specific managed method. Therefore, to maintain an accurate image of the call stack on a per-thread basis, the ICorProfilerCallback::FunctionUnwindLeave callback is considered to be the equivalent of a function-leave or tail-leave event.
      Notice that ICorProfilerCallback::FunctionUnwindLeave does not provide the ID of the method that is being unwound. To efficiently use this callback, you need to combine the information provided by the ICorProfilerCallback::FunctionUnwindEnter callback as well. The simplest way to accomplish that is to match every unwind-leave with the last received unwind-enter. The ideal data structure for that purpose is a stack, and that is what the sample in this article uses.
      The profiler event monitor mask is formed as follows:

  m_dwEventMask = (DWORD) (  


      ProfilerCallback.cpp contains detailed comments at the callbacks as to where this logic has been implemented. With the above exception handling information in mind, it should be pretty simple to follow the code. At this point, everything is in place. The profiler is keeping an active call-stack on a per-thread basis and it monitors the birth, death, and suspension-resume events for each thread. Next I will explain how to implement the timers and update them.

      Earlier, you saw the timers I am monitoring; now I will explain how each of them is calculated. For all the timers, I am using high-precision integers and making use of the QueryPerformanceCounter and QueryPerformanceFrequency Win32 APIs. They are declared in winbase.h, so you need to include windows.h and link against kernel32.lib. I am using one set of timers for each thread and one set of timers for each method executed on a per-thread basis.
      The first timer counts the total time that a thread was alive. To measure this, I capture the time at the point where the thread is created and calculate its total lifespan in the callback that gets destroyed. I am not actually removing the data from the profiler internal data structures; I simply mark the thread as "not alive." The suspension timer starts every time the profiler gets notified that a thread is being suspended and stops when the thread is being resumed. I use one timer to accumulate the total suspended time and another timer to hold the time attributed to the current suspension. The third timer represents the overhead caused by the profiling process.
      To get an accurate picture of my application, I need to take the profiling overhead into consideration and subtract it from the total running time of the thread. Towards that end, I have used this helper class contained in ProfilerInfo.h.

  class TimeTracker

TimeTracker( PrfInfo *pPrfInfo, ThreadID threadID = NULL );


ThreadInfo *m_pThreadInfo;

}; // TimeTracker


The class contains a pointer to the information for the thread that is executing the current callback and uses it to update the necessary counters. It also needs a pointer to the helper class that contains the list of threads and the times for each thread.
      In every callback where the tool is causing some overhead, I create an instance of the TimeTracker class that is a local in the callback. TimeTracker then goes away when I finish the processing of the callback. The constructor takes a snapshot of the current time and the destructor takes another snapshot. The difference between the two represents the time that should be counted as overhead. I am accumulating the profiler overhead from each callback for every thread.
      Here is another tip to keep in mind: after the profiler receives the shutdown callback from CLR, all the notifications are silently turned off. Therefore, the shutdown callback should be considered the final point where all the timers should be finalized and the actual times should be calculated.

      Every thread keeps its own list of methods that it has invoked during its lifetime. When the managed application completes its execution, the tool dumps the results on a per-thread basis. The profiler creates one entry in the list for each function invoked by a thread, and the timers associated with it represent the overall time spent in that method. I needed to deal with the problem of correctly calculating the times for functions that are invoked more than once from the same thread or for recursive functions. To solve the problem, I used a stack of timers. Every time a function is entered (I receive an enter event), I push a set of timers on the timer stack of that function. Any activity is recorded on the timers that are on the top of the function's timers stack. When the function is no longer on the stack (I receive a leave event for it), I update the global timers of the function.
      In the case of recursion, the maximum depth of the stack will be equal to the largest recursion depth that has been encountered during the program execution for the specific function in the given thread. The running time for a method starts when the profiler receives a function-enter event and stops when it receives a function-leave event. The suspension and profile time is being updated for a method if and only if upon reception of a callback, the method was on the top of the call stack of the current thread. Every time I receive a leave event for any method, I update its timers, pop it from the call stack, and add its total time to the callee time of the method that is now on the top of the stack.
      The case of a recursive function requires special handling. In this instance, I don't want to attribute the time from its multiple invocations to the callee time of the function. The callee time should include only direct calls to other methods.
      The output is saved in an ASCII, semicolon-delimited file that allows the user to process it easily using Microsoft Excel or any other text editor. I recommend using Excel because it provides sorting on any column, so the processing of the input file becomes a trivial task. Figure 6 shows a snapshot of the produced output, and Figure 7 provides an explanation of the calculated function times. The values of the Profiler and Suspended times were calculated when the function was on the top of the call stack for a specific thread. For example, if I sort by the Exclusive time I will find the function that took the longest to execute. Currently the tool doesn't provide a call graph, but this could easily be added.

Running the Sample

      All the code that implements the timer logic is located in ProfilerInfo.cpp, and the data structures I am using are in the form of templates and are located in container.hpp. Instructions on how to build the sample are included in the readme file that accompanies the source code.
      To enable profiling you need to do the following:

  @set Cor_Enable_Profiling=0x1
@regsvr32 /s ProfilerHST.dll
@set COR_PROFILER={FA0C6E77-AFF4-4c74-95F9-6216AFDE992C}


There is a batch file that will do this for you. Simply run EnableProfiler.bat and from that point on, you are ready to profile any managed application. At the end of the run, open the generated log file and analyze or parse the results. The display of the results can be customized simply by changing the functions FunctionTimingInfo::Dump and FunctionTimingInfo::DumpFunctionData in ProfilerInfo.cpp. The profiler keeps everything in memory and during the shutdown process dumps the results in a log file. I have not provided the most efficient implementation in terms of speed since that wasn't my objective. The time spent during the dumping of the information into a file is not attributed to any thread since it is pure overhead.

For background information see:
NET Framework Developer's Guide: Runtime Profiling

Anastasios Kasiolas joined Microsoft in July 1999 as a Software Design Engineer in Test. He now works in the .NET Common Language Runtime group in the Developer Services QA team. He is interested in systems programming, distributed applications, debugging, and performance.