Improving CLRProfiler 4: Reducing SampleObject memory consumption by 58%

In the previous three posts, we doubled CLRProfiler's file loading speed through profile-guided optimization in three simple steps. Now let's look at reducing CLRProfiler's memory consumption, which makes it more useful for real-world applications.

I created a 10 GB profile using a performance test. The test program creates 19.7 million managed objects averaging 195 bytes each, allocating a total of 3.85 GB, with 32 garbage collections. CLRProfiler loads the file in 83 seconds with a 208 MB private working set.

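Using the !DumpHeap -stat command from the SOS debugging extension, we can easily see what is really consuming memory in CLRProfiler (the output is sorted by total size, so the largest consumers come last):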

      MT    Count    TotalSize Class Name
002f4cc4        2      1040028 CLRProfiler.TimePos[]
618ff9ac    16048      1371112 System.String
618eebd4        2      4194336 System.UInt16[]
61902938    40227      4754996 System.Int32[]
002f2190  7870007    188880168 CLRProfiler.SampleObjectTable+SampleObject

The most expensive type in memory is SampleObjectTable.SampleObject: there are 7.87 million instances, each occupying 24 bytes. The SampleObject class itself has three integer fields and one object reference. These four fields should take only 16 bytes, but on this 32-bit process the CLR adds a 4-byte method table pointer and a 4-byte object header (sync block index) to every object.
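On the 32-bit CLR used here, the 24 bytes per instance break down as follows, and multiplying by the instance count reproduces the DumpHeap total exactly:

$$
\underbrace{3 \times 4}_{\text{int fields}} + \underbrace{4}_{\text{prev}} + \underbrace{4}_{\text{MT pointer}} + \underbrace{4}_{\text{object header}} = 24\ \text{bytes}, \qquad 7{,}870{,}007 \times 24\ \text{bytes} = 188{,}880{,}168\ \text{bytes}
$$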

internal class SampleObject
{
    internal int typeIndex;
    internal int changeTickIndex;
    internal int origAllocTickIndex;
    internal SampleObject prev;

    internal SampleObject(int typeIndex, int changeTickIndex, int origAllocTickIndex, SampleObject prev)
    {
        this.typeIndex = typeIndex;
        this.changeTickIndex = changeTickIndex;
        this.origAllocTickIndex = origAllocTickIndex;
        this.prev = prev;
    }
}
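One way to remove that per-object overhead is to turn SampleObject into a value type stored in one flat array, with the prev reference becoming an array index. The following is a hypothetical sketch of that intermediate step only: SampleRecord and SampleRecordTable are illustrative names, not CLRProfiler types, and using -1 for "no previous object" is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical intermediate step: SampleObject as a value type whose prev
// reference has become an index into the same array (-1 = no previous object).
internal struct SampleRecord
{
    internal int typeIndex;
    internal int changeTickIndex;
    internal int origAllocTickIndex;
    internal int prevIndex;
}

internal sealed class SampleRecordTable
{
    // One flat array of 16-byte structs: no per-object header or method table pointer.
    private SampleRecord[] records = new SampleRecord[16];
    private int count;

    internal int Count => count;

    internal SampleRecord this[int index] => records[index];

    // Appends a record and returns its index, which later records can use as prevIndex.
    internal int Add(int typeIndex, int changeTickIndex, int origAllocTickIndex, int prevIndex)
    {
        if (count == records.Length)
            Array.Resize(ref records, records.Length * 2);  // amortized doubling, like List<T>
        records[count] = new SampleRecord
        {
            typeIndex = typeIndex,
            changeTickIndex = changeTickIndex,
            origAllocTickIndex = origAllocTickIndex,
            prevIndex = prevIndex,
        };
        return count++;
    }

    // Follows the prev chain by index instead of by object reference.
    internal IEnumerable<int> Chain(int index)
    {
        for (int i = index; i >= 0; i = records[i].prevIndex)
            yield return i;
    }
}
```

Each entry now costs 16 bytes instead of 24 on 32-bit, and because prev is just a number, it can later be shrunk to a 16-bit distance.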


If we store SampleObjects in an array, we can convert the reference to the previous SampleObject into an index into that array. SampleObject can then be declared as a structure, and all instances packed together in one big array, removing the 8-byte per-object overhead. In most cases, typeIndex, changeTickIndex, and origAllocTickIndex are small integers that can be stored as 16-bit rather than 32-bit integers. The last field, prev, which refers to the previous SampleObject, can be a large index depending on the program being profiled. Normally, however, the current object and the previous object are not far apart, so the difference between their indices fits in a 16-bit integer. Each record in the UInt16[] chunks therefore has a short form, used when the values fit, and a long form that splits every field into two 16-bit words. To minimize the impact on other code that uses SampleObject, we provide a method that reconstructs a SampleObject given an index:

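/// <summary>
/// Reconstructs a SampleObject from the packed 16-bit storage, given its index.
/// </summary>
internal SampleObject GetSampleObject(int index)
{
    // Locate the chunk and the record's first word within it.
    UInt16[] chunk = m_sampleChunks[index / SampleObjectChunkSize] as UInt16[];
    int p = index % SampleObjectChunkSize;
    SampleObject obj;
    UInt16 w0 = chunk[p];
    if ((w0 & bit_small) != 0)
    {
        // Short form: the low 14 bits of the first word hold typeIndex,
        // followed by 16-bit changeTickIndex and origAllocTickIndex.
        if ((w0 & bit_noprev) != 0)
        {
            // No previous object; prev is set to 0.
            obj = new SampleObject(w0 & 0x3FFF, chunk[p + 1], chunk[p + 2], 0);
        }
        else
        {
            // The fourth word is the distance back to the previous object.
            obj = new SampleObject(w0 & 0x3FFF, chunk[p + 1], chunk[p + 2], index - chunk[p + 3]);
        }
    }
    else
    {
        // Long form: each field, including the prev distance, is split
        // into a high and a low 16-bit word.
        obj = new SampleObject(
                    (((int) chunk[p]) << 16) + chunk[p + 1],
                    (((int) chunk[p + 2]) << 16) + chunk[p + 3],
                    (((int) chunk[p + 4]) << 16) + chunk[p + 5],
                    index - ((((int) chunk[p + 6]) << 16) + chunk[p + 7]));
    }
    return obj;
}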
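GetSampleObject shows only the reading side. A writer has to choose the short or long form and keep each record inside a single chunk. Below is a hypothetical, self-contained sketch of such a writer with a matching reader. PackedSampleStore, the flag values 0x8000 and 0x4000, the chunk size, the 3-word short form without prev, and the convention that prev == 0 means "no previous object" are all assumptions inferred from GetSampleObject, not CLRProfiler's actual code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical writer/reader pair for the packed SampleObject format.
// Assumed (not taken from CLRProfiler): flag values, chunk size, record
// lengths of 3/4/8 words, and prev == 0 meaning "no previous object".
internal sealed class PackedSampleStore
{
    internal const ushort BitSmall = 0x8000;   // short-form record
    internal const ushort BitNoPrev = 0x4000;  // short-form record without a prev distance
    internal const int ChunkSize = 1 << 16;    // 16-bit words per chunk

    private readonly List<ushort[]> chunks = new List<ushort[]>();
    private int next = 1;  // word offset 0 is never used, so prev == 0 is unambiguous

    internal int WordsUsed => next;

    // Appends a sample and returns its index (a word offset) for later use as prev.
    internal int Add(int typeIndex, int changeTickIndex, int origAllocTickIndex, int prev)
    {
        if (typeIndex < 0 || changeTickIndex < 0 || origAllocTickIndex < 0 || prev < 0 || prev >= next)
            throw new ArgumentOutOfRangeException();

        int index = next;
        int length = RecordLength(index, typeIndex, changeTickIndex, origAllocTickIndex, prev);
        if (index % ChunkSize + length > ChunkSize)
        {
            // Skip the tail of the current chunk so a record never straddles two chunks.
            index = (index / ChunkSize + 1) * ChunkSize;
            length = RecordLength(index, typeIndex, changeTickIndex, origAllocTickIndex, prev);
        }
        while (chunks.Count <= index / ChunkSize)
            chunks.Add(new ushort[ChunkSize]);

        ushort[] chunk = chunks[index / ChunkSize];
        int p = index % ChunkSize;
        if (length == 8)
        {
            // Long form: every field and the prev distance split into high and low words.
            WriteInt(chunk, p, typeIndex);
            WriteInt(chunk, p + 2, changeTickIndex);
            WriteInt(chunk, p + 4, origAllocTickIndex);
            WriteInt(chunk, p + 6, index - prev);
        }
        else
        {
            // Short form: flags plus 14-bit typeIndex, then two 16-bit fields.
            int flags = length == 3 ? BitSmall | BitNoPrev : BitSmall;
            chunk[p] = (ushort)(flags | typeIndex);
            chunk[p + 1] = (ushort)changeTickIndex;
            chunk[p + 2] = (ushort)origAllocTickIndex;
            if (length == 4)
                chunk[p + 3] = (ushort)(index - prev);  // distance back to the previous record
        }
        next = index + length;
        return index;
    }

    // Mirrors GetSampleObject: decodes the record stored at index.
    internal (int TypeIndex, int ChangeTickIndex, int OrigAllocTickIndex, int Prev) Get(int index)
    {
        ushort[] chunk = chunks[index / ChunkSize];
        int p = index % ChunkSize;
        ushort w0 = chunk[p];
        if ((w0 & BitSmall) != 0)
        {
            int prev = (w0 & BitNoPrev) != 0 ? 0 : index - chunk[p + 3];
            return (w0 & 0x3FFF, chunk[p + 1], chunk[p + 2], prev);
        }
        return (ReadInt(chunk, p), ReadInt(chunk, p + 2), ReadInt(chunk, p + 4),
                index - ReadInt(chunk, p + 6));
    }

    // 3 words: short, no prev; 4 words: short with prev; 8 words: long form.
    private static int RecordLength(int index, int typeIndex, int changeTickIndex, int origAllocTickIndex, int prev)
    {
        bool small = typeIndex <= 0x3FFF && changeTickIndex <= 0xFFFF && origAllocTickIndex <= 0xFFFF;
        if (small && prev == 0) return 3;
        if (small && index - prev <= 0xFFFF) return 4;
        return 8;
    }

    private static void WriteInt(ushort[] chunk, int p, int value)
    {
        chunk[p] = (ushort)(value >> 16);      // high word; bit 15 stays clear for non-negative values
        chunk[p + 1] = (ushort)(value & 0xFFFF);
    }

    private static int ReadInt(ushort[] chunk, int p) => (chunk[p] << 16) + chunk[p + 1];
}
```

In this sketch, a sample with small values and a nearby prev costs 4 words (8 bytes), while one whose typeIndex exceeds 0x3FFF falls back to the 8-word long form.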

Here is what DumpHeap -stat shows after the change:

      MT    Count    TotalSize Class Name
609bf9ac    11663      1242188 System.String
00584a80        4      1808060 CLRProfiler.TimePos[]
609c2938    40176      5176228 System.Int32[]
609aebd4      935     84447264 System.UInt16[]

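The 188.8 MB of SampleObject instances has been replaced by an 80.2 MB increase in UInt16[] arrays, 42% of the original size. All 7.87 million SampleObjects are now packed into 16-bit integer arrays. The savings will be even larger on 64-bit machines, where every object carries 16 bytes of overhead and each object reference takes 8 bytes.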
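Counting only the growth of UInt16[] between the two listings, the arithmetic behind the 42% figure (and the 58% reduction in the title) and the average cost per packed SampleObject, compared with 24 bytes before, is:

$$
\frac{84{,}447{,}264 - 4{,}194{,}336}{188{,}880{,}168} = \frac{80{,}252{,}928}{188{,}880{,}168} \approx 0.42, \qquad \frac{80{,}252{,}928\ \text{bytes}}{7{,}870{,}007} \approx 10.2\ \text{bytes per SampleObject}
$$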