Performance work is slippery

Article
08/13/2008

… like a water snake. Which is why we measure, measure and measure again. Even the most obvious rule of thumb can be wrong. I was optimizing some string parsing code (which is almost always ripe for optimization), and trying to reduce the cost of traversing a string.

When iterating over the elements of an array there are two typical models:
index increment

     while (ptr[index] != 0)
    {
        index++;
        //  do stuff with ptr[index]
    }

pointer increment

     while (*ptr != 0)
    {
        ptr++;
        //  do stuff with *ptr
    }

And while they both generate reasonable code the index increment is typically ~2x the cost of the pointer increment. For iterations that aren’t called much or that have expensive inner operations this is likely not noticeable. This makes sense because it avoids having to recalculate the offset to the current character inside the loop (ptr[index]). The incremented pointer is the current offset. My rule of thumb is to use array indexes when random access is required and use pointer increment for iteration.

As an example for this rule of thumb, I wrote the simplest of string iterations: calculating a string’s length.

 size_t slen_index(PCWSTR str)
{
    size_t index = 0;
    while (str[index] != 0)
    {
        index++;
    }
    return index;
}

size_t slen_ptr(PCWSTR str)
{
    PCWSTR ptr = str;
    while (*ptr != 0)
    {
        ptr++;
    }
    return ptr - str;
}

And then I measured to find out how much faster the ptr loop is (lower numbers are faster):

     slen_ptr   = 265
    slen_index = 188

Surprise, the intuitively “faster” version is actually slower. Peeking at the disassembly provides a quick answer. slen_index() keeps index in a register instead of a stack location, whereas slen_ptr()stores the incremented back into the stack variable. This is a pathological case where the compiler is able to make a better optimization because we don’t do anything with the characters we are traversing. Let's check out functions that actually do something in the loop.

 #define IS_ALLOWED(ch) (((ch) > 0x20) && (((ch) < 0x80))) 

bool isallowed_index(PCWSTR str)
{
    size_t index = 0;
    while (str[index] != 0)
    {
        index++;
        if (!IS_ALLOWED(str[index]))
            return false;
    }
    return true;
}

bool isallowed_ptr(PCWSTR str)
{
    PCWSTR ptr = str;
    while (*ptr != 0)
    {
        ptr++;
        if (!IS_ALLOWED(*ptr))
            return false;
    }
    return true;
}

And now measure (lower numbers are faster):

     isallowed_ptr    = 250
    isallowed_index  = 390

That’s more like it! My rule of thumb makes more sense, but the stronger rule is to measure, measure, and measure again!

PS - these numbers also show that for most functions it probably won't make a difference in your app. But if the function is called a lot, it can (eg strlen()).

Performance work is slippery

Additional resources