StrCmpLogicalA
Because of a policy of supporting and encouraging internationalizable software, certain system APIs exist only in a UNICODE version (functions that typically carry the W suffix). One example that lacks a plain old ANSI version is the StrCmpLogicalW function. Unfortunately for me, I needed an A version. "Fixing" my code to use the W version was impractical given the data I had to work with, but I really liked the idea of sorting my strings so that embedded numbers land in an order that makes sense to a human being. So I did the only logical thing left to do...
Just to get my code up and running, I first did the easy thing: I wrote an A version that takes the ANSI strings as inputs, allocates two new buffers, converts the inputs to UNICODE, and then calls the W version. Needless to say, this incurs a lot of overhead, because all of it has to happen on every single string compare, which isn't pretty.
My next cut (shown below) actually walks the ANSI strings and does some nitty-gritty comparisons.
#include <stddef.h> // size_t

int StrCmpLogicalA(const char *psz1, const char *psz2)
{
    // handle NULL inputs
    if(!psz1 && !psz2) return 0;
    if(!psz1) return -1;
    if(!psz2) return 1;

    while(*psz1 && *psz2)
    {
        if(*psz1 >= '0' && *psz1 <= '9' &&
           *psz2 >= '0' && *psz2 <= '9') // numerical
        {
            // keep track of where we are starting
            const char* digit1 = psz1;
            const char* digit2 = psz2;

            // strip off any leading zeros, counting them as we go
            size_t leading1 = 0;
            size_t leading2 = 0;
            while(*digit1 == '0')
            {
                ++leading1;
                ++digit1;
            }
            while(*digit2 == '0')
            {
                ++leading2;
                ++digit2;
            }

            // scan to the end of the digits
            while(*psz1 >= '0' && *psz1 <= '9')
                ++psz1;
            while(*psz2 >= '0' && *psz2 <= '9')
                ++psz2;

            // calc the number of significant digits (leading zeros excluded);
            // the number with more significant digits is the larger one
            size_t len1 = psz1 - digit1;
            size_t len2 = psz2 - digit2;
            if(len1 < len2) return -1;
            if(len1 > len2) return 1;

            // same length, so walk over the digits from most significant
            while(digit1 < psz1 && digit2 < psz2)
            {
                // test the digit
                if(*digit1 < *digit2) return -1;
                if(*digit1 > *digit2) return 1;
                ++digit1;
                ++digit2;
            }

            // if we reach here, the numbers are the same, and
            // psz1 and psz2 already point at the next character
            // to test

            // since we kept track of leading zeros, we can add
            // precedence based off that (fewer zeros sorts first).
            if(leading1 < leading2) return -1;
            if(leading1 > leading2) return 1;
        }
        else // mixed and non numerical
        {
            unsigned char c1 = *psz1;
            unsigned char c2 = *psz2;

            // strip off the lower case bit (fold a-z to A-Z)
            if(c1 >= 'a' && c1 <= 'z') c1 &= 0xDF;
            if(c2 >= 'a' && c2 <= 'z') c2 &= 0xDF;

            // test the characters
            if(c1 < c2) return -1;
            if(c1 > c2) return 1;

            // else they are the same, keep walking
            ++psz1;
            ++psz2;
        }
    }

    // check for unprocessed characters (the shorter string sorts first)
    if(!*psz1 && *psz2) return -1;
    if(*psz1 && !*psz2) return 1;

    // strings are equivalent
    return 0;
}
As you can see, this version does a case-insensitive compare. If you happen to need a case-sensitive one, just comment out the two lines of code that strip out the 6th bit (the ones that AND c1 and c2 with 0xDF). If you want "abc03def" to be treated the same as "abc003def", comment out the two lines that compare the leading-zero counts (leading1 and leading2).
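For anyone wondering what "stripping the 6th bit" actually does, here is the same masking pulled out into a tiny helper. This is only an illustrative sketch: the name fold_ascii_upper is made up for this example and is not part of the function above, and it only makes sense for plain ASCII letters.

```c
/* Hypothetical helper (not part of the original function): fold an ASCII
   lowercase letter to uppercase by clearing bit 0x20, which is the same
   masking the comparison loop performs inline on c1 and c2. */
unsigned char fold_ascii_upper(unsigned char c)
{
    /* Only touch a-z; masking other characters would corrupt them
       (for example '{' is 0x7B and would turn into '[' 0x5B). */
    if (c >= 'a' && c <= 'z')
        c &= 0xDF;  /* 0xDF has every bit set except 0x20: 'a' 0x61 -> 'A' 0x41 */
    return c;
}
```

The range check is what keeps the trick safe: in ASCII, upper and lower case letters differ only in that one bit, but plenty of punctuation has the bit set too, so it has to be left alone.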
Disclaimers
1. This has only gone through minimal testing, use at your own risk, etc. (Though if you do find bugs let me know!)
2. Even though I named it StrCmpLogicalA in this blog, I make no implied statement about behavioral conformance with StrCmpLogicalW. I've never seen the algorithm or code behind the official API, and the above sample undoubtedly does things differently. In fact, I know it does some things differently.
3. Developers are encouraged to use UNICODE, and my apologies go out to those trying to promote its wider use.
Comments
- Anonymous
July 13, 2015
I want to use your function, but it is not compatible with qsort. Can you make a compatible version so I can pass your function as the fourth parameter to qsort? msdn.microsoft.com/.../zes7xw0h.aspx It would be very useful. THANKS!!
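On the qsort question: qsort calls its comparator with pointers to the array elements rather than the elements themselves, so a thin adapter should be all that is needed. The sketch below assumes the array being sorted holds const char * entries. StrCmpLogicalA_qsort is a made-up name, and the StrCmpLogicalA body here is a condensed copy of the function from the post, included only so the example compiles on its own.

```c
#include <stdlib.h>

/* Condensed copy of the post's StrCmpLogicalA so this sketch is standalone;
   leave it out if the full function is already in your source file. */
static int is_digit(char c) { return c >= '0' && c <= '9'; }

int StrCmpLogicalA(const char *psz1, const char *psz2)
{
    if (!psz1 || !psz2) return (psz1 != NULL) - (psz2 != NULL);
    while (*psz1 && *psz2) {
        if (is_digit(*psz1) && is_digit(*psz2)) {
            /* skip and count leading zeros */
            const char *d1 = psz1, *d2 = psz2;
            while (*d1 == '0') ++d1;
            while (*d2 == '0') ++d2;
            size_t zeros1 = (size_t)(d1 - psz1), zeros2 = (size_t)(d2 - psz2);
            /* advance past the whole run of digits */
            while (is_digit(*psz1)) ++psz1;
            while (is_digit(*psz2)) ++psz2;
            /* more significant digits means a larger number */
            size_t len1 = (size_t)(psz1 - d1), len2 = (size_t)(psz2 - d2);
            if (len1 != len2) return len1 < len2 ? -1 : 1;
            for (; d1 < psz1; ++d1, ++d2)
                if (*d1 != *d2) return *d1 < *d2 ? -1 : 1;
            /* equal values: fewer leading zeros sorts first */
            if (zeros1 != zeros2) return zeros1 < zeros2 ? -1 : 1;
        } else {
            unsigned char c1 = (unsigned char)*psz1++;
            unsigned char c2 = (unsigned char)*psz2++;
            if (c1 >= 'a' && c1 <= 'z') c1 &= 0xDF;
            if (c2 >= 'a' && c2 <= 'z') c2 &= 0xDF;
            if (c1 != c2) return c1 < c2 ? -1 : 1;
        }
    }
    return (*psz1 != '\0') - (*psz2 != '\0');
}

/* Hypothetical adapter: for an array of const char *, each argument qsort
   passes in is really a pointer to one of those string pointers. */
int StrCmpLogicalA_qsort(const void *a, const void *b)
{
    const char *s1 = *(const char *const *)a;
    const char *s2 = *(const char *const *)b;
    return StrCmpLogicalA(s1, s2);
}
```

With that in place, a call along the lines of qsort(names, count, sizeof names[0], StrCmpLogicalA_qsort) should sort an array of const char * in logical order. If your array stores the characters inline instead (say, char names[N][MAX_PATH]), the adapter would need to cast the arguments straight to const char * rather than dereference them.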