UTF-16, UTF-8 & UTF-32 update to conform with Unicode 5.0's security concerns.

The behavior described in my post Change to Unicode Encoding for Unicode 5.0 conformance now applies to .Net 2.0 with MS07-040 applied.  For more information, please see the list of known issues for MS07-040 described in KB 931212; KB 940521 describes this behavior in particular.  This fix reduces the chance of spoofing with similar-looking strings.  Unicode 5.0 specifies this change because of those security concerns.

As mentioned in the KB:

Before this change, invalid characters in the middle of text strings would only be silently removed. For example, the string "Ad\xD800min\xDC00istrator" would change to "Administrator" as the Unicode characters U+D800 and U+DC00 are invalid. This could cause a security problem for some programs. After you install the security update MS07-040, this string would now become "Ad\xFFFDmin\xFFFDistrator", and decode to "Ad�min�istrator" where the � is the Unicode replacement character.
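
To make the KB example concrete, here is a minimal sketch in Python rather than .Net. That choice is an assumption made purely for illustration: Python's "ignore" and "replace" error handlers stand in for the behavior before and after the update, and the helper names build_sample and decode_utf16le are hypothetical.

```python
# Illustration only: this uses Python's codec error handlers,
# not the .Net implementation changed by MS07-040.

def build_sample() -> bytes:
    """UTF-16LE bytes for "Ad", a lone U+D800, "min", a lone U+DC00, "istrator"."""
    return (
        "Ad".encode("utf-16-le")
        + b"\x00\xd8"                       # lone high surrogate U+D800
        + "min".encode("utf-16-le")
        + b"\x00\xdc"                       # lone low surrogate U+DC00
        + "istrator".encode("utf-16-le")
    )

def decode_utf16le(data: bytes, errors: str) -> str:
    """Decode UTF-16LE bytes with the given error-handling policy."""
    return data.decode("utf-16-le", errors=errors)

sample = build_sample()

# "ignore" silently drops the invalid code units (analogous to the old behavior).
dropped = decode_utf16le(sample, "ignore")

# "replace" substitutes U+FFFD for each invalid code unit (analogous to the new behavior).
replaced = decode_utf16le(sample, "replace")

print(dropped == "Administrator")                # True: the spoof succeeds
print(replaced == "Ad\ufffdmin\ufffdistrator")   # True: the tampering stays visible
```

A comparison against "Administrator" succeeds on the first result and fails on the second, which is the point of the change.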

We first introduced this behavior in Vista, and since then I've received several reports of issues with the new behavior.  In nearly all of those cases, flawed assumptions contributed to the problems.  Some examples were:

  • Programs trying to convert byte[] arrays to Unicode (see Avoid treating binary data as a string) and then having problems when the data didn't round trip.  Note that the data didn't round trip prior to this change either; data was lost.  After the change the loss is more obvious because the FFFD's are present (which is the point of the security aspect of the change by the Unicode consortium).
  • Doing something like that, then trying to make a hash of the resulting value.  After the update the hash doesn't match.  Note that even prior to the update a very large number of different values had the same hash, so this was not nearly as secure as the application had hoped.
  • Some applications had bugs in their Unicode handling, such as accidentally decoding extra byte(s) instead of complete pairs, which produced illegal UTF-16 or UTF-8.  The illegal sequences were ignored, so the app worked despite the bug; with the update applied the bug is no longer hidden.
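
The first two problems can be sketched as follows. This is again Python rather than .Net, chosen only for illustration; the sample bytes are made up, and Base64 and hashing the raw bytes are shown as commonly used alternatives, not as a fix prescribed by the KB.

```python
import base64
import hashlib

# Arbitrary binary data (hypothetical sample); 0xFF is never valid in UTF-8.
raw = bytes([0x41, 0xFF, 0x42])
other = bytes([0x41, 0x42])          # different binary data

# Treating the bytes as text loses data under either policy.
as_text_ignore = raw.decode("utf-8", errors="ignore")     # invalid byte dropped
as_text_replace = raw.decode("utf-8", errors="replace")   # invalid byte becomes U+FFFD
round_trips_ignore = as_text_ignore.encode("utf-8") == raw
round_trips_replace = as_text_replace.encode("utf-8") == raw

# When invalid bytes are dropped, different inputs collapse to the same string,
# so a hash of the decoded string collides.
same_string = as_text_ignore == other.decode("utf-8", errors="ignore")
hash_a = hashlib.sha256(as_text_ignore.encode("utf-8")).hexdigest()
hash_b = hashlib.sha256(other.decode("utf-8", errors="ignore").encode("utf-8")).hexdigest()

# Safer: carry binary data as Base64 text and hash the raw bytes directly.
b64_text = base64.b64encode(raw).decode("ascii")
restored = base64.b64decode(b64_text)
raw_hashes_differ = hashlib.sha256(raw).digest() != hashlib.sha256(other).digest()

print(round_trips_ignore, round_trips_replace)   # False False
print(same_string, hash_a == hash_b)             # True True
print(restored == raw, raw_hashes_differ)        # True True
```

Neither decoding policy round trips the data; the replacement characters only make the loss visible instead of hiding it.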

Note that before the update .Net 2.0 on Vista and .Net 2.0 RTM had different Unicode decoding behavior.  With the update applied they have the same behavior.

Hope this is helpful,

Shawn

Comments

  • Anonymous
    July 23, 2007
    Shawn, Thanks for this post. I now understand why this change was made in a security update. Thanks also for explaining the nature of the changes. However, it seems to me that this change needs to be explained to a wider audience of developers. Eventually, everyone will have the security update installed, which means that the behavior of every .NET 2.0 system will have changed. To me, that means that we all need to know about the changes, if only so that we can assure ourselves that the changes don't matter to us. That goes for the other changes in this security patch as well. For instance, there are many who would like to understand the details of why their XML Serialization code no longer works. Thanks again, John Saunders MVP – Windows Server System – Connected System Developer

  • Anonymous
    July 23, 2007
    Sorry, can't help with the serialization issue.  Note that this is consistent with the Vista behavior, so hopefully it'll be easier to verify the Vista and XP behaviors.

  • Anonymous
    September 16, 2007
    (The purpose for the characters below should be apparent presently!) U+fffd U+fffd U+fffd U+fffd U+fffd
