Weird F020-F0FF characters in Word’s RTF

People have been inquiring about Word RTF’s occasional use of the Unicode Private Use Area (PUA) characters in the range U+F020..U+F0FF. These codes are also used in WordprocessingML, which is defined by the ECMA-376 standard. This post explains what Word means by those characters. But first note a couple of things:

 

1) Unicode assigns no meaning to characters in the PUA, that is, those in the range U+E000..U+F8FF. So it’s up to a higher-level protocol to define the meaning. In general it’s a really bad idea to use the PUA if you’re interested in data interchange, because the program that reads such data may well display nothing or display completely different characters than you intended. That’s why it’s called “private use”, something for you and your friends who are in cahoots with you.

 

2) The original syntax of an RTF control word defines the numeric parameter to be a signed 16-bit decimal number. For most control words that have a numeric parameter, Word does use a signed 16-bit decimal number. In particular, for the \uN Unicode control word, N has this format. If the high bit of a 16-bit number is 1, the number is negative, and this is true for all codes in the range U+8000..U+FFFF. To get the RTF 16-bit signed decimal value, convert the Unicode hex value to decimal and, if the result is greater than 32767, subtract 65536. Accordingly U+F020 (decimal 61472) is represented by \u-4064 and U+F0FF (decimal 61695) by \u-3841. It’s true that later on Word learned that 32-bit numbers exist and so some more recent RTF control words like \rsid (revision save IDs) have parameters much larger than 65536, let alone 32767 (the most positive 16-bit signed number). RichEdit even supports reading \uN with N being the decimal UTF-32 value corresponding to a surrogate pair (now isn’t that cool?!).
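Here is a minimal Python sketch of that arithmetic, in case you want to check values yourself. The function names are made up for illustration and aren't part of any RTF library. The reverse direction also tolerates unsigned values, on the assumption that some writers emit them.

```python
def rtf_u_param(code_unit: int) -> int:
    # Signed 16-bit decimal N for the \uN control word.
    if not 0 <= code_unit <= 0xFFFF:
        raise ValueError("expected a 16-bit code unit")
    # Codes above 32767 have the high bit set, so they come out negative.
    return code_unit - 65536 if code_unit > 32767 else code_unit


def code_unit_from_rtf_u(n: int) -> int:
    # Inverse mapping: recover the unsigned 16-bit code unit from N.
    if not -32768 <= n <= 0xFFFF:
        raise ValueError("N is outside the 16-bit range")
    return n & 0xFFFF


print(rtf_u_param(0xF020))              # -4064
print(rtf_u_param(0xF0FF))              # -3841
print(hex(code_unit_from_rtf_u(-4064)))  # 0xf020
```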

 

Given the strong recommendation not to use the PUA, why would Word nevertheless go ahead and use it? If the choice were made today, I seriously doubt that Word would, but back in 1995 when Word started switching to Unicode, it wasn’t so obvious. Furthermore it solved a pesky problem with special non-Unicode fonts known as “symbol fonts”, or more precisely symbol-charset fonts. By their very definition, these fonts do not use Unicode code points. So while U+0041 stands for ‘A’ in a Unicode font, in a symbol-charset font like Wingdings, it stands for whatever character has hex code 0041, namely the Wingdings victory-hand glyph ✌. You must agree that ✌ looks nothing like ‘A’, so the Word 97 folks decided to give it a distinct value, namely F000 + 41 = F041 (in hex). This is also the value that Microsoft TrueType symbol-charset fonts use in the Unicode cmap (character-to-glyph mapping table). Often a symbol-charset character is defined by a SYMBOL field with a character code in the hex range 20 to FF.
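The offset is simple enough to sketch in a few lines of Python. The helper names below are hypothetical and are only meant to illustrate the F000 + code rule described above.

```python
from typing import Optional

PUA_BASE = 0xF000  # offset Word adds to a symbol-charset character code


def symbol_to_pua(char_code: int) -> int:
    # Map a symbol-charset code (hex 20..FF) to its PUA code point.
    if not 0x20 <= char_code <= 0xFF:
        raise ValueError("symbol-charset codes run from 0x20 to 0xFF")
    return PUA_BASE + char_code


def pua_to_symbol(code_point: int) -> Optional[int]:
    # Return the native symbol-charset code, or None if outside U+F020..U+F0FF.
    if 0xF020 <= code_point <= 0xF0FF:
        return code_point - PUA_BASE
    return None


print(f"U+{symbol_to_pua(0x41):04X}")  # U+F041
print(hex(pua_to_symbol(0xF041)))      # 0x41
```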

 

A key point here is that Word RTF may treat any symbol-charset character this way, so merely getting a character in the range U+F020..U+F0FF does not mean you know which symbol-charset font is involved. For that you need to find the last symbol-charset font control word \fN, look up font N in the font table and find its face name. The charset is specified by the \fcharsetN control word and the symbol-charset is N = 2. In contrast, RichEdit does not use U+F020..U+F0FF for characters in symbol-charset fonts; it uses the native values 0020 through 00FF, and both RichEdit and Word read the resulting RTF just fine. In many cases Word, too, uses the range 0020 through 00FF for symbol-charset font characters, so Word's use of F020 through F0FF isn't exclusive.
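To make the lookup concrete, here is a rough Python sketch. It is not a real RTF parser: it assumes a flat font table whose entries contain no nested groups, it ignores the fact that font changes are scoped by RTF groups, and the sample string and function names are invented for illustration.

```python
import re

SYMBOL_CHARSET = 2  # \fcharset2 marks a symbol-charset font

# Hypothetical sample: \u-4031 is U+F041, written while font 1 (Wingdings) is active.
SAMPLE = (r"{\rtf1\ansi{\fonttbl{\f0\froman\fcharset0 Times New Roman;}"
          r"{\f1\fnil\fcharset2 Wingdings;}}\f0 Hello \f1\u-4031?\f0 world}")


def parse_font_table(rtf):
    # Collect {font number: (charset, face name)} from simple font table entries.
    entry = re.compile(r"\{\\f(\d+)[^{}]*?\\fcharset(\d+)[^{}]*? ([^;{}]+);\}")
    return {int(n): (int(cs), face) for n, cs, face in entry.findall(rtf)}


def symbol_chars(rtf):
    # List (face name, native code) for each \uN in U+F020..U+F0FF
    # that appears while a symbol-charset font is selected.
    fonts = parse_font_table(rtf)
    # Drop the font table so its \fN words are not mistaken for font switches.
    body = re.sub(r"\{\\fonttbl(?:\{[^{}]*\})*\}", "", rtf)
    current = None
    found = []
    for m in re.finditer(r"\\f(\d+)|\\u(-?\d+)", body):
        if m.group(1) is not None:
            current = int(m.group(1))          # most recent \fN
        else:
            cp = int(m.group(2)) & 0xFFFF      # undo the signed 16-bit encoding
            charset, face = fonts.get(current, (None, None))
            if 0xF020 <= cp <= 0xF0FF and charset == SYMBOL_CHARSET:
                found.append((face, cp - 0xF000))
    return found


print(symbol_chars(SAMPLE))  # [('Wingdings', 65)]
```

A real reader would also have to push and pop the current font at each { and }, and handle the \ucN fallback characters that follow \uN.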

 

For math, probably the most relevant symbol-charset font is the Symbol font itself, since it has most of the Greek letters used in math along with some useful math operators and operator pieces. But since Unicode has roughly 10 times as many math characters and includes all 224 characters in the Symbol font, the Symbol font is basically useless for math at this point in time. Read: avoid it if you can :-)