RichEdit Plain-Text Controls

A Unicode plain-text editor appears to have a single set of character formatting properties for the entire text and a single set of paragraph formatting properties. With Notepad, for example, you can choose a normal, bold, italic, or bold-italic font of any reasonable size and your choice is used consistently throughout the text (at least if the text is all of one script). In particular, you cannot have a run of text with a bold font followed by a run with a normal weight font. Such variations are nominally the province of rich text. Paragraph properties are limited to the BiDi attributes of left versus right alignment and left-to-right versus right-to-left directionality, and they also are used uniformly throughout the text.

But things aren’t as simple as they seem, partly because TrueType glyph indices are 16-bit numbers, limiting a single font to 65535 glyphs, and Unicode has more than 110,000 characters. Therefore multiple fonts are needed to display arbitrary Unicode text. In this sense, any Unicode plain-text editor has to have some degree of “richness”. Furthermore, IMEs (input method editors) are used for entering Japanese, Chinese, and Korean text, and they need temporary character formatting such as underlining. Accordingly, Notepad uses multiple fonts when necessary and has temporary formatting for IMEs as well. Spell and grammar checking require similar temporary formatting, such as squiggly underlines.
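The glyph-count argument can be checked with a little arithmetic. The sketch below is only an illustration: the function name minFontsNeeded is made up for this post, and it assumes one glyph per character, which real fonts rarely manage since many scripts need several glyphs per character. The result is therefore a lower bound, not a count of the fonts any editor actually uses.

```cpp
#include <cassert>

// Lower bound on the number of fonts needed to cover `characters` characters
// when a single font can hold at most `maxGlyphsPerFont` glyphs.
// Assumption (hypothetical simplification): exactly one glyph per character.
inline unsigned long minFontsNeeded(unsigned long characters,
                                    unsigned long maxGlyphsPerFont = 65535UL) {
    assert(maxGlyphsPerFont > 0);
    // Ceiling division done in integer arithmetic.
    return (characters + maxGlyphsPerFont - 1) / maxGlyphsPerFont;
}
```

With the figures quoted above, minFontsNeeded(110000) gives 2, so even under this generous assumption a single TrueType font cannot cover all of Unicode.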

This post describes why RichEdit has plain-text controls and a bit about how they work. Implemented with an engine capable of very rich text, they have somewhat more character formatting richness than you might expect. The richness is handy for temporary formatting beyond what’s needed by IMEs.

To understand why RichEdit offers plain-text controls, let’s look back to the end of the last century. Microsoft Word 97 (along with Excel 97) introduced Unicode to the real world in 1997. Up to then, no major computer application was based on Unicode. Office 97 was developed on the Windows NT 4.0 operating system, which was based on Unicode, and on pre-release versions of Windows 95, which had some Unicode support. At the time, NT 4.0 was used primarily for program development. The Windows OS that ran Office 97 on personal computers was almost exclusively Windows 95, which didn’t have a Unicode plain-text edit control. Office 97 needed Unicode plain-text edit controls for various kinds of built-in and programmable dialog boxes and also for all the Outlook text boxes. Since the Office division owned RichEdit 2.0, which was based on Unicode, the decision was made to extend RichEdit to deliver plain-text as well as rich-text controls. In these plain-text controls, unless an East-Asian IME composition was active, the default CHARFORMAT2 was used for the entire control; character format runs weren’t even instantiated. As such, the controls were limited to displaying text with a single font. Also, the undeletable final carriage return that appears in a rich-text control doesn’t occur in a plain-text control, and there is only a single set of paragraph formatting properties.

Windows 2000 is based on NT 5 and offers Unicode plain-text controls. But by that time RichEdit was pretty thoroughly integrated into Office, and it more closely mimicked Word’s user-interface editing commands than the system edit control did. Office 2000 needed to support complex scripts such as Arabic, Hebrew, Thai, and Indic scripts, and Windows wanted to ship a single global RichEdit instead of a plethora of localized versions. Accordingly, RichEdit was generalized to support such scripts with the help of the then-new Uniscribe and LineServices components. The resulting version was named RichEdit 3.0 and it shipped with Windows 2000. With the addition of security fixes, it still ships today to preserve backward compatibility with older applications, although more recent applications have switched to later versions. To accommodate the complex scripts and multilingual text in general, the plain-text controls were allowed to have text runs with the different fonts and other properties needed to handle complex scripts. The EM_SETCHARFORMAT message was restricted to applying character format changes to the entire text. So typically if you typed the bold hot key Ctrl+B, you’d see the whole text bolded.

But unlike the EM_SETCHARFORMAT message, the ITextFont character formatting interface was not restricted to apply only to the entire text. Such a restriction would complicate the temporary formatting needed for IME composition and proofing tools. In any event, ITextFont continues to work essentially as it does in rich text, allowing the RichEdit client to assign multiple character formats including the ability to color text runs and give the runs attributes like bold and italic. Such per-text-run attributes can be handy, for example, when you want to highlight reserved words in a plain-text program.
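As a rough sketch of the reserved-word scenario, the helper below finds the character-position ranges of reserved words in a text buffer. It doesn’t touch RichEdit itself. The names CpRange and findReservedWords are hypothetical, the word rule is simplified to ASCII letters, digits, and underscore, and the buffer is assumed to mirror the control’s backing store (UTF-16 code units, so that the positions line up with RichEdit character positions).

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Half-open range [first, lim) of character positions, in UTF-16 code units.
// Hypothetical helper type for this sketch.
struct CpRange {
    long first;
    long lim;
};

// Simplified word-character test: ASCII letters, digits, and underscore only.
inline bool isWordChar(char16_t ch) {
    return (ch >= u'a' && ch <= u'z') || (ch >= u'A' && ch <= u'Z') ||
           (ch >= u'0' && ch <= u'9') || ch == u'_';
}

// Returns the ranges of all whole words in `text` that appear in `reserved`.
std::vector<CpRange> findReservedWords(const std::u16string& text,
                                       const std::set<std::u16string>& reserved) {
    std::vector<CpRange> hits;
    std::size_t i = 0;
    while (i < text.size()) {
        if (!isWordChar(text[i])) {
            ++i;  // skip separators
            continue;
        }
        const std::size_t start = i;
        while (i < text.size() && isWordChar(text[i])) {
            ++i;  // advance to the end of the current word
        }
        assert(i > start);  // a word always has at least one character
        if (reserved.count(text.substr(start, i - start)) != 0) {
            hits.push_back({static_cast<long>(start), static_cast<long>(i)});
        }
    }
    return hits;
}
```

On Windows, a client could then pass each range to ITextDocument::Range(), get the range’s font with ITextRange::GetFont(), and call ITextFont methods such as SetForeColor() or SetBold() on it. Matching whole words rather than substrings keeps an identifier like “iffy” from being highlighted because it starts with “if”.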

Another feature of RichEdit plain-text controls on the desktop (though not on the phone) is that you can embed OLE objects in them. This was a requirement of Outlook 97, which needed to embed OLE objects for resolved email aliases into the plain-text To…, Cc…, etc. controls. Later on in RichEdit 5.0, which shipped with Office XP, that need could have been satisfied with the RichEdit blob, a lightweight OLE object that runs on the phone as well since it doesn’t require the system OLE libraries. Blobs were added for OneNote and will be the subject of a future post. Blobs are not exposed in the msftedit.dll that ships with Windows, so they aren’t documented in MSDN. Starting with RichEdit 8, they are also used internally, specifically to handle images. Still another feature of RichEdit plain-text controls is that hyperlinks can be automatically recognized just as they can be in rich-text controls. So if you enter a URL into Outlook’s Subject text box (a RichEdit plain-text control), you’ll see it displayed in blue with a blue underline.

These text-run formatting properties can’t be persisted by the built-in RichEdit file I/O, since plain-text controls only enable plain text to be copied and pasted. As such, the formatting is temporary. RichEdit offers temporary formatting for rich-text controls (see ITextFont::Reset()), which can be used in plain-text controls as well. Note that the client can persist the plain-text character formatting if it wants to by reading what’s in the RichEdit backing store via the appropriate messages (EM_GETCHARFORMAT) and/or interfaces (ITextFont[2]). Such an approach for rich-text controls is used by WordPad to read and write docx and odt files, neither of which is supported by RichEdit natively. It’s also used by OneNote to export/import HTML and the OneNote file format.

You might think that such general per-text-run character formatting flexibility in an allegedly plain-text control is a bug that should be fixed. But since the flexibility has shipped now for over 14 years, it wouldn’t be wise to change it now. There may be applications out there that would break if more rigorous plain-text functionality were enforced.

You might also wonder why an application would use plain-text controls at all. Clearly rich-text controls offer a lot more capabilities. But sometimes you want to limit the functionality. For example, plain-text controls cannot have tables, math, or multiple paragraph formats, and they have limited copy/paste functionality. Password controls shouldn’t have such generality and RichEdit password controls are forced to be plain-text controls. Plain-text controls also use the Unicode BiDi Algorithm, which isn’t used by default in rich-text controls. And lastly, the undeletable final carriage return of rich-text controls has been known to surprise folks in simple editing scenarios.