2.1.3.2 Encoding HTML into RTF

The translation between HTML and RTF is not specified by this algorithm and is implementation-dependent. To emit RTF-encapsulated HTML, implementers MAY<9> do the following:

  • Produce a valid RTF document, as specified by [MSFT-RTF].

  • Emit a FROMHTML control word in the RTF header after the \rtf1 control word to indicate that encapsulated HTML is included in the RTF document.

  • Specify a default code page for text runs in RTF by using the \ansicpgN keyword, as specified in [MSFT-RTF].

  • Emit a font table to define fonts used in RTF.

  • Specify character set information for each font when necessary, as specified in [MSFT-RTF].

  • Produce a single empty HTMLTAG destination group with the Destination flag set to INBODY and the TagType flag set to P ({\*\htmltag64}) before any visible text in a generated RTF document (for example, immediately following the RTF header, as specified in [MSFT-RTF]).<10>

  • Use an HTMLTAG destination group to preserve any content of the original HTML document that does not have direct representation in RTF (such as HTML tags, text with HTML character references, HTML comments, or insignificant whitespace).

  • Produce an HTMLTagParameter HTML fragment in any HTMLTAG destination group (except the {\*\htmltag64} empty destination group).<11> Any text inside an HTMLTAG destination group can be encoded by a default RTF code page, as specified in [MSFT-RTF]. Any text that cannot be represented by using a default RTF code page without data loss can be encoded by using \uN control words.

  • Use HTMLRTF control words to suppress de-encapsulation of any RTF content that is not part of the original HTML content. In particular, any emitted RTF control words that change character-formatting properties, such as \f, \fs, \b, or \i, can<12> be explicitly suppressed by the HTMLRTF control word. Any corresponding original HTML content can be encapsulated in HTMLTAG destination groups, as specified in section 2.1.3.1.4.

  • For each text run that is outside of an HTMLTAG destination group and is not suppressed by an HTMLRTF control word, produce text in the code page that corresponds to the current font, or in a default RTF code page if no current font is selected for that text run. Any characters that cannot be represented in the selected code page can be encoded by using the \uN control word.
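
The steps above can be illustrated with a minimal Python sketch. It is not a conforming encoder and is not taken from this specification. The names encapsulate_html, rtf_escape, and tag_param are hypothetical. The sketch assumes code page 1252, a single Arial font, and single-line input with no line breaks or tabs. It does not compute HTMLTagParameter flag values as defined in section 2.1.3.1.4; the tag_param value 0 is only a placeholder.

```python
import re

# Hypothetical helper pattern: HTML tags and character references are treated
# as content with no direct RTF representation. The single capturing group
# makes re.split() alternate between plain text and preserved tokens.
_PRESERVED = re.compile(r"(<[^>]*>|&#?\w+;)")


def rtf_escape(text, codec="cp1252"):
    """Escape text for RTF, assuming `codec` matches the \\ansicpgN value."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)              # RTF special characters
        elif ord(ch) < 0x80:
            out.append(ch)                     # plain ASCII passes through
        else:
            try:
                raw = ch.encode(codec)
            except UnicodeEncodeError:
                raw = b""
            if len(raw) == 1:
                out.append("\\'%02x" % raw[0])  # representable in the code page
            else:
                # Not representable: emit \uN per UTF-16 code unit, as a
                # signed 16-bit value, followed by a one-character fallback.
                units = ch.encode("utf-16-le")
                for i in range(0, len(units), 2):
                    n = int.from_bytes(units[i:i + 2], "little")
                    if n > 32767:
                        n -= 65536
                    out.append("\\u%d?" % n)
    return "".join(out)


def encapsulate_html(html, tag_param=0):
    """Wrap an HTML string in RTF. tag_param is a placeholder value."""
    parts = [
        "{\\rtf1\\ansi\\ansicpg1252\\fromhtml1 \\deff0",  # header with FROMHTML
        "{\\fonttbl{\\f0\\fswiss\\fcharset0 Arial;}}",    # font table with charset
        "{\\*\\htmltag64}",                               # empty HTMLTAG group
        "\\htmlrtf \\f0\\fs24 \\htmlrtf0 ",               # suppressed formatting
    ]
    for i, piece in enumerate(_PRESERVED.split(html)):
        if not piece:
            continue
        if i % 2:   # odd index: tag or character reference, preserve verbatim
            parts.append("{\\*\\htmltag%d %s}" % (tag_param, rtf_escape(piece)))
        else:       # even index: visible text
            parts.append(rtf_escape(piece))
    parts.append("}")
    return "".join(parts)


print(encapsulate_html("<p>Caf\u00e9</p>"))
```

In this sketch a character reference such as &amp; is preserved only inside an HTMLTAG destination group, so an RTF-only reader would not display the corresponding character. A fuller encoder might also emit an RTF-visible equivalent suppressed by HTMLRTF control words.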