High-Quality Editing and Display of Mathematical Text in Office 2007

09/13/2006

This post is a summary of material I've given in recent talks on math in Office such as this one. In the talks, I describe and demonstrate how Unicode's rich mathematical character set combined with OpenType font technology, TeX's mathematical typography principles, and enhanced autocorrection can be used to produce high-quality, streamlined technical text processing in Office 2007. The approach is currently implemented in Word 2007 and in Office 2007's RichEdit editor, which is also used in the Microsoft Math Calculator. The facility is ready to be added to a variety of other Office applications, but they'll have to wait until their next versions.

The project was considerably harder than any of us imagined it would be. Mathematical typography is very intricate and varied, and somehow along the way, our typography quality bar kept getting pushed higher and higher. In addition, making it all work in a rich international text environment introduced many complications one might not expect, such as interaction with complex scripts, objects like hyperlinks and ruby, and non-ASCII keyboards. On the other hand, the Office environment offers many advantages.

This post starts by describing eight math infrastructures that proved to be invaluable. Then it goes into some detail as to how editing and display are accomplished.

Eight Math Infrastructures

Infrastructures outside and inside of Microsoft have emerged to enable major advances in the editing and display of mathematical formulae. While TeX has been widely available since 1986, most of the other infrastructures have become available only recently.

[La]TeX: current tech-doc standards
Unicode 4.0: includes ~2000 math symbols
MathML 2.0: math K – 12 and beyond
OpenType font technology: special math tables
New math font (Cambria Math)
Math layout handler
Shared math input components
MS Office environment, autocorrect

Let's consider these infrastructures one at a time.

[La]TeX

TeX (see The TeXbook, by Donald Knuth) is a widely used document preparation program that provides both fundamental examples and many specifications for our new math editing and display facility. TeX and its enhanced versions, such as LaTeX, are the dominant technical document preparation programs today, and are used to typeset technical books and journals throughout the world. They're also used widely on the web to display technical documents, either in TeX or PDF form. Experts and users alike agree that TeX's typography is excellent and sufficient to meet their needs. The program allows the user or copy editor to tweak settings to match end preferences. TeX's input method can be used with any plain-text editor. While easy to use in principle, the method becomes awkward for complicated mathematical formulae. In addition, one of TeX's strengths, the easy definition of macros, is also a problem when it comes to interchange.

The TeXbook is a user manual that includes a detailed specification for mathematical typography. We have used many of its choices and methodology in creating our solutions, which are appropriately enhanced with the use of OpenType tables and some additional constructs. Although the TeX source code is available, it cannot be used directly for several reasons. First, the code is like a web rather than being hierarchical, and it uses many global variables. This makes it cumbersome to employ in the instance-oriented contexts used at Microsoft. Complicating this is that TeX is a complete document imaging system, not one limited to mathematics. As such, many aspects of the program that are used for mathematics are also used for other kinds of layout like headers, footers, figures, and footnotes. Extricating the mathematical algorithms from this web of code would be significantly harder than recreating the desired display quality using our own methodologies and the specifications given in The TeXbook, especially in Appendix G. Furthermore, we want to take advantage of our OpenType math fonts to obtain better positioning of subscripts, superscripts, and other symbols than is possible by default using TeX. Another complication is that Office is an international environment and our math facility needs to be compatible with all languages that we support, potentially simultaneously. We also want to take advantage of contemporary methods to optimize screen displays.

Unicode Support for Math

Unicode is a character encoding system that Knuth would have loved to have had when he and his students developed TeX, and that the legions of workers embellishing LaTeX would love to have. In fact there are a couple of Unicode TeXs out there, Omega and XeTeX, but neither one is in widespread use yet. Unicode 4.0 contains all standard mathematical characters used in print today. This includes about 2000 characters plus all the combinations that can be made with combining marks. As such, Unicode provides an excellent foundation for technical documents, significantly better than the character sets used in TeX itself. In particular, all of TeX's characters are included in Unicode or in glyph variants thereof. Summarizing the Unicode math character set, we have

340 math chars exist in ASCII, U+2200 block, arrows, combining marks
1016 math alphanumeric characters are in Unicode Plane 1 or Letterlike Symbols
591 new math symbols and operators are on BMP
One math variant selector
One new combining character (reverse solidus)

Many of the new math characters were requested by STIX (Scientific and Technical Information Exchange). For displays of all characters in Unicode 4.0, see the charts. For information about the Unicode math characters, see Unicode support for mathematics.

Western mathematical notation uses a basic set of mathematical alphanumeric characters which consists of:

set of basic Latin digits (0 - 9) (U+0030 – U+0039)
set of basic upper- and lowercase Latin letters (a - z, A - Z)
uppercase Greek letters Α - Ω (U+0391 – U+03A9), plus the nabla ∇ (U+2207) and the variant of theta Θ given by U+03F4
lowercase Greek letters α - ω (U+03B1 – U+03C9), plus the partial differential sign ∂ (U+2202) and the six glyph variants of ε, θ, κ, φ, ρ, and π, given by U+03F5, U+03D1, U+03F0, U+03D5, U+03F1, and U+03D6.

Only unaccented forms of the letters are used for mathematical notation, because general accents such as the acute accent would interfere with common mathematical diacritics. Examples of common mathematical diacritics that can interfere with general accents are the circumflex, macron, or the single or double dot above, the latter two of which are used in physics to denote derivatives with respect to the time variable. Mathematical symbols with diacritics are always represented by combining character sequences, except as required by normalization.
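To make the "except as required by normalization" caveat concrete, here is a minimal Python sketch using the standard unicodedata module. It is only an illustration of the Unicode behavior, not code from Office. An ASCII x followed by the combining dot above composes to a single precomposed letter under NFC, while a math italic x has no precomposed dotted form and stays a two-character combining sequence.

```python
import unicodedata

DOT_ABOVE = "\u0307"          # COMBINING DOT ABOVE, as used for time derivatives
MATH_ITALIC_X = "\U0001D465"  # MATHEMATICAL ITALIC SMALL X


def with_dot(base: str) -> str:
    """Return the base character followed by a combining dot above."""
    return base + DOT_ABOVE


def code_points(text: str) -> list:
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# NFC composes x + dot into the precomposed letter U+1E8B.
ascii_nfc = unicodedata.normalize("NFC", with_dot("x"))
# There is no precomposed "math italic x with dot", so the sequence survives.
math_nfc = unicodedata.normalize("NFC", with_dot(MATH_ITALIC_X))

print(code_points(ascii_nfc))
print(code_points(math_nfc))
```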

In addition to this basic set, mathematical notation also uses the four Hebrew-derived characters (U+2135 – U+2138). Occasional uses of other alphabetic and numeric characters are known. Examples include U+0428 Cyrillic capital letter sha, U+306E hiragana letter no, and Eastern Arabic-Indic digits (U+06F0 – U+06F9). However, these characters are used in only the basic form.

Generally the math alphanumerics substantially reduce the verbosity of markup, although one can construct cases that aren't so verbose. But a markup representation is poor for several reasons: 1) it complicates a search for a bold italic a, since the search engine needs to understand the bold and italic tags or attributes and dissect the tag contents, 2) it doesn't tag the characters individually as math identifiers, which is a MathML requirement, and 3) it introduces complexity into the tag model by introducing multiple variable identifier tags. The last of these disadvantages can be overcome by representing the nature of the variables with attributes, e.g., <mi style=bolditalic>, but this approach is quite verbose for items as small as math characters. Admittedly this approach is necessary to handle (quite rare) alphanumeric math symbols that aren't included in the math alphanumeric block. Searching for such symbols requires a sophisticated attribute-aware search engine since simple plain-text search engines would yield many undesired search hits.
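The search argument can be illustrated with a short Python sketch. It maps ASCII letters onto the Mathematical Bold Italic range in Plane 1 and then finds a bold italic a with an ordinary plain-text search, without interpreting any markup. The helper name to_math_bold_italic is made up for this example.

```python
BOLD_ITALIC_CAPITAL_A = 0x1D468  # MATHEMATICAL BOLD ITALIC CAPITAL A
BOLD_ITALIC_SMALL_A = 0x1D482    # MATHEMATICAL BOLD ITALIC SMALL A


def to_math_bold_italic(text: str) -> str:
    """Map ASCII letters to math bold italic; leave everything else alone."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(BOLD_ITALIC_CAPITAL_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(BOLD_ITALIC_SMALL_A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


bold_italic_a = to_math_bold_italic("a")

# Ordinary text containing several plain a's, then one bold italic a.
text = "area = " + bold_italic_a + "b"

# A plain substring search finds the math character directly and
# skips the ASCII a's in "area".
position = text.find(bold_italic_a)
print(position)
```

The bold italic range is contiguous, so simple offset arithmetic works here. Some other math alphabets have holes (for example, the italic h lives in the Letterlike Symbols block), so a general converter would need a lookup table.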

MathML

The World Wide Web Consortium W3C recognized the need for a format for representing scientific and technical information. In fact, the HTML 3.0 working draft (1994) included a proposal for HTML Math from Dave Raggett. In March, 1997, the W3C HTML Math working group was formally constituted.

The first product of the W3C HTML Math working group was the Mathematical Markup Language (MathML). MathML 1.0 was released as a W3C Recommendation in April, 1998. As the first W3C endorsed XML application, MathML is a low-level format for describing mathematics. MathML provides a much needed foundation for the inclusion of mathematical expressions in Web pages and as a common encoding for scientific processors. Indeed, MathML facilitates the use and reuse of scientific content.

The MathML 2.0 specification also provides a wealth of information about putting math on computers. The relationship between MathML and the Office XML for math (OMML) is discussed in posts by Brian Jones (https://blogs.msdn.com/brian_jones).

New Math Font

New Unicode 4.0 Math fonts have been developed both at Microsoft as well as by the STIX committee, which played a key role in generalizing Unicode to include all standard math characters. Our new math facilities have been developed along with the Cambria Math font, influencing one another to obtain ideal results. The Cambria Math effort was managed by Geraldine Wade and Michael Duggan. Cambria Math is part of a TrueType collection that also includes Cambria, Cambria Italic, Cambria Bold, and Cambria Bold Italic. High-quality low-resolution screen display is very important for the way people work with documents in the Internet age: most documents are perused on screen and only printed for purposes of detailed examination. This is a major advantage of our math system. The Cambria typeface was designed by Jelle Bosma and extended for math by Ross Mills and Andrei Burago in collaboration with the ClearType and math-layout groups. It contains extensive math tables, glyph variants and much of the Unicode math set. It is designed with ClearType and excellent screen readability in mind. It enables the best screen-resolution display of math available today.

Math Font Tables

Specialized math tables have been created to control glyph placements. Refinements include positioning subscripts/superscripts horizontally using cut-ins and italic corrections. Many math constants are defined to handle displacements such as axis height, fraction rule thickness, etc. The math tables are formalized as OpenType tables, although they are not yet part of the OpenType standard. The new font tables enable one to automatically position subscripts and superscripts horizontally better than untweaked TeX as well as having richer glyph choices for operators like the integral sign, square root, and growable brackets. The tables include parameters such as the em-size-dependent sub/superscript values

LONG lSubscriptShiftDown;
LONG lSubscriptTopMax;
LONG lSubscriptBottomDropMin;
LONG lSuperscriptShiftUp;
LONG lSuperscriptShiftUpCramped;
LONG lSuperscriptBottomMin;
LONG lSuperscriptTopRiseMin;
LONG lSubSuperscriptMinGap;
LONG lSuperscriptBottomMaxWithSubscript;
LONG lSpaceAfterScript;

In addition math characters have four cut-in values, one for each corner, allowing sub/superscripts to be kerned with their bases.

Glyph Variants

Cambria Math contains full sets of glyph variants that have heavier weighting so that when scaled down to the script and scriptscript levels the stem widths match those of the text level glyphs. The prime (U+2032) and multiple primes need to be superscripted and scaled down accordingly. The dotless i and j are automatically used in the bases of accent objects.

Brackets, braces, parentheses and other growable characters have a number of larger glyph variants as well as arbitrarily large size created using glyph assemblies. When the assemblies are displayed, the pieces are clipped to prevent overlap, since overlaps create ClearType artifacts.

According to a document setting, the italic open-face characters 0x2145 - 0x2149 (differential D, d, and e, i, j) can be displayed as themselves (useful for patent applications) or with the corresponding math italic or corresponding ASCII letters. Serif italic glyphs are used for these in most math publications, but serif upright glyphs are used in some European math publications and math calculation engines. The use of the differential d (U+2146) automatically introduces a small space between it and the preceding character if that character is alphabetic.

Right-to-left math requires mirroring the images of parentheses, integrals, square roots, arrows, etc. Many such mirror images can be obtained by using corresponding Unicode characters. For example the mirror image of a left parenthesis is a right parenthesis and vice versa. But Unicode doesn't have many characters that are mirror images of other characters, such as integral signs and square roots. Furthermore it seems that a glyph variant approach for these characters makes more sense than adding characters to serve as the mirror images. Other approaches include using world transforms and mirrored bitmaps. But these don't solve the problem that the right-to-left character desired sometimes isn't a perfect mirror image, e.g., the contour integral.

The present version of our software doesn't handle true right-to-left math. We handle math zones in right-to-left paragraphs as left-to-right objects, with all characters in the math zone being strong left-to-right except those defined by Unicode to be strong right-to-left.

Math Spacing

Rigorous math spacing is essential for high quality mathematical typography. See Chap. 18 of The TeXbook for a table of values. The MathML 2.0 specification also has math spacing information. In our formatting, operators have math spacing given by extended TeX spacing rules. We use a function object to obtain correct spacing between a mathematical function and its neighbors, and between the function name and the argument list. We use an n-aryand object to obtain correct spacing between an n-ary operator and its n-aryand, e.g., integrand or summand. The approach automates much of the need for TeX "tweaking" with positive and negative thin spaces. We also handle context-dependent spacing of operators like + - , : . More discussion is given in Section 3.15 of the linear format paper.
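For readers who haven't seen TeX's spacing rules, the following Python sketch shows the flavor of class-based spacing. It uses a small excerpt of the inter-atom table from Chap. 18 of The TeXbook with plain TeX's natural widths (thin = 3mu, medium = 4mu, thick = 5mu, where 18mu = 1em), and it applies TeX's rule that a binary operator with no left or right operand is treated as an ordinary symbol. The character classification and function names are assumptions made for the example, pairs missing from the excerpt default to zero, and the extended rules used in Office are not reproduced here.

```python
THIN, MEDIUM, THICK = 3, 4, 5  # natural widths in mu (1/18 em), plain TeX defaults

# Excerpt of TeX's inter-atom spacing table; unlisted pairs get 0 in this sketch.
SPACING = {
    ("ord", "bin"): MEDIUM, ("bin", "ord"): MEDIUM,
    ("close", "bin"): MEDIUM, ("bin", "open"): MEDIUM,
    ("ord", "rel"): THICK, ("rel", "ord"): THICK,
    ("close", "rel"): THICK, ("rel", "open"): THICK,
    ("punct", "ord"): THIN, ("punct", "open"): THIN,
}


def classify(ch: str) -> str:
    """Very rough atom classification, for illustration only."""
    if ch in "+-":
        return "bin"
    if ch in "=<>":
        return "rel"
    if ch == "(":
        return "open"
    if ch == ")":
        return "close"
    if ch in ",;":
        return "punct"
    return "ord"


def atom_classes(expr: str) -> list:
    """Classify each character, demoting operand-less binary operators to ord."""
    classes = [classify(ch) for ch in expr if not ch.isspace()]
    for i, cls in enumerate(classes):
        if cls != "bin":
            continue
        prev = classes[i - 1] if i > 0 else None
        nxt = classes[i + 1] if i + 1 < len(classes) else None
        if prev in (None, "bin", "rel", "open", "punct") or nxt in (None, "rel", "close", "punct"):
            classes[i] = "ord"  # e.g. the unary minus in a=-b
    return classes


def gaps(expr: str) -> list:
    """Return the space in mu inserted between each adjacent pair of atoms."""
    classes = atom_classes(expr)
    return [SPACING.get(pair, 0) for pair in zip(classes, classes[1:])]


print(gaps("a+b=c"))
print(gaps("a=-b"))
```

In the second call the minus sign follows a relation, so it is spaced as a unary operator. This is the kind of context-dependent behavior that otherwise has to be tweaked by hand.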

Math Input Methods

Mathematical expressions are always entered into math zones. These zones are regions of text like those in between $'s or $$'s in TeX, but are handled by a character format run attribute in our approach. Input methods include

Linear format input and manual buildup
Formula autobuildup (FAB)
Math ribbons
Recognition of handwritten formulae
Hex code input
WYSIWYG editing
Hybrid editing (combination of WYSIWYG and FAB)

A handy hex-to-Unicode entry method works with WordPad 2000/XP, Office 2000/XP edit boxes, RichEdit controls in general, and in Microsoft Word starting with Word 2002. Basically you type a character's hexadecimal code (in ASCII), making corrections as need be, and then type Alt+x. Presto! The hexadecimal code is replaced by the corresponding Unicode character. The Alt+x can be a toggle. That is, type it once to convert the hex code to a character and type it again to convert the character back to a hex code. If the hex code is preceded by one or more hexadecimal digits, you need to "select" the code so that the preceding hexadecimal characters aren't included in the code. The code can range up to the value 0x10FFFF, which is the highest character in the 17 planes of Unicode. The only problem with this approach is that some programs use Alt+x for something else (like quit) or the keyboard doesn't have direct access to ASCII alphabetics.
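The toggle behavior can be sketched in a few lines of Python. This is an illustration of the idea only, not RichEdit's actual algorithm. The function name alt_x, the two-digit minimum, and the choice to take the longest valid trailing run of up to six hex digits are all assumptions of the sketch.

```python
import string

MAX_CODE_POINT = 0x10FFFF  # highest code point in the 17 planes of Unicode


def alt_x(text: str) -> str:
    """Toggle the end of text between a hex code and the character it names."""
    # Try the longest trailing run of hex digits first (at most 6, at least 2).
    for n in range(6, 1, -1):
        tail = text[-n:]
        if len(tail) == n and all(ch in string.hexdigits for ch in tail):
            code = int(tail, 16)
            # Reject values beyond Unicode and lone surrogate code points.
            if code <= MAX_CODE_POINT and not 0xD800 <= code <= 0xDFFF:
                return text[:-n] + chr(code)
    # Otherwise convert the last character back into its hex code.
    if text:
        return text[:-1] + f"{ord(text[-1]):04X}"
    return text


once = alt_x("x2211")   # hex code becomes the summation sign
twice = alt_x(once)     # and toggles back to the hex code
print(once, twice)
```

The sketch also reproduces the problem described above. In alt_x("a2211") the preceding a is itself a hexadecimal digit, so the whole run A2211 is taken as the code. That is why the real feature lets you select just the digits you mean.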

Word's autocorrect has been available for years and provided the inspiration for formula autobuildup. With autocorrect, type \delta and get δ; type \Delta and get Δ. Define \quadratic to be x = (-b ± √(b^2 - 4ac))/2a, then type \quadratic<space> to insert the built-up form of the quadratic equation solution. You can add your own math autocorrect entries using the Equation Tools/Design/Tools/Math Autocorrect… dialog. Type what you want replaced in the "Replace:" dialog and what you want it replaced with in the "With:" dialog. You can put mathematical expressions in linear form into the "With:" dialog. Then when the replace text is encountered, it will be replaced by a built-up form of the replacement text. You can type the Unicode value in the Replace: dialog and hit Alt+x, since that dialog is a RichEdit dialog. The default Office math autocorrect table has the complete TeX math set along with many characters defined in AmSTeX and other places.
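As a rough model of how such a table behaves while you type, here is a small Python sketch. The table holds only a handful of entries, and the function names are invented for the example. The rule that a terminating space is consumed while terminating punctuation is kept is a simplification of the behavior described in this post.

```python
import re

# A tiny stand-in for the math autocorrect table.
AUTOCORRECT = {
    "\\delta": "\u03b4",  # δ
    "\\Delta": "\u0394",  # Δ
    "\\int": "\u222b",    # ∫
    "\\pm": "\u00b1",     # ±
    "\\sqrt": "\u221a",   # √
}

# A backslash followed by letters, sitting at the very end of the buffer.
CONTROL_WORD = re.compile(r"\\[A-Za-z]+$")


def type_key(buffer: str, key: str) -> str:
    """Return the buffer after one keystroke, applying autocorrect if triggered."""
    if key.isalnum():
        return buffer + key  # letters and digits never trigger a replacement
    match = CONTROL_WORD.search(buffer)
    if match and match.group() in AUTOCORRECT:
        replaced = buffer[:match.start()] + AUTOCORRECT[match.group()]
        # A space that terminates an entry is deleted; punctuation is kept.
        return replaced if key == " " else replaced + key
    return buffer + key


def type_text(keys: str) -> str:
    """Feed a string through type_key one keystroke at a time."""
    buffer = ""
    for key in keys:
        buffer = type_key(buffer, key)
    return buffer


print(type_text("\\Delta+\\delta "))
```

Unknown names such as \foo are left untouched, and a space that doesn't terminate an entry is inserted as an ordinary space.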

Linear Format Math

Keyboard entry of equations is done using the linear format for math defined in the linear format paper. In this format, a simple operand is a span of alphanumeric characters. For example, a simple numerator or denominator is terminated by any nonalphanumeric character, so abc/d gives a built-up fraction with the numerator abc and denominator of d. More complicated operands use parentheses ( ), brackets [ ], or { }. The outermost parentheses in fractions aren't displayed in built-up form. In practice, this approach leads to a linear text that is significantly easier to read than TeX's, e.g., {a + c \over d} , since in many cases, outermost parentheses are not needed, while TeX requires { }'s except for single letters. To force the display of an outermost parenthesis set, one encloses the set, in turn, within parentheses, which then become the outermost set.
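The operand rules for fractions can be sketched as follows. This Python example handles only the simple cases just described: an operand is either a span of alphanumeric characters or a parenthesized group, and one outermost pair of parentheses is dropped on build-up. It emits LaTeX \frac purely as a convenient way to show the built-up result. The function names are made up, nested fractions inside parentheses are not processed, and the real linear format has many more operators and precedence rules.

```python
def read_operand(s: str, i: int):
    """Return (operand, next_index) for the operand starting at s[i]."""
    if i < len(s) and s[i] == "(":
        depth = 0
        for j in range(i, len(s)):
            if s[j] == "(":
                depth += 1
            elif s[j] == ")":
                depth -= 1
                if depth == 0:
                    return s[i:j + 1], j + 1
        raise ValueError("unbalanced parentheses")
    j = i
    while j < len(s) and s[j].isalnum():
        j += 1
    return s[i:j], j  # empty operand if s[i] is not alphanumeric


def strip_outer_parens(operand: str) -> str:
    """Drop one outermost pair of parentheses, if present."""
    if operand.startswith("(") and operand.endswith(")"):
        return operand[1:-1]
    return operand


def build_up(linear: str) -> str:
    """Rewrite each numerator/denominator pair as a LaTeX fraction."""
    out = []
    i = 0
    while i < len(linear):
        operand, j = read_operand(linear, i)
        if operand and j < len(linear) and linear[j] == "/":
            denominator, k = read_operand(linear, j + 1)
            if denominator:
                out.append("\\frac{%s}{%s}" % (
                    strip_outer_parens(operand), strip_outer_parens(denominator)))
                i = k
                continue
        if operand:
            out.append(operand)
            i = j
        else:
            out.append(linear[i])  # operator or other character, copied as is
            i += 1
    return "".join(out)


print(build_up("abc/d"))
print(build_up("(a+c)/d"))
print(build_up("((a+c))/d"))
```

The third call shows how doubling the parentheses forces the inner pair to be displayed.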

A really neat feature of this notation is that the linear text is, in fact, a legitimate mathematical notation in its own right, so it's relatively easy to read. In his Introduction to Tractatus Logico-Philosophicus by Ludwig Wittgenstein (Routledge and Kegan Paul, London 1922), Bertrand Russell said "a good notation has a subtlety and suggestiveness which at times make it seem almost like a live teacher…and a perfect notation would be a substitute for thought." Inspired by this comment over the years, I've attempted to improve computer-oriented notations for math (a future blog will describe the evolution of the linear format from the early versions back in the late 1970s up to the present version).

Nature isn't so kind with subscripts and superscripts, but they're still quite readable. Specifically as in TeX, we introduce a subscript by a subscript operator _ which RichEdit displays as a subscripted down arrow (Word just displays the underscore). Similarly we introduce a superscript with a superscript operator ^, which RichEdit displays as a superscripted up arrow. Subscripts can be compound such as a subscripted subscript, which works using right-to-left associativity. This associativity can be overruled using parentheses as described for fractions.

If you use Unicode's built-in subscripts and superscripts, they should be rendered to look the same as if they had been represented by the corresponding general subscript and superscript markup. The numeric subscripts and superscripts are often used and can streamline the look of technical plain text.
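As a small illustration of how the numeric subscripts and superscripts streamline plain text, this Python sketch rewrites ^ and _ followed by ASCII digits into the corresponding Unicode characters. The function name streamline is invented for the example, and only digits are handled.

```python
import re

SUPERSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079")
SUBSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089")


def streamline(linear: str) -> str:
    """Replace ^digits and _digits with Unicode superscript/subscript digits."""
    linear = re.sub(r"\^([0-9]+)",
                    lambda m: m.group(1).translate(SUPERSCRIPT_DIGITS), linear)
    linear = re.sub(r"_([0-9]+)",
                    lambda m: m.group(1).translate(SUBSCRIPT_DIGITS), linear)
    return linear


print(streamline("x^2 + a_10"))
```

Note that the superscripts 1, 2, and 3 sit in the Latin-1 range (U+00B9, U+00B2, U+00B3) rather than next to the others at U+2070 and U+2074 to U+2079, which is why the sketch uses an explicit table instead of offset arithmetic.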

Formula Autobuildup

You can select a linear format expression and hit a hot key or button to build it up. You can also select a built-up expression and build it back down for editing or plain-text export. But with formula autobuildup, you type in formulas in linear format in a math zone. When you type a character that renders an expression syntactically unambiguous, the expression is built up right in front of your eyes. You can edit expressions in built-up form or in linear form.

For an integral, type \int (which autocorrects to ∫) optionally followed by subscript and superscript for limits, which builds up automatically, leaving the insertion point in the integrand. The deal is that you have to type in the integrand; the programs need to know what it is. Mathematical calculation engines that you might use in conjunction with the programs also need to know what the integrand is. Similarly if you type in the three letters "sin" followed by space, the insertion point is put inside of the argument for the sine function. The program needs to know what the argument is in order to space the expression correctly. And this knowledge is useful for interoperating with math engines.

Many technically oriented people have TeX input "in their fingers". In addition, this kind of input is easy to describe and appears in many readily available books. A problem is that complicated formulae are cumbersome to work with. However this problem goes away in our environment thanks to autocorrect in combination with formula autobuildup. Essentially the user sees the formulae automatically build up on the screen as s/he types them in. This contrasts remarkably with the traditional TeX scenario, in which the user always edits the full original text in TeX's linear format. To get an idea of how simple the new approach is, consider the following. In TeX a user types \delta to see δ in print. With autocorrect and the right autocorrect data file (even Word 97 autocorrect) as soon as a blank or punctuation symbol is typed after the letter a in \delta, the Greek letter δ appears on the screen. No need to wait for a printout or preview.

Similarly with the formula autobuildup facility, one can type in integrals with \int, fractions, square roots, etc., and see them displayed in built-up form on the screen instead of the relatively complicated way they appear when typed in. You never have to search the original plain text input to find where to edit. You just point and click at the right place in a formula and edit as desired. Typically such WYSIWYG editing is preferred once a formula is built up, and you can use formula autobuildup wherever you want to, including inside built-up formulas. You can also toggle back to the linear format if that makes things easier, e.g., in converting a fraction to something else.

The ASCII space is rarely needed inside math expressions, since math spacing is automatic. Meanwhile the space bar is the easiest key on the keyboard to hit. So we use the space for a variety of purposes. We use it to terminate autocorrect entries and to terminate expressions. When so used, it is deleted. It is used as a command to build up math objects, to define spacings for , . and : and to force a unary operator to display with binary spacing. A space builds up one subexpression; other operators build up as many as they can. If a space doesn't build up anything, it's simply inserted as a space. Some programs delete such spaces or even beep at the user, but the feedback we've gotten says these approaches aren't very user friendly even if they do help prevent the user from messing up the typography.

Collaborators

This project is the result of many people's efforts:

Math layout handlers: Andrei Burago, Sergey Genkin, Victor Kozyrev, Igor Zverev, Alexander Vaschillo, Anton Sukhanov, Eliyezer Kohen (initial design), Vivek Garg
Word: Jennifer Michelstein, Ethan Bernstein, Said Abou-Hallawa, Ali Taleghani, Jason Rajtar, Yi Zhang
Cambria Math font: Jelle Bosma, Ross Mills, John Hudson, Geraldine Wade, Mike Duggan, Greg Hitchcock, Andrei Burago, Vivek Garg
Math font program library and editor: Sergey Malkin
RichEdit: Isao Yamauchi, Sasha Gil, Mikhail Baranovsky, Hon-Wah Chan, José Oglesby, Greg Heino, Yuriko Rosnow, Parag Palekar, Zane Mumford
Hand writing: Microsoft Research (Asia, Redmond, Belgrade)
Encarta math calculator: Jinsong Yu, Ben Kunz, Luke Kelly

Many thanks to Barbara Beeton, Asmus Freytag, Ron Whitney, Richard Lawrence, and Donald Knuth for very helpful discussions. We're also indebted to users for valuable feedback.

Conclusions

Eight infrastructures allow us to do math display and editing better than ever before. Our high quality math handlers and math font enable typography competitive with TeX. In particular, we achieve the best screen-resolution display of mathematics. Streamlined input methods such as Formula Autobuildup facilitate math entry. The facility is incorporated into Word 2007 and the Microsoft Math Calculator. We hope to add it to PowerPoint, OneNote, IE, …, and maybe even future compilers.