Extracting OMML from Word 2003 Math Zone Images

The science and technology publishing industry uses Word 2003 to process a significant portion of manuscript submissions. The industry hasn’t yet been able to accept manuscripts in which the mathematical text (math zones) is created using Word 2007’s new math facility, since the infrastructure currently only works with math zones encoded in the Design Science MathType format. To help generalize the infrastructure, the present post describes how the Word 2007 math zone content can be extracted from Word doc files converted from Word 2007 docx format. This post is pretty technical, so most people probably won’t read any further :-)


More specifically, this post shows how one can extract the Office Math Markup Language (OMML) from math-zone images stored in doc files that have been converted for use in Word 2003 and earlier versions of Word. The main reason for having this information in the doc file is so that if Word 2003 is used to edit the file, the math zones remain alive and intact when the file is reopened in Word 2007. But the information is also useful if you want to extract the OMML using Word 2003, as we see here.


The basic idea is to read the doc file into Word 2003 and save it in the RTF format. The image data in this RTF contains the OMML in the new wzEquationXML shape property value. Shapes are described in the section on Word 97 Through Word 2007 RTF for Drawing Objects (Shapes) in the RTF Specification 1.9.1.


For convenience, here is a quick summary of how the image RTF works. Images are represented by RTF of the form


{\*\shppict {\pict …}}{\nonshppict {\pict …}}


The full information is available in the {\*\shppict {\pict …}} group, which is where the OMML is stored. Readers that don’t understand the \shppict group skip it and use the {\nonshppict {\pict …}} group instead, which represents the image in metafile format. The {\*\shppict {\pict …}} group contains the shape properties for the image in a {\*\picprop …} group, followed by some image control words and the data for the image itself in the png format.
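To make the png part concrete, here is a minimal Python sketch that decodes the first bytes of the image data and reads the pixel size from the PNG header. It assumes the image data appears in the RTF as hexadecimal text, as it does in the example in this post. The function name png_size_from_hex is made up for illustration, and the hex string is the start of the image data from that example.

```python
import struct

# The 8-byte signature that starts every PNG file.
PNG_SIGNATURE = bytes.fromhex("89504e470d0a1a0a")

def png_size_from_hex(hex_data):
    """Decode hex image data and return (width, height) from the PNG IHDR chunk."""
    # Drop any whitespace or line breaks that wrap the hex text.
    data = bytes.fromhex("".join(hex_data.split()))
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not PNG data")
    # After the signature: 4-byte chunk length, 4-byte chunk type, then the IHDR fields.
    _length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR":
        raise ValueError("first chunk is not IHDR")
    width, height = struct.unpack(">II", data[16:24])
    return width, height

# First 24 bytes of the image data in this post's example RTF.
hex_prefix = "89504e470d0a1a0a0000000d494844520000000f00000014"
print(png_size_from_hex(hex_prefix))  # (15, 20)
```

The 15 by 20 pixel result appears consistent with the \picwgoal225 and \pichgoal300 values (in twips) in the example's \nonshppict group at 96 pixels per inch.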


Each shape property is represented by RTF of the form


                  {\sp{\sn PropertyName}{\sv PropertyValueInfo}}
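As a rough illustration, groups of the form {\sp{\sn Name}{\sv Value}} can be collected with a regular expression. This is only a sketch: it assumes the property values contain no nested or escaped braces, and the names SP_GROUP and parse_shape_properties are hypothetical. A complete RTF reader would need a real group parser.

```python
import re

# One {\sp{\sn Name}{\sv Value}} group whose name and value contain no braces.
SP_GROUP = re.compile(r"\{\\sp\s*\{\\sn\s+([^{}]*)\}\s*\{\\sv\s+([^{}]*)\}\s*\}")

def parse_shape_properties(rtf):
    """Return a dict mapping shape property names to their raw string values."""
    return {name.strip(): value.strip() for name, value in SP_GROUP.findall(rtf)}

# A few simple properties of the kind found in a \picprop group.
sample = (
    r"{\sp{\sn shapeType}{\sv 75}}"
    r"{\sp{\sn fFlipH}{\sv 0}}"
    r"{\sp{\sn pictureTransparent}{\sv 16777215}}"
)
props = parse_shape_properties(sample)
print(props["shapeType"], props["pictureTransparent"])  # 75 16777215
```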


For example, consider the Word 2003 RTF for an image of x², which includes the wzEquationXML property. The wzEquationXML value group contains a bunch of XML including the OMML, which is given by the <m:oMathPara> …</m:oMathPara> XML.



{\sp{\sn shapeType}{\sv 75}}

{\sp{\sn fFlipH}{\sv 0}}

{\sp{\sn fFlipV}{\sv 0}}

{\sp{\sn pictureTransparent}{\sv 16777215}}

{\sp{\sn fLine}{\sv 0}}

{\sp{\sn wzEquationXML}

{\sv <?xml version="1.0" encoding="UTF-8" standalone="yes"?>\'0d\'0a<?mso-application progid="Word.Document

<…bunch of XML describing document environment…>

<m:oMathPara><m:oMath><m:sSup><m:sSupPr><m:ctrlPr><w:rPr><w:rFonts w:ascii="Cambria Math" w:h-ansi="Cambria Math"/><wx:font wx:val="Cambria Math"/><w:i/></w:rPr></m:ctrlPr></m:sSupP

r><m:e><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:h-ansi="Cambria Math"/><wx:font wx:val="Cambria Math"/><w:i/></w:rPr><m:t>x</m:t></m:r></m:e><m:sup><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:h-ansi="Cambria Math"/><wx:font wx:val="Cambria Math"/>

<w:i/></w:rPr><m:t>2</m:t></m:r></m:sup></m:sSup></m:oMath></m:oMathPara></w:p><w:sectPr wsp:rsidR="00000000"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/><w:cols w:space="720"/></w:sectPr></w:body></w:wordDocument>}}

{\sp{\sn fLayoutInCell}{\sv 1}}}




{\*\blipuid f3a10b6edfd046bb828e459f44f8828d}89504e470d0a1a0a0000000d494844520000000f000000140802000000dda5f0450000000373424954050605330b8d80000000017352474200aece1ce9000000



{\nonshppict{\pict\picscalex100\picscaley100\piccropl0\piccropr0\piccropt0\piccropb0\picw397\pich529\picwgoal225\pichgoal300\wmetafile8\bliptag-207549586\blipupi96{\*\blipuid f3a10b6edfd046bb828e459f44f8828d}<…hexadecimal string with metafile data…>
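Pulling the OMML out of RTF like this can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a full RTF reader: it assumes the wzEquationXML value contains no braces, that raw line breaks in the RTF are insignificant (the listing above shows a tag split across lines), and that \'hh escapes encode single bytes such as the \'0d\'0a line ending. The function names and the shortened sample are made up for this example.

```python
import re

def decode_rtf_hex_escapes(text):
    """Replace RTF \\'hh escapes with the character whose code is hh."""
    return re.sub(r"\\'([0-9a-fA-F]{2})", lambda m: chr(int(m.group(1), 16)), text)

def extract_omml(rtf):
    """Return the <m:oMathPara> fragments found in wzEquationXML property values."""
    fragments = []
    pattern = r"\{\\sn\s+wzEquationXML\}\s*\{\\sv\s+([^{}]*)\}"
    for match in re.finditer(pattern, rtf):
        # Remove raw line breaks first; real ones are carried by \'0d\'0a escapes.
        value = re.sub(r"[\r\n]", "", match.group(1))
        xml = decode_rtf_hex_escapes(value)
        fragments.extend(re.findall(r"<m:oMathPara>.*?</m:oMathPara>", xml, re.DOTALL))
    return fragments

# Shortened, hypothetical sample in the same shape as the listing above.
sample = r"""{\sp{\sn fLine}{\sv 0}}
{\sp{\sn wzEquationXML}
{\sv <?xml version="1.0"?>\'0d\'0a<w:p><m:oMathPara><m:oMath><m:r><m:t>x</m:t></m:r></m:oMa
th></m:oMathPara></w:p>}}"""

for fragment in extract_omml(sample):
    print(fragment)
```

The extracted fragment uses prefixes such as m: and w: that are declared elsewhere in the surrounding XML, so feeding it to an XML parser would presumably require wrapping it in an element that declares those namespaces.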

You need to use Word 2003 to get this RTF, since Word 2003 has been patched to write the wzEquationXML property. Word 2007 doesn’t write this property when it writes RTF for the png math zone images, since it writes math zones using math RTF (see the Mathematics section of the RTF Specification 1.9.1).