Transforming Open XML Documents using XSLT

Transforming Open XML documents using XSLT is an interesting scenario.  However, Open XML documents are stored using the Open Packaging Conventions (OPC); an OPC package is essentially a ZIP file that contains XML and binary parts.  XSLT processors can’t open and transform such files.  But if we first convert the document to a different format, the Flat OPC format, which is a single XML file, we can then transform it using XSLT.  Perhaps the most compelling reason to use XSLT on Open XML documents is document generation.  You can take a source ‘template’ Open XML document and a source XML data document, and produce a finished, formatted Open XML document with content derived from the source XML data document.

This post is one in a series of four posts that present this approach to transforming Open XML documents using XSLT.  The four posts are:

Transforming Open XML Documents using XSLT (This Post)

Presents an overview of the transformation process of Open XML documents using XSLT, and why this is important.  Also presents the ‘Hello World’ XSLT transform of an Open XML document.

Transforming Open XML Documents to Flat OPC Format

This post describes the process of conversion of an Open XML (OPC) document into a Flat OPC document, and presents the C# function, OpcToFlat.

Transforming Flat OPC Format to Open XML Documents

This post describes the process of conversion of a Flat OPC file back to an Open XML document, and presents the C# function, FlatToOpc.

The Flat OPC Format

Presents a description and examples of the Flat OPC format.

This approach is particularly important in SharePoint – it allows us to write and install a SharePoint feature that can transform Open XML documents in a general way using XSL style sheets stored in document libraries.  XSLT developers can then create a variety of XSL transforms of Open XML documents without writing and installing server-side code for each type of transform.  I’ll be writing about this powerful technique in the near future.

As you can see in the code in the linked posts, the conversion to and from the Flat OPC format is simple – less than 100 lines of code for each type of conversion.

The program OpcXsltTransform (attached) uses the code in the above posts, and the classes in System.Xml.Xsl to perform a transform using a supplied XSL style sheet.

To run OpcXsltTransform, you supply as arguments the source Open XML document, the destination Open XML document, and the name of the XSL style sheet.  You can optionally supply a fourth argument, -outputIntermediate.  If you supply this argument, OpcXsltTransform saves two extra files to disk: the Flat OPC file produced by converting the source Open XML document, and the new Flat OPC file produced by the XSL transform.  This can be helpful in debugging the XSL style sheet.  The name of the source intermediate file is the same as the source document, but with a file extension of ‘.xml’.  The name of the destination intermediate file is the same as the destination document, but with a file extension of ‘.xml’.  Here is the usage:

DocXslTransform -source source.docx -destination dest.docx -xsl transform.xsl [-outputIntermediate]
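The naming rule for the intermediate files can be sketched in a few lines of C#.  This is only an illustration of the rule described above, not the code from the attached program; the class name IntermediateFileNames and the method For are hypothetical names that I made up for this sketch.

```csharp
using System;
using System.IO;

public static class IntermediateFileNames
{
    // Hypothetical helper: keeps the directory and base name of the
    // Open XML document and swaps the file extension for ".xml".
    public static string For(string openXmlPath)
    {
        return Path.ChangeExtension(openXmlPath, ".xml");
    }

    public static void Main()
    {
        Console.WriteLine(For("source.docx")); // prints source.xml
        Console.WriteLine(For("dest.docx"));   // prints dest.xml
    }
}
```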

Here is an artificially simplistic XSL style sheet that works with the Flat OPC format.  It matches every w:t (text) element in a run of a body paragraph whose content is exactly ‘Hello World’ and replaces it with a new w:t element that contains ‘Goodbye World’.  Everything else in the document is copied through unchanged by the identity transform.

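<xsl:template match="w:document/w:body/w:p/w:r/w:t[node()='Hello World']">
  <w:t>Goodbye World</w:t>
</xsl:template>
<!-- The following transform is the identity transform -->
<xsl:template match="@*|node()">
  <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
</xsl:template>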

This style sheet, as well as the DOCX that it transforms, is included in the bin/debug directory in the attached ZIP file.  You can build the project and run it to see the transform take place.
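If you want to see the transform step in isolation, without the attached project, here is a minimal sketch that uses XslCompiledTransform from System.Xml.Xsl.  It is not the code of OpcXsltTransform.  The class name HelloWorldTransformSketch is hypothetical, the style sheet is my own complete version of the ‘Hello World’ style sheet described above (wrapped in an xsl:stylesheet element with the namespace declarations it needs), and the sample input is a heavily trimmed Flat OPC fragment rather than a complete document that Word would open.  The conversion to and from OPC is left out; see the linked posts for that.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

public static class HelloWorldTransformSketch
{
    // Assumed complete form of the 'Hello World' style sheet.
    public const string StyleSheet = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
  <xsl:template match=""w:document/w:body/w:p/w:r/w:t[node()='Hello World']"">
    <w:t>Goodbye World</w:t>
  </xsl:template>
  <!-- The identity transform copies everything else unchanged. -->
  <xsl:template match='@*|node()'>
    <xsl:copy>
      <xsl:apply-templates select='@*|node()'/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>";

    // Trimmed-down Flat OPC fragment for illustration only: one package,
    // one part holding the main document, two paragraphs.
    public const string SampleFlatOpc = @"<pkg:package xmlns:pkg='http://schemas.microsoft.com/office/2006/xmlPackage'>
  <pkg:part pkg:name='/word/document.xml'
      pkg:contentType='application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml'>
    <pkg:xmlData>
      <w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
        <w:body>
          <w:p><w:r><w:t>Hello World</w:t></w:r></w:p>
          <w:p><w:r><w:t>Unchanged text</w:t></w:r></w:p>
        </w:body>
      </w:document>
    </pkg:xmlData>
  </pkg:part>
</pkg:package>";

    // Applies an XSL style sheet (given as text) to a Flat OPC document
    // and returns the transformed Flat OPC document.
    public static XDocument Transform(XDocument flatOpc, string styleSheet)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader xslReader = XmlReader.Create(new StringReader(styleSheet)))
        {
            xslt.Load(xslReader);
        }

        var result = new XDocument();
        using (XmlReader input = flatOpc.CreateReader())
        using (XmlWriter output = result.CreateWriter())
        {
            xslt.Transform(input, output);
        }
        return result;
    }

    public static void Main()
    {
        XDocument result = Transform(XDocument.Parse(SampleFlatOpc), StyleSheet);

        // Print the text of every w:t element in the transformed document.
        XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
        foreach (XElement t in result.Descendants(w + "t"))
        {
            Console.WriteLine(t.Value);
        }
    }
}
```

Run against the sample fragment, this should print ‘Goodbye World’ followed by ‘Unchanged text’.  The first template wins over the identity template for the matching w:t element because a pattern with a predicate has a higher default priority than node().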