Increasing Word Automation Performance for Large Amounts of Data by Using Open XML SDK

Summary: Learn how to insert large amounts of data quickly by using Open XML. On a typical 4-GHz computer with a single processor, you can insert a 300-row table with an image in each cell by using a content control in less than five seconds. Using Word automation, this process takes much more than a minute.

Applies to: Office 2010 | Open XML | Visual Studio Tools for Microsoft Office | Word 2007 | Word 2010

In this article
Improving Performance When Manipulating Documents
Inserting Content into the Flat Open Packaging Conventions (OPC) Package
Sample Overview
Performance Issues
Conclusion
Additional Resources
About the Authors

Published:   March 2010

Provided by:   Anil Kumar, Microsoft Corporation | Ansari M, Microsoft Corporation | Sarang Datye, Microsoft Corporation | Sunil Kumar, Microsoft Corporation


Download the sample code

Improving Performance When Manipulating Documents

Prior to Open XML, if you had to manipulate an open document in Microsoft Office Word 2007, you had to use technologies such as macros and Microsoft Visual Studio Tools for the Microsoft Office system (3.0). With Open XML formats, you can alter document contents by using the Open XML SDK 1.0 for Microsoft Office. If you are inserting many objects, using the Open XML SDK 1.0 or the Open XML SDK 2.0 for Microsoft Office can give you a significant performance improvement.

Inserting Content into the Flat Open Packaging Conventions (OPC) Package

The following diagram shows the process for inserting content by using the Open XML SDK 1.0.

Figure 1. Process chart for inserting content using the Open XML SDK 1.0


The following sections describe the steps in the sample.

Preparing the Flat OPC XML

Flat OPC is a representation of an Open XML document as a single XML file. Each individual document part from the Open XML document is stored as an XML element under the package element. Binary parts, such as images, are converted to base-64 ASCII format. The relationships between the document parts are also stored in the flat OPC XML document.

You can see an example of this format when you save a document as XML in Word 2007 and Microsoft Word 2010. To see an example, create a small document, save it as XML, and then look at the result.
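As a rough illustration of that layout, the following minimal sketch builds a two-part Flat OPC document with LINQ to XML: the package-level relationships part and a main document part that holds one paragraph. The class and method names (FlatOpcSkeleton, Build) are hypothetical and are not part of the sample. A document that Word saves as XML contains many more parts than this.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper, not part of the article's sample.
public static class FlatOpcSkeleton
{
    static readonly XNamespace Pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";
    static readonly XNamespace Rel = "http://schemas.openxmlformats.org/package/2006/relationships";
    static readonly XNamespace W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    const string RelsContentType =
        "application/vnd.openxmlformats-package.relationships+xml";
    const string MainDocumentContentType =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";
    const string OfficeDocumentRelType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    public static XDocument Build(string text)
    {
        // Package-level relationships: points the package at the main document part.
        XElement relsPart = new XElement(Pkg + "part",
            new XAttribute(Pkg + "name", "/_rels/.rels"),
            new XAttribute(Pkg + "contentType", RelsContentType),
            new XElement(Pkg + "xmlData",
                new XElement(Rel + "Relationships",
                    new XElement(Rel + "Relationship",
                        new XAttribute("Id", "rId1"),
                        new XAttribute("Type", OfficeDocumentRelType),
                        new XAttribute("Target", "word/document.xml")))));

        // Main document part: a body with a single paragraph of text.
        XElement documentPart = new XElement(Pkg + "part",
            new XAttribute(Pkg + "name", "/word/document.xml"),
            new XAttribute(Pkg + "contentType", MainDocumentContentType),
            new XElement(Pkg + "xmlData",
                new XElement(W + "document",
                    new XAttribute(XNamespace.Xmlns + "w", W.NamespaceName),
                    new XElement(W + "body",
                        new XElement(W + "p",
                            new XElement(W + "r",
                                new XElement(W + "t", text)))))));

        return new XDocument(
            new XDeclaration("1.0", "UTF-8", "yes"),
            new XProcessingInstruction("mso-application", "progid=\"Word.Document\""),
            new XElement(Pkg + "package",
                new XAttribute(XNamespace.Xmlns + "pkg", Pkg.NamespaceName),
                relsPart,
                documentPart));
    }
}
```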

This XML document is composed of various parts. You can identify each document part by its name. Table 1 lists the names of frequently used document parts.

Table 1. Frequently used document parts

Document Part | Description
pkg:name="/word/document.xml" | This XML fragment represents the main document that contains all the content such as paragraphs, tables, and pictures.
pkg:name="/word/styles.xml" | This XML fragment represents the styles part (similar to styles XML in an Open XML package).
pkg:name="/word/numbering.xml" | This XML fragment represents the numbering part (similar to numbering XML in an Open XML package).
<pkg:part pkg:name="/word/media/image3.png" pkg:contentType="image/png" pkg:compression="store"> <pkg:binaryData> iVBORAAlwSFlzAAAS dAAAEn… | This XML fragment represents a binary part. The binary data is encoded in Base-64 format and broken into lines of 76 characters. There must not be a line break at the beginning or end of the data. (Typically, this part represents an image.)
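In .NET, one way to produce Base-64 text in this shape is the Base64FormattingOptions.InsertLineBreaks option, which breaks the output after every 76 characters and adds no break at the start or end. The following is a minimal sketch. The class name Base64Lines is hypothetical, and whether the sample encodes its images this way is an assumption.

```csharp
using System;

// Hypothetical helper, not part of the article's sample.
public static class Base64Lines
{
    // Encodes binary data as Base-64 text broken into lines of 76 characters.
    public static string Encode(byte[] data)
    {
        return Convert.ToBase64String(data, Base64FormattingOptions.InsertLineBreaks);
    }
}
```

Convert.FromBase64String ignores the line breaks, so the text can be decoded again without removing them first.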

There are two kinds of relationships: a package's relationship to document parts, and the parts' relationships to other document parts.

  • The following code example shows a package's relationship to document parts.

    <pkg:part pkg:name="/_rels/.rels" pkg:contentType="application/vnd.openxmlformats-package.relationships+xml">
      <pkg:xmlData>
        <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
          <Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties" Target="docProps/app.xml"/>
          <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" Target="docProps/core.xml"/>
          <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
        </Relationships>
      </pkg:xmlData>
    </pkg:part>
    
  • The following code example shows a part's relationships to other document parts.

    <pkg:part pkg:name="/word/_rels/document.xml.rels" pkg:contentType="application/vnd.openxmlformats-package.relationships+xml">
      <pkg:xmlData>
        <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
          <Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/control" Target="activeX/activeX2.xml"/>
          <Relationship Id="rId9" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image3.png"/>
        </Relationships>
      </pkg:xmlData>
    </pkg:part>
    

To insert content in a Word document, you must prepare an XML document in the flat OPC format. For more information about Flat OPC, see Eric White's blog post, Flat OPC Format.

The following sections explain how to do this.

Converting Flat OPC Packages to In-Memory Packages

Working directly with flat OPC XML is error-prone; use the Open XML SDK 1.0 or the Open XML SDK 2.0 instead. Both offer advantages over working with the raw XML.

The Open XML SDK 1.0:

  • Is less prone to developer XML mark-up issues, because it is strongly-typed.

  • Supports strongly-typed access to document parts.

  • Makes it easier to work with relationships between parts.

The Open XML SDK 2.0:

  • Supports strongly-typed access to document parts and the contents of a document part.

  • Enables developers to take full advantage of additional features such as document validation.

To use either SDK, convert the flat OPC XML document to an Open XML package: create a package and write its contents to a memory stream. You can then use this memory stream directly with the Open XML SDK 1.0 or the Open XML SDK 2.0.

In this example, an extension method on the Range object returns a memory stream. The extension method uses the WordOpenXML property of the Range object, which returns a Flat OPC XML document for that range as a string. You use this string to prepare an in-memory package so that your code can access necessary parts such as the main document part, the styles part, and the numbering part.

Next, you iterate through all the parts and prepare a package. You can open an Open XML SDK 1.0 WordprocessingDocument by using this package. The Package class is in the System.IO.Packaging namespace, which underlies the Open XML SDK 1.0 and the Open XML SDK 2.0. The following example shows how this package stream can be used.

Stream packageStream = this.Paragraphs[1].Range.GetPackageStreamFromRange();
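The GetPackageStreamFromRange extension method itself is not shown in this article. The following minimal sketch shows one way such a conversion could work, using only System.IO.Packaging and LINQ to XML: it creates every non-relationship part first, and then re-creates the package-level and part-level relationships. The class and method names (FlatOpcToPackage, Convert) are hypothetical. The sketch assumes that every relationships part other than /_rels/.rels belongs to a part that exists in the flat OPC document.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, not part of the article's sample.
public static class FlatOpcToPackage
{
    static readonly XNamespace Pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";
    static readonly XNamespace Rel = "http://schemas.openxmlformats.org/package/2006/relationships";
    const string RelsType = "application/vnd.openxmlformats-package.relationships+xml";

    public static MemoryStream Convert(string flatOpcXml)
    {
        XDocument doc = XDocument.Parse(flatOpcXml);
        var stream = new MemoryStream();
        using (Package package = Package.Open(stream, FileMode.Create, FileAccess.ReadWrite))
        {
            var parts = doc.Root.Elements(Pkg + "part").ToList();

            // Pass 1: create all parts except the relationships parts.
            foreach (XElement xmlPart in parts)
            {
                string name = (string)xmlPart.Attribute(Pkg + "name");
                string contentType = (string)xmlPart.Attribute(Pkg + "contentType");
                if (contentType == RelsType) continue;

                PackagePart part = package.CreatePart(new Uri(name, UriKind.Relative), contentType);
                using (Stream partStream = part.GetStream(FileMode.Create))
                {
                    XElement xmlData = xmlPart.Element(Pkg + "xmlData");
                    if (xmlData != null)
                    {
                        using (XmlWriter writer = XmlWriter.Create(partStream))
                            xmlData.Elements().First().WriteTo(writer);
                    }
                    else
                    {
                        // Binary part: decode the Base-64 text (line breaks are ignored).
                        byte[] bytes = System.Convert.FromBase64String(
                            (string)xmlPart.Element(Pkg + "binaryData"));
                        partStream.Write(bytes, 0, bytes.Length);
                    }
                }
            }

            // Pass 2: re-create relationships from the relationships parts.
            foreach (XElement relsPart in parts.Where(
                p => (string)p.Attribute(Pkg + "contentType") == RelsType))
            {
                string name = (string)relsPart.Attribute(Pkg + "name");
                foreach (XElement r in relsPart.Descendants(Rel + "Relationship"))
                {
                    var target = new Uri((string)r.Attribute("Target"), UriKind.RelativeOrAbsolute);
                    TargetMode mode = (string)r.Attribute("TargetMode") == "External"
                        ? TargetMode.External : TargetMode.Internal;
                    string type = (string)r.Attribute("Type");
                    string id = (string)r.Attribute("Id");

                    if (name == "/_rels/.rels")
                    {
                        package.CreateRelationship(target, mode, type, id);
                    }
                    else
                    {
                        // "/word/_rels/document.xml.rels" belongs to "/word/document.xml".
                        string dir = name.Substring(0, name.IndexOf("/_rels", StringComparison.Ordinal));
                        string file = name.Substring(name.LastIndexOf('/') + 1);
                        file = file.Substring(0, file.Length - ".rels".Length);
                        PackagePart source = package.GetPart(new Uri(dir + "/" + file, UriKind.Relative));
                        source.CreateRelationship(target, mode, type, id);
                    }
                }
            }
        }
        stream.Position = 0; // Rewind so that the caller can open the package.
        return stream;
    }
}
```

An extension method such as GetPackageStreamFromRange could then pass the WordOpenXML string of the range to a helper like this one and return the resulting stream.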

After creating this package, you can create an Open XML formats document directly from the package. You can then use the Open XML SDK 1.0 to work with this document. First, clear the body contents, and then add data. This effectively replaces the existing content with the new content that you are constructing.
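The clear-then-add step might look like the following minimal sketch. It uses the strongly typed classes of the Open XML SDK 2.0 (the Document and Body classes are not available in the Open XML SDK 1.0). The class and method names (BodyReplaceSketch, ReplaceBody) are hypothetical, and a single text paragraph stands in for the table rows that the sample generates.

```csharp
using System;
using System.IO;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the article's sample.
public static class BodyReplaceSketch
{
    // Replaces everything in the document body with one paragraph of text.
    public static void ReplaceBody(Stream packageStream, string text)
    {
        using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(packageStream, true))
        {
            Body body = wordDoc.MainDocumentPart.Document.Body;

            // Clear the existing body contents.
            body.RemoveAllChildren();

            // Add the new content.
            body.AppendChild(new Paragraph(new Run(new Text(text))));

            wordDoc.MainDocumentPart.Document.Save();
        }
    }
}
```

Note that RemoveAllChildren also removes the section properties element at the end of the body, so code that needs to keep page settings would have to copy that element back.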

Processing Images

If the in-memory document includes images, then you must add those images to the image part by using a unique relationship identifier (ID). Markup in the document part refers to this relationship ID to identify the image. If there is any mismatch when referring to this relationship ID, the in-memory document is corrupted.
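Because a mismatched relationship ID corrupts the document, it can be worth checking the IDs before you insert the XML. The following minimal sketch scans the main document part of a Flat OPC document for r:embed attributes (the attribute that picture markup uses to reference an image) and reports any ID that has no matching Relationship element. The class and method names (ImageRelationshipCheck, FindUnresolvedImageIds) are hypothetical and are not part of the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the article's sample.
public static class ImageRelationshipCheck
{
    static readonly XNamespace Pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";
    static readonly XNamespace Rel = "http://schemas.openxmlformats.org/package/2006/relationships";
    static readonly XNamespace R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Returns the r:embed IDs in /word/document.xml that have no matching
    // Relationship in /word/_rels/document.xml.rels.
    public static List<string> FindUnresolvedImageIds(XDocument flatOpc)
    {
        XElement documentPart = FindPart(flatOpc, "/word/document.xml");
        XElement relsPart = FindPart(flatOpc, "/word/_rels/document.xml.rels");
        if (documentPart == null) return new List<string>();

        var knownIds = new HashSet<string>(
            relsPart == null
                ? Enumerable.Empty<string>()
                : relsPart.Descendants(Rel + "Relationship")
                          .Select(r => (string)r.Attribute("Id")));

        return documentPart.Descendants()
            .Attributes(R + "embed")
            .Select(a => a.Value)
            .Where(id => !knownIds.Contains(id))
            .Distinct()
            .ToList();
    }

    static XElement FindPart(XDocument flatOpc, string name)
    {
        return flatOpc.Root.Elements(Pkg + "part")
            .FirstOrDefault(p => (string)p.Attribute(Pkg + "name") == name);
    }
}
```

An empty result means that every embedded image reference resolves to a relationship.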

In Figure 1, we added an image part with relationship ID "RImage", as shown in the following example.

ImagePart imgPart = wordDoc.MainDocumentPart.AddImagePart(ImagePartType.Jpeg, "RImage");
System.Drawing.Image img = (System.Drawing.Image)Resources.ResourceManager.GetObject("RImage");
img.Save(imgPart.GetStream(), System.Drawing.Imaging.ImageFormat.Jpeg);

This is equivalent to preparing a fragment like the following in Flat OPC XML.

<pkg:part pkg:name="/word/media/image2.png" pkg:contentType="image/png" pkg:compression="store">
  <pkg:binaryData>
      iVBORw0K…
  </pkg:binaryData>
</pkg:part>
<Relationship Id="rId17" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image2.png"/>

The code that is presented in the example refers to this image by using the relationship ID as follows.

GenerateDrawing("RImage",value,"RImage.jpg",321933L,288000L,
                     19050L, 17957L,0L,0L))

Converting Back to Flat OPC XML from In-Memory Packages

At this point, you have modified the in-memory Open XML formats document, but you must convert it back to flat OPC before you can insert it. The following code shows how to convert back to Flat OPC XML. It iterates through all the parts of the package and prepares a single XML document that contains all parts. Eric White explains this conversion in his blog post, Transforming Open XML Documents to Flat OPC Format.

// Requires: using System.IO.Packaging; using System.Linq; using System.Xml.Linq;
public static XDocument OpcToFlatOpc(Package package)
{
    XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";
    XDeclaration declaration = new XDeclaration("1.0", "UTF-8", "yes");
    XDocument doc = new XDocument(
        declaration,
        new XProcessingInstruction("mso-application", "progid=\"Word.Document\""),
        new XElement(pkg + "package",
            new XAttribute(XNamespace.Xmlns + "pkg", pkg.ToString()),
            // Convert each package part to a pkg:part element.
            package.GetParts().Select(part => GetContentsAsXml(part))
        )
    );
    return doc;
}
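The GetContentsAsXml helper that this code calls is not shown in this article. The following is a minimal sketch of what such a helper could look like, under the assumption that parts whose content type ends in "xml" are stored as pkg:xmlData and all other parts as Base-64 pkg:binaryData in 76-character lines. The class name OpcPartSketch is hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Xml.Linq;

// Hypothetical helper, not part of the article's sample.
public static class OpcPartSketch
{
    static readonly XNamespace Pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";

    // Converts one package part to a pkg:part element for a Flat OPC document.
    public static XElement GetContentsAsXml(PackagePart part)
    {
        using (Stream stream = part.GetStream(FileMode.Open, FileAccess.Read))
        {
            if (part.ContentType.EndsWith("xml", StringComparison.OrdinalIgnoreCase))
            {
                // XML part: embed the part's root element under pkg:xmlData.
                return new XElement(Pkg + "part",
                    new XAttribute(Pkg + "name", part.Uri.ToString()),
                    new XAttribute(Pkg + "contentType", part.ContentType),
                    new XElement(Pkg + "xmlData", XElement.Load(stream)));
            }

            // Binary part: store the bytes as Base-64 text in 76-character lines.
            using (var buffer = new MemoryStream())
            {
                stream.CopyTo(buffer);
                string base64 = Convert.ToBase64String(
                    buffer.ToArray(), Base64FormattingOptions.InsertLineBreaks);
                return new XElement(Pkg + "part",
                    new XAttribute(Pkg + "name", part.Uri.ToString()),
                    new XAttribute(Pkg + "contentType", part.ContentType),
                    new XAttribute(Pkg + "compression", "store"),
                    new XElement(Pkg + "binaryData", base64));
            }
        }
    }
}
```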

Finding Suitable Ranges

Next, you must find a suitable range where you can insert this Open XML. This can be any range that is easily identifiable (for example, a paragraph, bookmark, or content control range).

Avoiding Screen Refreshes

Unless you suppress updating the user interface (UI), when you insert the Flat OPC XML, you will see a flicker in the document (it is actually a kind of document merging). This flicker might not be appealing to the user. You can avoid this by suspending the refresh of the Word UI, and then resuming the UI refresh after you finish inserting the Flat OPC document.

To control the Word UI refresh, set the ScreenUpdating property.

this.Application.ScreenUpdating=false;

After completion of the operations, set ScreenUpdating back to true.

this.Application.ScreenUpdating=true;
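If the insert operation throws an exception between these two statements, screen updating stays turned off. One way to guard against that is a try/finally block, sketched below. The class and method names (ScreenUpdatingScope, Run) are hypothetical. The app parameter is typed as dynamic only to keep the sketch free of an interop reference, on the assumption that the caller passes the Word Application object.

```csharp
using System;

// Hypothetical helper, not part of the article's sample.
public static class ScreenUpdatingScope
{
    // Runs 'work' with screen updating suspended, and then restores the
    // previous setting even if 'work' throws an exception.
    // 'app' is assumed to be the Word Application object.
    public static void Run(dynamic app, Action work)
    {
        bool previous = app.ScreenUpdating;
        app.ScreenUpdating = false;
        try
        {
            work();
        }
        finally
        {
            app.ScreenUpdating = previous;
        }
    }
}
```

In a Visual Studio Tools for Office project, a call could look like ScreenUpdatingScope.Run(this.Application, () => range.InsertXML(openxml, ref missing)).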

Sample Overview

The following code shows how to insert a large amount of data by using Open XML. This example assumes that you have the Open XML SDK 1.0 or the Open XML SDK 2.0 installed.

Important

You must download and install the following items to run this code.

To run the sample

  1. Build the sample project, and then press F5 to open a Microsoft Word document.

  2. Click Insert table using Open XML SDK 1.0.

    This inserts 300 table rows and images into the document by using the Open XML SDK 1.0.

Understanding the Sample Logic

The code sample prepares the XML, turns off screen updating, finds a suitable range, and then inserts the Flat OPC XML.

Building this kind of Flat OPC XML can be challenging. You can use the Range.WordOpenXML property to get the Flat OPC XML for that range. This is abstracted out with an extension method, as shown in the following example.

// Get stream for the range. This is the package stream.
Stream packageStream = this.Paragraphs[1].Range.GetPackageStreamFromRange();
string openxml;
// Use the Open XML SDK 1.0 to process it.
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(packageStream, true))
{
    wordDoc.MainDocumentPart.Document.Save();
    // Flush the contents of the package.
    wordDoc.Package.Flush();
    // Convert back to flat OPC by using this in-memory package.
    XDocument xDoc = OpcHelper.OpcToFlatOpc(wordDoc.Package);
    // Keep the flat OPC XML as a string.
    openxml = xDoc.ToString();
}
// Suspend screen updating while inserting, and always restore it afterward.
this.Application.ScreenUpdating = false;
try
{
    Word.Range range = FindRange("bkflatOpc");
    // Insert this XML.
    range.InsertXML(openxml, ref missing);
}
finally
{
    this.Application.ScreenUpdating = true;
}

After you have a stream, you can use the Open XML SDK 1.0 or the Open XML SDK 2.0. With the Open XML SDK 2.0, you can use advanced features such as document validation. To see the difference in performance by using this approach, click the Insert table using Automation button. This inserts 300 table rows with content controls in each cell by using Word Automation.

Performance Issues

Inserting Flat OPC XML is especially useful when inserting large amounts of data, such as multiple tables that contain images. However, the XML document for a range can be very large, so retrieving the Flat OPC XML from the range frequently can cause performance issues. It is better to retrieve the Flat OPC XML once, cache it for as long as you need it, and manipulate only the data and, optionally, related parts within the cached document.
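A cache for this purpose can be very simple. The following minimal sketch stores the Flat OPC string per key and calls the supplied delegate only on the first request. The class name FlatOpcCache, its members, and the idea of keying the cache by a range or bookmark name are assumptions made for illustration and are not part of the sample.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the article's sample.
public sealed class FlatOpcCache
{
    private readonly Dictionary<string, string> cache = new Dictionary<string, string>();

    // Returns the cached Flat OPC XML for the key, reading it with
    // 'readFlatOpc' only if it is not cached yet.
    public string GetOrAdd(string rangeKey, Func<string> readFlatOpc)
    {
        string xml;
        if (!cache.TryGetValue(rangeKey, out xml))
        {
            xml = readFlatOpc();
            cache[rangeKey] = xml;
        }
        return xml;
    }

    // Drops a cached entry, for example after the range content changes.
    public void Invalidate(string rangeKey)
    {
        cache.Remove(rangeKey);
    }
}
```

A caller might pass a delegate such as () => range.WordOpenXML, so that Word is asked for the XML only once per key.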

Conclusion

Inserting flat OPC XML is a powerful mechanism for scenarios where you must insert large amounts of data. You can use the Open XML SDK 1.0 or the Open XML SDK 2.0 to insert data into open documents (by inserting a flat OPC document) or closed documents (by manipulating the Open XML directly).

Additional Resources

For more information, see the following resources:

About the Authors

Anil Kumar, Ansari M, and Sunil Kumar are consultants at Microsoft Global Services India. They specialize in developing solutions for Microsoft Office by using Open XML, Visual Studio Tools for Office, and the Microsoft .NET Framework.

Sarang S. Datye works as an Application Development consultant with Microsoft Global Services India (MGSI). He designs and develops custom solutions that use the Microsoft technology stack, and is primarily focused on customizing Microsoft Office with solutions based on the .NET Framework. He maintains his own blog at My Potential My Passion.