Basic Instincts

Server-Side Generation of Word 2007 Docs

Ted Pattison

Code download available at: BasicInstincts2006_11.exe (170 KB)

Contents

Word 2007 Document Internals
Generating Your First DOCX File
Generating DOCX Files on the Server

Until now, writing and deploying server-side applications that read, modify, and generate documents used by Microsoft® Office applications has presented a challenge. The older binary format used by Microsoft Word, Excel®, and PowerPoint® was introduced in 1997 and has remained the default file format up through the Office 2003 release. However, this binary file format has proven to be too tricky to work with. The vast majority of production applications that read and write Office documents do so by going through the object model of the hosting Office application.

Applications and components that use the object model of applications such as Word or Excel run much better on the desktop than they do in server-side scenarios. Anyone who has spent time writing the extra infrastructure code required to make an Office desktop application behave reliably on the server will tell you it's a less than desirable solution. That's because Office desktop applications such as Word and Excel were never designed to run on the server, and they require a custom utility program to terminate and restart them whenever they encounter a modal dialog that requires human intervention.

The ability to read and write Office documents without going through the object model of the hosting Office application becomes far more desirable in a server-side scenario. Office 2000 and Office 2003 introduced some modest capabilities for creating Excel workbooks and Word documents using XML. This advancement introduced the possibility of writing portions of an Office document using an XML library such as the one contained in the Microsoft .NET Framework provided through the System.Xml namespace.

In the 2007 Microsoft Office system, Microsoft has taken this idea much further by adopting the Office Open XML file formats for the documents used by Microsoft Office Word, Excel, and PowerPoint 2007. The Office Open XML file formats are exciting to ASP.NET and SharePoint® developers because they provide the ability to read, write, and generate a Word document, an Excel workbook, or a PowerPoint presentation on the server without requiring the system to run an Office desktop application on the server.

The Office Open XML file formats are on their way to becoming a published European Computer Manufacturers Association (ECMA) standard. You can read more about the ongoing standardization process, download the latest draft of the Office Open XML file formats specification, and find other great online resources at openxmldeveloper.com. My goal in this month's column is to introduce the programming required to create simple Word documents within an ASP.NET application and pass them back to users who will be able to open them directly with Word 2007.

Word 2007 Document Internals

Let's begin by examining the structure of a simple Word document based on the Office Open XML file formats. As you will see, the Office Open XML file formats are based on standard ZIP file technology. Each top-level file is saved as a ZIP archive, which means you will be able to open a Word document just as you would any other ZIP file and snoop around inside using the ZIP file support built into Windows® Explorer.

Note that the 2007 Microsoft Office system applications such as Word and Excel have introduced file extensions for documents that use the new format. For example, the .docx extension is used for Word documents stored in the Office Open XML file formats while the more familiar .doc extension continues to be used for Word docs stored in the binary format.

To see what I'm talking about, create a new document in Word 2007 and add "Hello Word" as the text. Save the document using the default format to a new file named Hello.docx and close Word. Next, locate Hello.docx using Windows Explorer and rename it Hello.zip. Open Hello.zip and see the structure of folders and files that Word has created inside (see Figure 1).

Figure 1 DOCX File is a ZIP Archive

The top-level file (Hello.docx) is known as a package. Since a package is implemented as a standard ZIP archive, it automatically provides compression and it makes its contents instantly accessible to many existing utilities and APIs on both Windows platforms and other platforms alike.

Inside a package there are two kinds of internal components: parts and items. In general, parts contain content and items contain metadata describing the parts. Items can be further subdivided into relationship items and content-type items. A part is an internal component containing content that is persisted inside the package. The majority of parts are simple text files that are serialized as XML with an associated XML schema. However, parts can also be serialized as binary data when necessary, for example when a Word document contains a graphic image or a media file.

A part is named using a Uniform Resource Identifier (URI) that contains its relative path within the package file combined with the part file name. For example, the main part within the package for a Word document is /word/document.xml. Here are a few more typical part names you will find inside the package for a simple Word document:

  • /[Content_Types].xml
  • /_rels/.rels
  • /docProps/app.xml
  • /docProps/core.xml
  • /word/_rels/document.xml.rels
  • /word/document.xml
  • /word/fontTable.xml
  • /word/settings.xml
  • /word/styles.xml
  • /word/theme/theme1.xml

The Office Open XML file formats use relationships to define associations between a source and a target part. A package relationship defines an association between the top-level package and a part. A part relationship defines an association between a parent part and a child part. Relationships are important because they make these associations discoverable without having to examine the content within the parts in question. That means relationships are independent of content-specific schema and are, therefore, faster to resolve. There is an additional benefit in that you can establish a relationship between two parts without having to modify either one of them.

Relationships are defined in internal components known as relationship items. A relationship item is stored inside the package just like a part, although a relationship item is not actually considered a part. For consistency, relationship items are always created inside folders named _rels.

For example, a package contains exactly one package-relationship item named /_rels/.rels. The package-relationship item contains XML elements to define package relationships such as the one between the top-level package for a .docx file and the internal part /word/document.xml, as shown here:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <Relationships xmlns="../package/2006/relationships">
      <Relationship
        Id="rId1"
        Type="../officeDocument/2006/relationships/officeDocument"
        Target="word/document.xml"/>
    </Relationships>

As you can see, a Relationship element defines a name (its Id attribute), a type, and a target part. You should also observe that the type name for a relationship is defined using the same conventions used to create XML namespaces.

In addition to a single package-relationship item, a package can also contain one or more part-relationship items. For example, you define relationships between /word/document.xml and its child parts inside a part-relationship item located at the URI /word/_rels/document.xml.rels. Note that the Target attribute for a relationship in a part-relationship item is a URI relative to the parent part and not the top-level package.
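To make the two kinds of relationships more concrete, here is a minimal sketch that walks them with the packaging API introduced later in this column. The module name, the DumpRelationships routine, and the file path are hypothetical; the sketch assumes a reference to WindowsBase.dll and a .docx file saved by Word.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Packaging

Module RelationshipWalker

    ' Writes one line per internal relationship found in the package.
    Sub DumpRelationships(ByVal path As String)
        Dim root As New Uri("/", UriKind.Relative)
        Using pack As Package = Package.Open(path, FileMode.Open, FileAccess.Read)

            ' Package relationships are read from /_rels/.rels
            For Each rel As PackageRelationship In pack.GetRelationships()
                If rel.TargetMode <> TargetMode.Internal Then Continue For
                Dim partUri As Uri = PackUriHelper.ResolvePartUri(root, rel.TargetUri)
                Console.WriteLine("package -> {0} ({1})", partUri, rel.RelationshipType)

                ' Part relationships are read from the _rels folder next to the part
                If Not pack.PartExists(partUri) Then Continue For
                Dim part As PackagePart = pack.GetPart(partUri)
                For Each child As PackageRelationship In part.GetRelationships()
                    If child.TargetMode <> TargetMode.Internal Then Continue For
                    ' Targets are relative to the parent part, not the package root
                    Dim childUri As Uri = _
                        PackUriHelper.ResolvePartUri(part.Uri, child.TargetUri)
                    Console.WriteLine("  {0} -> {1}", part.Uri, childUri)
                Next
            Next
        End Using
    End Sub

    Sub Main()
        ' Hypothetical path; point this at any .docx file you have saved
        DumpRelationships("c:\Data\Hello.docx")
    End Sub

End Module
```

Resolving each child target against part.Uri rather than the package root mirrors the rule described above for part-relationship items.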

Every part inside a package is defined in terms of a specific content type. A content type is metadata which defines a part's media type, a subtype, and a set of optional parameters. Any content type used within a package must be explicitly defined inside a component known as a content type item. Each package has exactly one content type item named /[Content_Types].xml. An example of content type definitions defined inside /[Content_Types].xml is shown in Figure 2. You should observe from this figure that relationship items are like parts in that they are also defined with an associated content type.

Figure 2 Content Types Define Package

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <Types xmlns=
      "http://schemas.openxmlformats.org/package/2006/content-types">
      <Default
        Extension="rels"
        ContentType=
          "application/vnd.openxmlformats-package.relationships+xml"/>
      <Default
        Extension="xml"
        ContentType="application/xml"/>
      <Override
        PartName="/word/document.xml"
        ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
    </Types>

Content types are used by the consumer of a package to interpret how to read and render the content within its parts. As you can see from Figure 2, a default content type is typically associated with a file extension, such as .rels or .xml. Override content types are used to define a specific part in terms of a content type that is different from the default content type associated with its file extension. For example, Figure 2 shows how /word/document.xml is associated with a content type that is different from the default content type used for files with the .xml extension.
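The effect of default and override content types can be observed by listing each part together with the content type the package reports for it. The following is a rough sketch using the packaging API covered in the next section; the module name, the ListParts routine, and the file path are hypothetical. It skips any relationship items that the enumeration may return so that only true parts are printed.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Packaging

Module ContentTypeLister

    ' Prints the URI and content type of every part in the package.
    Sub ListParts(ByVal path As String)
        Using pack As Package = Package.Open(path, FileMode.Open, FileAccess.Read)
            For Each part As PackagePart In pack.GetParts()
                ' Relationship items are not parts in the sense used above
                If PackUriHelper.IsRelationshipPartUri(part.Uri) Then Continue For
                Console.WriteLine("{0,-35} {1}", part.Uri, part.ContentType)
            Next
        End Using
    End Sub

    Sub Main()
        ' Hypothetical path; point this at any .docx file you have saved
        ListParts("c:\Data\Hello.docx")
    End Sub

End Module
```

For a document saved by Word, /word/document.xml should show the override content type from Figure 2 rather than the generic application/xml default.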

Generating Your First DOCX File

While there are several existing libraries you can use to read and write to ZIP files, you should use the new packaging API that is part of the WindowsBase.dll assembly that ships with the .NET Framework 3.0 whenever possible because the packaging API is aware of the Office Open XML file formats. For example, there are convenient methods that make it easy to add relationship elements to a relationship item and to add content type elements to a content type item. The packaging API makes things easier because you never have to touch these elements and items directly.

This month's code uses the version of WindowsBase.dll that is installed along with the WinFX® Runtime Components February community technology preview (CTP). Note that WinFX Runtime Components were renamed to the .NET Framework 3.0 with the June CTP release. You can use either of these two versions or any later release of the .NET Framework 3.0. To test the code we are going to write in this column, you will also need to install Word 2007 Beta 2 or later. (Changes to Office since Beta 2 may require some modifications to the code presented here.)

Once you have installed the .NET Framework 3.0, which includes WindowsBase.dll, you can start programming against the packaging API in a Visual Studio® 2005 project by adding a reference, as shown in Figure 3.

Figure 3 Add Reference to WindowsBase.dll

The classes that make up the packaging API are contained inside the System.IO.Packaging namespace. When working with packages, you will also be frequently programming against older and (hopefully) familiar classes in the System.IO namespace and the System.Xml namespace. Examine the code in Figure 4 that shows the skeleton for creating a new package.

Figure 4 Creating a New Package

    Imports System
    Imports System.IO
    Imports System.IO.Packaging
    Imports System.Xml

    Class GenerateDOCX
      Shared Sub Main()

        '*** Create a new package
        Dim pack As Package = Package.Open("c:\Data\Hello.docx", _
                                           FileMode.Create, _
                                           FileAccess.ReadWrite)

        '*** Write code here to create parts and add content

        '*** Close package
        pack.Close()

      End Sub
    End Class

The System.IO.Packaging namespace contains the Package class, which exposes a shared method named Open to be used to create new packages and open existing packages. As with other classes that deal with file I/O, a call to Open should always be complemented with a call to Close.
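One way to guarantee that every Open is paired with a Close is a Using block, which Visual Basic 2005 supports for any type that implements IDisposable. This is only a sketch of that alternative, not the approach taken in the download; the module and routine names are hypothetical, and it assumes that disposing the Package closes it in the same way as calling Close.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Packaging

Module UsingSketch

    ' Creates a new package at the given path and closes it when done.
    Sub CreatePackage(ByVal path As String)
        Using pack As Package = Package.Open(path, _
                                             FileMode.Create, _
                                             FileAccess.ReadWrite)
            ' Create parts and add content here
        End Using ' The package is disposed here, even if an exception was thrown
    End Sub

    Sub Main()
        ' Same path as the column's example
        CreatePackage("c:\Data\Hello.docx")
    End Sub

End Module
```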

Once you have created the new package, the next step is to create one or more parts and to serialize content into them. In this month's example, we will follow the guidelines for a "Hello World" application, which requires the creation of a single part named /word/document.xml. We can create a part by calling the CreatePart method on an open Package object and passing parameters for a URI and a string-based content type:

    '*** Create main document part (document.xml) ...
    Dim uri As Uri = New Uri("/word/document.xml", UriKind.Relative)
    Dim partContentType As String = "application/vnd.openxmlformats" & _
      "-officedocument.wordprocessingml.document.main+xml"
    Dim part As PackagePart = pack.CreatePart(uri, partContentType)

    '*** Get stream for document.xml
    Dim streamPart As New StreamWriter(part.GetStream(FileMode.Create, _
                                                      FileAccess.Write))

The call to CreatePart passes a URI based on the path /word/document.xml and the content type that's required by the Office Open XML file formats for the part in a word-processing document that contains the main story. Once we have created a part, we must serialize our content into it using standard stream-based programming techniques. The code you have just seen opens a stream on the part by calling the GetStream method and uses the resulting Stream object to initialize a StreamWriter object.

The StreamWriter object is going to be used to serialize the "Hello World" XML document into document.xml. Take a look at the following XML; it represents the simplest of XML documents that can be serialized in document.xml:

    <?xml version="1.0" encoding="utf-8"?>
    <w:document xmlns:w=
      "http://schemas.openxmlformats.org/wordprocessingml/2006/3/main">
      <w:body>
        <w:p>
          <w:r>
            <w:t>Hello Word 2007</w:t>
          </w:r>
        </w:p>
      </w:body>
    </w:document>

Note that all the elements in this XML document are defined in the schemas.openxmlformats.org/wordprocessingml/2006/3/main namespace as required by the Office Open XML file formats. The XML document contains a high-level document element. Within the document element there is a body element that contains the main story.

Within the body element there is a <p> element for each paragraph. Within the <p> element is an <r> element, which defines a run. A run is a region of content that shares the same set of characteristics. Within the run is a <t> element, which defines a range of text, in this case "Hello Word 2007".

Now it's time to generate this XML document with code using the XmlDocument class from the System.Xml namespace. If you take a look at Figure 5 you'll see how the code creates these elements within the proper structure and using the appropriate namespace. You should also note how this code writes the XML document into the stream using the StreamWriter object and then closes the stream and flushes the serialized content into the Package object.

Figure 5 Generating the XML Document

    '*** Define string variable for Open XML namespace for nsWP:
    Dim nsWP As String = "http://schemas.openxmlformats.org" & _
                         "/wordprocessingml/2006/3/main"

    '*** Create the start part, set up the nested structure ...
    Dim xmlPart As XmlDocument = New XmlDocument()

    Dim tagDocument As XmlElement
    tagDocument = xmlPart.CreateElement("w:document", nsWP)
    xmlPart.AppendChild(tagDocument)

    Dim tagBody As XmlElement
    tagBody = xmlPart.CreateElement("w:body", nsWP)
    tagDocument.AppendChild(tagBody)

    Dim tagParagraph As XmlElement
    tagParagraph = xmlPart.CreateElement("w:p", nsWP)
    tagBody.AppendChild(tagParagraph)

    Dim tagRun As XmlElement
    tagRun = xmlPart.CreateElement("w:r", nsWP)
    tagParagraph.AppendChild(tagRun)

    Dim tagText As XmlElement
    tagText = xmlPart.CreateElement("w:t", nsWP)
    tagRun.AppendChild(tagText)

    '*** Insert text into part as a Text node
    Dim nodeText As XmlNode
    nodeText = xmlPart.CreateNode(XmlNodeType.Text, "w:t", nsWP)
    nodeText.Value = "Hello Word 2007"
    tagText.AppendChild(nodeText)

    '*** Write XML to part
    xmlPart.Save(streamPart)

    '*** Close stream and flush XML content into package
    streamPart.Close()
    pack.Flush()
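Because every paragraph repeats the same paragraph, run, and text nesting, the element-building code can be folded into a small helper when a document needs more than one paragraph. The following is a sketch only: the ParagraphHelper module and the AppendParagraph function are hypothetical names, and the namespace constant assumes the same Beta 2 era namespace used in this column.

```vbnet
Imports System
Imports System.Xml

Module ParagraphHelper

    ' Assumed to match the namespace used elsewhere in this column
    Const nsWP As String = _
        "http://schemas.openxmlformats.org/wordprocessingml/2006/3/main"

    ' Appends <w:p><w:r><w:t>text</w:t></w:r></w:p> to the given body element.
    Function AppendParagraph(ByVal body As XmlElement, _
                             ByVal text As String) As XmlElement
        Dim doc As XmlDocument = body.OwnerDocument
        Dim tagParagraph As XmlElement = doc.CreateElement("w", "p", nsWP)
        Dim tagRun As XmlElement = doc.CreateElement("w", "r", nsWP)
        Dim tagText As XmlElement = doc.CreateElement("w", "t", nsWP)
        tagText.InnerText = text
        tagRun.AppendChild(tagText)
        tagParagraph.AppendChild(tagRun)
        body.AppendChild(tagParagraph)
        Return tagParagraph
    End Function

    Sub Main()
        Dim doc As New XmlDocument()
        Dim tagDocument As XmlElement = doc.CreateElement("w", "document", nsWP)
        doc.AppendChild(tagDocument)
        Dim tagBody As XmlElement = doc.CreateElement("w", "body", nsWP)
        tagDocument.AppendChild(tagBody)

        AppendParagraph(tagBody, "First paragraph")
        AppendParagraph(tagBody, "Second paragraph")

        ' Show the resulting XML on the console
        Console.WriteLine(doc.OuterXml)
    End Sub

End Module
```

The resulting XmlDocument can be saved into the part's stream with the same Save call used in Figure 5.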

Now we are finished writing the XML content into document.xml. The final step is to create a relationship between the package and document.xml by calling the CreateRelationship method of the Package object. This is easy as long as you know the correct string value for the relationship type and you can come up with a unique name (such as rId1) for the relationship being created:

    '*** Create the relationship part
    Dim relationshipType As String = _
      "http://schemas.openxmlformats.org" & _
      "/officeDocument/2006/relationships/officeDocument"
    pack.CreateRelationship(uri, TargetMode.Internal, _
                            relationshipType, "rId1")
    pack.Flush()

    '*** Close package
    pack.Close()

You should also observe the call to Flush after the call to CreateRelationship. This call forces the packaging API to update the package relationship item with the proper Relationship element. The final call to the Package object's Close method completes the package serialization and releases the file handle on Hello.docx.
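A quick way to check the generated file without launching Word is to read it back through the same API: follow the package relationship to the main document part and pull out the text. The sketch below is an illustration under a few assumptions. The ReadBack module and ReadMainText function are hypothetical names, the relationship is matched on the tail of its type string rather than the full value, and the WordprocessingML namespace is taken from the document element instead of being hardcoded.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Packaging
Imports System.Text
Imports System.Xml

Module ReadBack

    ' Returns the concatenated text of all w:t elements in the main document part.
    Function ReadMainText(ByVal path As String) As String
        Dim result As New StringBuilder()
        Using pack As Package = Package.Open(path, FileMode.Open, FileAccess.Read)
            For Each rel As PackageRelationship In pack.GetRelationships()
                ' Only follow the relationship that points at the main document
                If Not rel.RelationshipType.EndsWith("/officeDocument", _
                        StringComparison.Ordinal) Then Continue For

                Dim partUri As Uri = PackUriHelper.ResolvePartUri( _
                    New Uri("/", UriKind.Relative), rel.TargetUri)
                Dim part As PackagePart = pack.GetPart(partUri)

                Dim doc As New XmlDocument()
                Using partStream As Stream = _
                        part.GetStream(FileMode.Open, FileAccess.Read)
                    doc.Load(partStream)
                End Using

                ' Reuse whatever namespace the document element declares
                Dim ns As New XmlNamespaceManager(doc.NameTable)
                ns.AddNamespace("w", doc.DocumentElement.NamespaceURI)
                For Each tagText As XmlNode In doc.SelectNodes("//w:t", ns)
                    result.Append(tagText.InnerText)
                Next
            Next
        End Using
        Return result.ToString()
    End Function

    Sub Main()
        ' Path of the file written by the console application
        Console.WriteLine(ReadMainText("c:\Data\Hello.docx"))
    End Sub

End Module
```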

Now you have seen all the steps to generate a simple .docx file from a console application written in Visual Basic® 2005, all of which are included in this month's code download. Next, let's write the code to generate a .docx file on the server side from within an ASP.NET app.

Generating DOCX Files on the Server

The first thing to consider when modifying the code from the console application to run on a Web server is that you probably want to avoid having to save the package out as a physical file within the file system of the host computer. Instead, it will be faster to simply create the package in a memory stream within the ASP.NET worker process. Then you can write it back to the client using the OutputStream object of the ASP.NET Response object.

Let's start by changing the code to use a MemoryStream object instead of a physical file. Examine the code in Figure 6, which has been added to a server-side event handler inside an ASP.NET 2.0 page. Note that the first parameter to Package.Open has changed from the string-based file path to a MemoryStream object. This approach will allow you to reuse the same code for creating the package and its parts as we did earlier. However, you don't have to worry about creating and naming an OS-level file. This approach may also provide faster response times and better throughput in high-traffic scenarios.

Figure 6 Using a MemoryStream Object

    '*** Create in-memory stream as buffer
    Dim bufferStream As New MemoryStream()

    '*** Create new package in memory stream
    Dim pack As Package = Package.Open( _
      bufferStream, FileMode.Create, FileAccess.ReadWrite)

    '*** This calls same code shown in Hello World example
    WriteContentToPackage(pack)

    '*** Save/close package object leaving DOCX file in MemoryStream
    pack.Close()

    '*** (1) SET UP HTTP HEADERS FOR RESPONSE
    '*** (2) WRITE PACKAGE CONTENT INTO RESPONSE BODY

The code you have just seen creates a MemoryStream object and then serializes a .docx file into it just as you would serialize a .docx file into a physical file. The code inside the custom WriteContentToPackage method was taken directly from the code inside the console application shown earlier. However, now it's creating a package and serializing it into a buffer in memory instead of into a physical file.

Once you have written the package into the MemoryStream object and called Close on the package object, your work with the packaging API is complete. All that's left to do is to set up the appropriate HTTP headers and write the package content into the body of the response that is being sent back to the client.

Let's start with the HTTP headers. You should call methods on the ASP.NET response object to clear any existing headers and to add a content-disposition header specifying that the response contains an attachment with a file name of Hello.docx:

    Response.ClearHeaders()
    Response.AddHeader("content-disposition", _
                       "attachment; filename=Hello.docx")

Next, you must set up the encoding and MIME content type for the HTTP response and then write the binary content for the package into the body of the HTTP response. This can be accomplished using the code in Figure 7. Note that the last two calls to Response.Flush and Response.Close are required to make sure the entire package is completely written into the HTTP response.

Figure 7 Response.ClearContent

    Response.ClearContent()
    Response.ContentEncoding = System.Text.Encoding.UTF8
    Response.ContentType = "application/vnd.ms-word.document.12"

    '*** Write package to response stream
    bufferStream.Position = 0
    Dim writer As New BinaryWriter(Response.OutputStream)
    Dim reader As New BinaryReader(bufferStream)
    writer.Write(reader.ReadBytes(CInt(bufferStream.Length)))
    reader.Close()
    writer.Close()
    bufferStream.Close()

    '*** flush and close the response object
    Response.Flush()
    Response.Close()
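As a possible simplification of the copy step, the MemoryStream class has a WriteTo method that writes its entire contents to another stream regardless of the current position, so no reader or writer objects are needed. The sketch below demonstrates the idea outside ASP.NET: the ResponseCopySketch module and CopyBuffer routine are hypothetical names, and a second MemoryStream stands in for Response.OutputStream.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Packaging

Module ResponseCopySketch

    ' Copies the whole buffer into the target stream and flushes the target.
    Sub CopyBuffer(ByVal buffer As MemoryStream, ByVal target As Stream)
        buffer.WriteTo(target)
        target.Flush()
    End Sub

    Sub Main()
        '*** Build a small package in memory
        Dim bufferStream As New MemoryStream()
        Dim pack As Package = Package.Open( _
            bufferStream, FileMode.Create, FileAccess.ReadWrite)
        Dim part As PackagePart = pack.CreatePart( _
            New Uri("/word/document.xml", UriKind.Relative), "application/xml")
        Dim streamPart As New StreamWriter( _
            part.GetStream(FileMode.Create, FileAccess.Write))
        streamPart.Write("<placeholder/>")
        streamPart.Close()
        pack.Close()

        '*** Stand-in for Response.OutputStream
        Using target As New MemoryStream()
            CopyBuffer(bufferStream, target)
            Console.WriteLine("Copied {0} of {1} bytes", _
                              target.Length, bufferStream.Length)
        End Using
    End Sub

End Module
```

Inside the page's event handler, the equivalent call would be bufferStream.WriteTo(Response.OutputStream), placed after the headers and content type have been set.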

As long as the client's machine has been configured to understand the MIME content type associated with a .docx file, the user will be presented with the usual options: open the document within Word or save it to the local hard drive.

If the user clicks on the Open button, the .docx file is opened automatically in Word as shown in Figure 8. It is interesting to note that up to this point the package has never been saved as a physical file to the file system. It's been stored only in memory both on the Web server and on the client desktop computer. If the user closes the document without saving it, it is as if it never existed at all. This makes this approach ideal for generating templates for letters, memos, customer lists, and any type of document you can think of.

Figure 8 Document Opens in Word 2007

Send your questions and comments for Ted to instinct@microsoft.com.

Ted Pattison is an author, trainer and SharePoint MVP who lives in Tampa, Florida. Ted is writing a book titled Inside Windows SharePoint Services 3.0 for Microsoft Press and he delivers advanced SharePoint 2007 Training to professional developers through his company, Ted Pattison Group (https://www.TedPattison.net).