Walkthrough: Word 2007 XML Format

 

Erika Ehrli
Microsoft Corporation

June 2006

Applies to:
    Microsoft 2007 Office Suites
    Microsoft Office Word 2007

Summary: Walk through the new default file format for Microsoft Office Word 2007. Read detailed descriptions of the file format architecture, key components, and ways in which you can programmatically modify content. (22 printed pages)

Contents

Introduction
Word 2007 Document Packages
Open Packaging Conventions for the Word XML Format
Interpreting Word 2007 Files
Identifying Non-XML Parts in Word 2007 Documents
Separating Content from Documents
Understanding the Data Store
Walkthrough: Creating a Word XML Format File
Conclusion

Introduction

Microsoft Office Word 2007 provides a new default file format called Microsoft Office Word XML Format (Word XML Format). This format is based on the Open Packaging Conventions, also used by the XML Paper Specification (XPS). The binary file format used in Microsoft Office 97 through Microsoft Office 2003 Editions is still available as a save format, but it is no longer the default when saving new documents.

Microsoft introduced XML into Microsoft Office XP in 1999 with SpreadsheetML in Microsoft Office Excel 2002. SpreadsheetML was a good start, but it did not provide full fidelity; that is, it could not describe every part of a spreadsheet. The next release of Microsoft Office products introduced WordprocessingML in Microsoft Office Word 2003. WordprocessingML was a huge step forward because it was the first full-fidelity XML file format provided by Microsoft Office. Using Microsoft Office 2003, you can parse WordprocessingML files and manipulate, update, or add data to them. However, a few limitations exist. For example, you must encode binary data (such as images) as text within the XML file itself, which increases file size when a document contains many images. Additionally, Word 2003 embeds all custom XML data directly into the WordprocessingML that describes the document. This can make the custom XML difficult to access and manipulate from external processes.

The new file format in Word 2007 solves these issues by dividing the file into document parts, each of which defines a part of the overall contents of the file. When you want to change something in the file, you can simply find the document part you want, such as the header, and edit it without accidentally modifying any of the other XML-based document parts. Similarly, all custom XML data is in its own part, so working with custom XML is now easier and you can generate documents programmatically with less code. In addition to being more robust and making it easier to work with custom XML, the new file format is also much smaller than the binary file format, because it takes advantage of ZIP technology by using the Open Packaging Conventions. This article walks through the structure of a Word 2007 document in this new file format.

Word 2007 Document Packages

The file format in Word 2007 consists of a compressed ZIP file, called a package. This package holds all of the content that is contained within the document. Using the package format decreases file size for Office documents because of the ZIP compression. The new format is also more robust to errors in transmission or handling. It allows you to manipulate the file contents using industry-standard ZIP-based tools. An easy way to look inside the new file format is to save a Word 2007 document in the new default format and rename the file with a .zip extension. Double-click the file to open and view its contents.

Note   To understand the composition of a file based on Microsoft Office Open XML Formats (Office XML Formats), you may want to extract its parts. These steps assume that you have a ZIP utility, such as WinZip, installed on your computer. To open a Word XML Format file and view its parts:

  1. Create a temporary folder in which to store the file and its parts.

  2. Save a Word document (containing text, pictures, and so forth) as a .docx file.

  3. Add a .zip extension to the end of the file name.

  4. Double-click the file. It will open in the ZIP utility. You can see the document parts that are included in the file.

  5. Extract the parts to the folder that you created earlier.

Integrated ZIP compression reduces the file size by up to 75 percent. Files are further broken down into a modular file structure that makes data recovery more successful and enhances security. The new format segments files into components that you can manage and repair independently. Files created in the new format also have a distinctive file extension for each application, depending on the file type.

Table 1. File extensions for Word 2007 file types

Word 2007 File Type | Extension
Word 2007 XML Document | .docx
Word 2007 XML Macro-Enabled Document | .docm
Word 2007 XML Template | .dotx
Word 2007 XML Macro-Enabled Template | .dotm
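
The manual inspection described in the note above can also be scripted. The following is a minimal sketch in Python, using only the standard zipfile module rather than any Office API. The helper names list_parts and build_sample_package are hypothetical, and the sample package is a tiny in-memory stand-in with placeholder contents, not a document that Word would open.

```python
import io
import zipfile


def list_parts(package_bytes):
    """Return the names of all entries stored in a ZIP-based package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def build_sample_package():
    """Build a tiny stand-in package in memory (placeholder contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


sample = build_sample_package()
for name in list_parts(sample):
    print(name)
```

To inspect a real file, you could read the bytes of a saved .docx file and pass them to list_parts. Renaming the file to .zip is only needed for tools that choose a handler by extension.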

Open Packaging Conventions for the Word XML Format

The Open Packaging Conventions specification defines the structure of Word 2007 documents using the new file format. For more information about open packaging conventions, see the Open Packaging Conventions also used by the XML Paper Specification.

To understand the structure of a Word 2007 document, you must understand the three major components of the new file format:

  • Part items. Each part item corresponds to one file in the un-zipped package. For example, if you right-click a Microsoft Office Excel workbook and choose to extract it, you see a workbook.xml file, several sheetn.xml files, and other files. Each of those files is a document part in the package.
  • Content Type items. Content type items describe what file types are stored in a document part. For example, image/jpeg denotes a JPEG image. This information enables Microsoft Office, and third-party tools, to determine the contents of any part in the package and to process its contents accurately.
  • Relationship items. Relationship items specify how the collection of document parts comes together to form a document. Each relationship specifies the connection between a source part and a target resource. Relationships are stored within XML parts in the document package, for example, /_rels/.rels.

The following sections explain how each of these components fits together in an Office XML Formats file.

Word 2007 Document Parts

To facilitate construction, assembly, and reuse of Word 2007 documents by third-party processes and tools, Word divides the contents of the package into several logical parts that each store a specific document part, for example:

  • Comments
  • Style definitions
  • List definitions
  • Headers
  • Charts
  • Diagrams
  • The main document body
  • Images

Word represents each of these document parts with an individual file within the package. These parts can consist of XML files, such as the document parts that contain the markup for the Word XML Format, as well as attached contents, such as pictures or OLE–embedded files in their native format. All of these are contained within the package. However, it is important to note that, with a few exceptions defined within the Open Packaging Conventions, the actual file directory structure is arbitrary.

The relationships of the files within the package, not the file structure, are what determine file validity. You can rearrange and rename the parts of a Word file inside its ZIP container, provided that you update the relationships properly so that the document parts continue to relate to one another as designed. If the relationships are accurate, the file opens without error. The initial file structure in a file in Word 2007 is simply the default structure created by Word to enable you to determine the file composition easily. Provided that you keep the relationships current, you can change this file structure.

For example, in Word 2007, the container file represents a document. Within the container file, there are parts that, when aggregated, compose the document. For example, a Word 2007 file could contain (but is not limited to) the following folders and files:

  • [Content_Types].xml. Describes the content type for each part that appears in the file.
  • _rels folder. Stores the relationship part for any given part.
  • .rels file. Describes the relationships that begin the document structure. Called a relationship part.
  • datastore folder. Contains custom XML data parts within the document. A custom XML data part is an XML file from which you can bind nodes to content controls in the document.
  • item1.xml file. Contains some of the data that appears in the document. Example of a custom XML data part.
  • docProps folder. Contains the application's properties parts.
  • App.xml file. Contains application-specific properties.
  • Core.xml file. Contains common file properties for all files based on the Open Packaging Conventions document format.

Figure 1 shows the file structure of a sample Word 2007 document.

Figure 1. Hierarchical file structure of a typical Word 2007 document

You can replace entire document parts in order to change the content, properties, or formatting of Word 2007 documents.

Word 2007 Content Types

As mentioned previously, each document part has a specific content type. The content type of a part describes the kind of data that the part contains. In the case of the XML parts that contain the markup defined by the Word XML Format, the content type can help you determine the part's composition.

A typical content type begins with the word application and is followed by the vendor name. In the content type, the word vendor is abbreviated to vnd. Most content types that are specific to Word parts begin with application/vnd.openxmlformats-officedocument.wordprocessingml, and content types for macro-enabled files begin with application/vnd.ms-word. If a part is an XML file, its content type ends with +xml. Other non-XML content types, such as images, do not have this addition. Some typical content types are:

  • application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml

    Content type for a document part that describes endnotes within a Word document. The +xml indicates that it is an XML file.

  • application/vnd.openxmlformats-package.core-properties+xml

    Content type for a part that describes the core document properties. The +xml indicates that it is an XML file.

  • image/png

    Content type for an image. The +xml portion is not present—this indicates that this content type is not an XML file.

You can use all of these content types when manipulating the contents of a Word 2007 file programmatically. The Microsoft Windows Software Development Kit (SDK) for Beta 2 of Windows Vista and WinFX Runtime Components includes the System.IO.Packaging namespace, which allows you to add document parts, retrieve and update contents, or create relationships programmatically. For example, using the Package class in the Microsoft WinFX System.IO.Packaging namespace, you can create a document part with the Package.CreatePart method. The CreatePart method takes two parameters: a Uri for the new part and a string for the content type of the part, as follows:

PackagePart packageNewPart = package.CreatePart(uriResourceTarget,
"application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml");

This code example creates a document part using the URI stored in the uriResourceTarget variable with a content type used for styles.

For more information about PackageParts, see the PackagePart Class reference documentation in the Microsoft Windows SDK.
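
For readers without the Windows SDK, the effect of creating a part can be approximated with a ZIP library: write the new entry and register its content type in [Content_Types].xml. The following is a minimal Python sketch under that assumption. It is not the System.IO.Packaging implementation, the helper name add_part is hypothetical, and the seed package contains only an empty content-types file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
STYLES_TYPE = ("application/vnd.openxmlformats-officedocument."
               "wordprocessingml.styles+xml")


def add_part(package_bytes, part_name, content_type, data):
    """Return a new package with one extra part and a matching Override."""
    ET.register_namespace("", CT_NS)  # serialize without an ns0: prefix
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            payload = src.read(item.filename)
            if item.filename == "[Content_Types].xml":
                # Register the content type of the new part by name.
                types = ET.fromstring(payload)
                ET.SubElement(types, f"{{{CT_NS}}}Override",
                              PartName=part_name, ContentType=content_type)
                payload = ET.tostring(types, encoding="utf-8",
                                      xml_declaration=True)
            dst.writestr(item.filename, payload)
        # Part names start with "/", ZIP entry names do not.
        dst.writestr(part_name.lstrip("/"), data)
    return out.getvalue()


# Hypothetical seed package holding only an empty content-types file.
seed = io.BytesIO()
with zipfile.ZipFile(seed, "w") as package:
    package.writestr("[Content_Types].xml", f'<Types xmlns="{CT_NS}"/>')

updated = add_part(seed.getvalue(), "/word/styles.xml", STYLES_TYPE, "<styles/>")
```

A part added this way is still unreachable until some other part declares a relationship to it, which is the subject of the relationship sections later in this article.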

Locating Content Types

This section lists the most frequently encountered content types. Each content type describes a file, or part, inside the package. The [Content_Types].xml file, at the root of the package, records the content type of every part in the file. A Default element assigns a content type to every part that has a given file extension, and an Override element assigns a content type to one specific part by name. For example, you might see something like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"> 
   <Override PartName="/word/footnotes.xml" ContentType=
   "application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml"/> 
   <Default Extension="png" ContentType="image/png"/> 
   <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/> 
   <Default Extension="xml" ContentType="application/xml"/> 
   <Override PartName="/word/document.xml" ContentType=
   "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/> 
   <Override PartName="/word/numbering.xml" ContentType=
   "application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml"/> 
   <Override PartName="/word/styles.xml" ContentType=
   "application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/> 
   <Override PartName="/word/endnotes.xml" ContentType=
   "application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml"/> 
   <Override PartName="/docProps/app.xml" ContentType=
   "application/vnd.openxmlformats-officedocument.extended-properties+xml"/> 
   <Override PartName="/word/settings.xml" ContentType=
   "application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/> 
   <Override PartName="/word/footer2.xml" ContentType=
   "application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/> 
   <Override PartName="/docProps/custom.xml" ContentType=
   "application/vnd.openxmlformats-officedocument.custom-properties+xml"/> 
   <Override PartName="/word/footer1.xml" ContentType=
   "application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/> 
   <Override PartName="/word/theme/theme1.xml" ContentType=
   "application/vnd.openxmlformats-officedocument.theme+xml"/> 
   <Override PartName=
   "/word/fontTable.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/> 
   <Override PartName=
   "/word/webSettings.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/> 
   <Override PartName="/word/header1.xml" ContentType=
   "application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml"/> 
   <Override PartName="/docProps/core.xml" ContentType=
   "application/vnd.openxmlformats-package.core-properties+xml"/> 
</Types> 

You can rename and rearrange all of these parts within the directory structure. The parts are listed here in their default locations with their default names to make it as easy as possible to understand what they are.

Inside the Word directory, off the root of the package, is the majority of the information describing the document. In this directory, you may find parts that represent a number of available content types.

Mapping Document Parts with Content Types

Every XML file in the package is a document part. If you look inside the package of most Word files, within the directory structure you see files, or document parts, such as /word/fontTable.xml and /word/styles.xml. These files have clear names that indicate their purpose (for example, font table and style parts). However, you can also change their names. Therefore, inside the [Content_Types].xml file is a <Types> element that maps each document part to its respective content type. A [Content_Types].xml file might contain entries like these:

<Override PartName="/word/styles.xml" ContentType=
   "application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/> 
<Override PartName=
   "/docProps/core.xml" ContentType="application/vnd.openxmlformats-package.core-properties+xml"/> 

This indicates that the /word/styles.xml document part has a content type of application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml. The /docProps/core.xml document part has a content type of application/vnd.openxmlformats-package.core-properties+xml.
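
The lookup described here can be sketched in a few lines of Python. The function name content_type_of and the sample markup are illustrative assumptions. The sketch checks Override entries first and then falls back to the Default entry for the file extension, and it compares part names without regard to case, which is how the Open Packaging Conventions treat part-name equivalence.

```python
import posixpath
import xml.etree.ElementTree as ET

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
WORD_ML = "application/vnd.openxmlformats-officedocument.wordprocessingml."

# Illustrative content-types markup, shortened from the listing above.
SAMPLE_TYPES = (
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="png" ContentType="image/png"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    f'<Override PartName="/word/styles.xml" ContentType="{WORD_ML}styles+xml"/>'
    '</Types>'
)


def content_type_of(part_name, types_xml):
    """Return the content type of a part, or None if none is declared."""
    root = ET.fromstring(types_xml)
    # An Override for this exact part name wins over any Default.
    for override in root.findall(CT_NS + "Override"):
        if override.get("PartName", "").lower() == part_name.lower():
            return override.get("ContentType")
    # Otherwise fall back to the Default for the file extension.
    extension = posixpath.splitext(part_name)[1].lstrip(".").lower()
    for default in root.findall(CT_NS + "Default"):
        if extension and default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None


print(content_type_of("/word/styles.xml", SAMPLE_TYPES))
print(content_type_of("/word/media/image1.png", SAMPLE_TYPES))
```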

Relationships Between Document Parts

Relationships are one of the most important parts of the package because they record the connections between document parts. You can rename and move parts within the package's directory structure, but the relationships must remain intact to keep the file valid.

A relationship is a logical connection between two parts within the file package. For example, the root document part has a relationship of type http://schemas.openxmlformats.org/officeDocument/2006/relationships/header to a part with the content type application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml. This indicates the relationship between the parts is that the target part is a header for the originating part. Furthermore, the content type indicates that the contents are a Word 2007 header. This header part could then have its own relationships. For example, if the header contained a JPEG image, the header might have a relationship of type http://schemas.openxmlformats.org/officeDocument/2006/relationships/image to a part with the content type image/jpeg.

Within the package, relationships are always located inside a directory titled _rels. To find the relationships originating from any part, look for the _rels folder that is a sibling of that part. If the part has relationships, the _rels folder contains a file that has the original part name with a .rels extension appended to it. For example, suppose you want to see if any relationships exist for the officeDocument part, which is the target of the relationship of type http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument. By default, this part has a URI of /word/document.xml, so you would open the directory /word/_rels in the package and look for a file called document.xml.rels.
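
The naming rule in the previous paragraph is mechanical enough to express as a small function. This is a minimal Python sketch, and the function name rels_part_for is an assumption made for illustration.

```python
import posixpath


def rels_part_for(part_name):
    """Return the name of the relationship part that belongs to part_name."""
    folder, file_name = posixpath.split(part_name)
    # The .rels file lives in a _rels folder beside the source part.
    return posixpath.join(folder, "_rels", file_name + ".rels")


print(rels_part_for("/word/document.xml"))
print(rels_part_for("/"))
```

Passing "/" stands for the package itself, whose relationships are stored in /_rels/.rels.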

Every relationship has a source and a target. The source is the part after which the relationship file is named. For example, all relationships inside document.xml.rels have document.xml as their source. Each .rels file contains a <Relationships> element, inside which you find a <Relationship> element for each relationship, containing the relationship's Id, its relationship type, and its target part. A typical <Relationships> element, in this case from the root-level /_rels/.rels file whose source is the package itself, might look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"> 
   <Relationship Id="rId3" Type=
   "http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties" Target=
   "docProps/app.xml"/> 
   <Relationship Id="rId2" Type=
   "http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" Target=
   "docProps/core.xml"/> 
   <Relationship Id="rId1" Type=
   "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target=
   "word/document.xml"/> 
   <Relationship Id="rId4" Type=
   "http://schemas.openxmlformats.org/officeDocument/2006/relationships/custom-properties" Target=
   "docProps/custom.xml"/> 
</Relationships> 

Notice that each <Relationship> element first specifies the relationship Id, then the relationship type, and finally the target document part.
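
A relationship part such as the one above can be read with any XML parser. The following Python sketch is one possible approach. The function names and the shortened sample markup are assumptions, and the namespace is written with the http scheme defined by the Open Packaging Conventions.

```python
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
OFFICE_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/"

# Illustrative relationship markup with two entries.
SAMPLE_RELS = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    f'<Relationship Id="rId1" Type="{OFFICE_REL}officeDocument" Target="word/document.xml"/>'
    f'<Relationship Id="rId3" Type="{OFFICE_REL}extended-properties" Target="docProps/app.xml"/>'
    '</Relationships>'
)


def read_relationships(rels_xml):
    """Return (Id, Type, Target) tuples for every Relationship element."""
    root = ET.fromstring(rels_xml)
    return [(rel.get("Id"), rel.get("Type"), rel.get("Target"))
            for rel in root.findall(REL_NS + "Relationship")]


def targets_of_type(rels_xml, rel_type):
    """Return the targets of all relationships that have the given type."""
    return [target for _, kind, target in read_relationships(rels_xml)
            if kind == rel_type]


main_parts = targets_of_type(SAMPLE_RELS, OFFICE_REL + "officeDocument")
print(main_parts)
```

Looking parts up by relationship type rather than by file name is what keeps a tool working after parts have been renamed or moved.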

Interpreting Word 2007 Files

This section walks you through the main set of document parts in a Word 2007 file that uses the new file format. It also outlines the relationships between these parts as presented using the default directory structure.

Understanding Root–Level Relationships

The first part of any file that uses the Word XML Format is a virtual document part, or the package itself, called the start part. From this start part, there are relationships to several top-level parts, which describe the contents of the document:

Table 2. Root-level parts, relationships, and content types

Part Name | Relationship Type | Content Type | Optional?
Core Document Properties (as defined in the Open Packaging Conventions) | http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties | application/vnd.openxmlformats-package.core-properties+xml | Yes
Application-Specific Document Properties | http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties | application/vnd.openxmlformats-officedocument.extended-properties+xml | Yes
Custom OLE Document Properties | http://schemas.openxmlformats.org/officeDocument/2006/relationships/custom-properties | application/vnd.openxmlformats-officedocument.custom-properties+xml | Yes
Main Document Part | http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument | application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml | No

These four default parts contain the primary document properties, as well as a reference to the root part for the document, that is, the main document body content.

Understanding Document–Level Relationships

From the main document part, there is a set of relationships to the document parts referred to by the main document, as shown in Table 3.

Note that most relationship types in the table are shown without the following prefix:

http://schemas.openxmlformats.org/officeDocument/2006/relationships

Table 3. Document-level parts, relationships, and content types

Part Name | Relationship Type | Content Type | Optional?
Style Definitions | /styles | application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml | Yes
List Definitions | /lists | application/vnd.openxmlformats-officedocument.wordprocessingml.listDefs+xml | Yes
Document Settings | /settings | application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml | Yes
Headers | /header | application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml | Yes
Footers | /footer | application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml | Yes
Footnotes | /footnotes | application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml | Yes
Endnotes | /endnotes | application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml | Yes
Image(s) | /image | image/[image extension], such as image/png or image/jpeg | Yes
Comments | /comments | application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml | Yes
Font Table | /fontTable | application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml | Yes
Custom XML Items | /customXML | application/xml | Yes
Web Settings | /webSettings | application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml | Yes

This list of parts is not complete. For example, it does not include shared parts such as OLE objects, Microsoft ActiveX controls, and digital signatures. However, it does provide insight into the typical Word XML Format structure in Word 2007.

Identifying the Package URI and Content Type Names

As described previously, you can use a URI to refer programmatically to all of the relationships and almost all of the document parts. There are two types of URIs: one for document parts and another for relationships.

In the new Word XML Format, relationship URIs usually begin with the following:

http://schemas.openxmlformats.org/officeDocument/2006/relationships/

For example, note the relationship type used for application-level properties:

http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties

This URI includes the word officeDocument because the Open XML File Formats convention dictates these relationships.

The exceptions are relationship types that begin with a base of http://schemas.openxmlformats.org/package/2006/relationships/. Notice that this base uses the word package instead of officeDocument, indicating that it is defined by the Open Packaging Conventions, which XPS also uses. One such relationship type that uses this prefix in its URI is the following:

http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties

This URI describes properties specific to the file. Relationship URIs are predefined. You cannot change them.

URIs for document parts point to the document part inside the package. For example, the default URI for the document part containing the main information about the document is /word/document.xml. This indicates that the main document information is contained in a file called document.xml inside a folder called word off the root of the package. You can rename and move document parts inside the package, thereby changing the URI for the document part. It is very important to update the relationships when renaming or relocating document parts inside the package.
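
When you follow a relationship, its Target value is usually written relative to the part that owns the relationship file, so it has to be resolved before you can look up the target part. The sketch below shows one way to do that in Python. The function name resolve_target is an assumption, and the sketch covers only targets inside the package, not external ones.

```python
import posixpath


def resolve_target(source_part, target):
    """Resolve a relationship Target against the part that declares it."""
    base = posixpath.dirname(source_part)
    # A target that starts with "/" is already absolute within the package;
    # posixpath.join then discards the base folder.
    return posixpath.normpath(posixpath.join(base, target))


print(resolve_target("/word/document.xml", "media/image1.png"))
print(resolve_target("/word/document.xml", "../customXml/item1.xml"))
print(resolve_target("/", "word/document.xml"))
```

After a part is moved or renamed, the Target value stored in the source part's relationship file must change so that this resolution still yields the new part URI.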

Identifying Non-XML Parts in Word 2007 Documents

All embedded parts in a Word 2007 document are stored in their native format. Therefore, if you add a picture to a document, you can rename the document with a .zip extension and open it as you would any ZIP file. Within the package, you can locate the picture and open it as well. If the picture is in .png format, you can see and open a .png file directly from the package.

Similarly, if you embed a Microsoft Office Visio document inside a Word 2007 document, you can locate the file as a .bin file inside the package.

This creates many possibilities for developers with files stored on a server. Consider the scenario where a company has hundreds of documents on a server that all contain the same corporate logo image. If the corporate logo changes, you can implement a simple script to replace the old logo with the new logo for every document.

The default location for images in a package is the /word/media directory and the default location for embedded objects in a package is /word/embeddings.
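
The logo-replacement scenario above can be sketched with a ZIP library alone. The following Python example is an illustration under stated assumptions: the helper name replace_part, the entry name word/media/image1.png, and the placeholder byte strings are hypothetical, and the seed package is an in-memory stand-in rather than a complete Word document.

```python
import io
import zipfile


def replace_part(package_bytes, entry_name, new_data):
    """Return a copy of the package in which one entry holds new_data."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        if entry_name not in src.namelist():
            raise KeyError(entry_name)
        for item in src.infolist():
            # Copy every entry unchanged except the one being replaced.
            if item.filename == entry_name:
                dst.writestr(item.filename, new_data)
            else:
                dst.writestr(item.filename, src.read(item.filename))
    return out.getvalue()


# Hypothetical stand-in package with a document part and one image.
seed = io.BytesIO()
with zipfile.ZipFile(seed, "w") as package:
    package.writestr("word/document.xml", "<document/>")
    package.writestr("word/media/image1.png", b"old logo bytes")

updated = replace_part(seed.getvalue(), "word/media/image1.png", b"new logo bytes")
```

Because the entry keeps its name and image format, neither the relationships nor the content types need to change. A script for a server would loop over the stored files and write each updated package back to disk.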

Figure 2 shows the directory structure for a document that contains images and embedded objects.

Figure 2. Hierarchical file structure of a Word 2007 document containing images and embedded objects

Separating Content from Documents

The document part that maps to the following content type:

application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml

defines most of the document structure. In macro-enabled templates, the part that maps to application/vnd.ms-word.template.macroEnabledTemplate.main+xml defines most of the document structure. In the previous code example from the [Content_Types].xml file, this content type maps to the document.xml part in the /word folder.

This part contains XML that is similar to a subset of WordprocessingML used in Word 2003. There are elements for paragraphs, properties, and fonts that describe the basic structure of the document. Individual parts describe the other components of the document, such as headers, footers, lists, and endnotes. By default, most of these parts are siblings of the part with the following content type:

application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml

If you look at the previous code example from the [Content_Types].xml file, you see many of these parts listed.

This separation of content from formatting makes working on elements of a document programmatically a much easier task than in previous versions. Using the WinFX System.IO.Packaging namespace, you can modify the file with a few lines of code and perform tasks such as:

  • Replacing an old corporate logo used in hundreds of documents on a server with a new logo. Simply locate the image, delete it, and replace it with the new image.
  • Updating all the footers in documents on a server with an updated company name.
  • Changing the style of all the text in documents on a server to reflect a new corporate font.

There are, of course, many more possibilities. With this content separation, locating the part to edit is much easier than it is with the WordprocessingML used in Word 2003. In a WordprocessingML file, the entire document is described in one giant XML file. Parsing the file to make the change can be difficult. It also can be risky, because if a mistake is made, the entire document may become corrupt. In contrast, if one part of a Word 2007 document is corrupt, the rest of the document should open without error.

Understanding the Data Store

Similar to many other types of data in the Word XML Format, custom XML data is stored separately from the rest of the document. Each item is stored as a separate part within the package, and by default, this data appears in a folder called customXml located off the root of the package. If you attach an XML file to a document programmatically by adding a new part to the document's CustomXMLParts collection, then by default that XML data is stored in a file called /customXml/item1.xml. If you then add custom XML data from another file, it is stored by default in a file called /customXml/item2.xml.

By using XMLMapping and XPath expressions, you can map specific elements of an XML part to a content control. This means that to modify or change custom XML programmatically, you do not need to parse through a large WordprocessingML file, as you did in Word 2003. Instead, you find the part holding the custom XML that you want and modify only that part of the file.

To add custom data to your document, you need to create a custom XML file and add it to the ZIP package. You also need to create the corresponding relationship from the main document part to your custom XML part.

In the Word XML Format in Word 2007, each custom part persists in its own XML part within the document container. The custom part contains the file name and its relationship information. The XML part is stored off the root of the document container in a folder called customXml.

Figure 3 shows the directory structure for a document that contains custom XML data.

Figure 3. Hierarchical file structure of a Word 2007 document containing custom XML data

Isolating custom XML data inside a document package allows you to read and update custom data without modifying or working with other document parts.

The relationship file, stored inside a _rels folder, describes all the relationships from one XML part to all other XML parts within a Word XML Format document. There are two relationship types for custom XML parts.

The relationship type for the XML is:

http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXmlData

The relationship type for the XML properties is:

http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXmlProps

An ID is stored with each relationship, allowing you to identify it uniquely within the data store.

The actual custom XML part is stored in its own file alongside the _rels folder. Each custom XML part has a file name of item##.xml and its own properties, named itemProps##.xml. In both file names, ## is the number (1, 2, 3. . .) of the custom XML part in the data store. The file format for the item##.xml custom XML part looks like the following.

<o:dataStoreItem> 
   <o:dataStoreItem o:itemID="MSXID for the custom XML part"/> 
   <o:xmlSchemaRef o:relID="relationship ID to a schema"/> 
</o:dataStoreItem> 
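
The two steps named earlier in this section, adding the custom XML file to the package and creating the relationship from the main document part, can be sketched as follows. This Python example is an illustration only. The helper name add_custom_xml and the seed package are hypothetical, and the relationship type constant is the one quoted in this article with the officeDocument casing assumed, so verify it against the specification for the build of Word you target. The sketch does not create the itemProps##.xml properties part.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Assumption: relationship type as quoted in this article.
CUSTOM_XML_REL = ("http://schemas.openxmlformats.org/officeDocument/"
                  "2006/relationships/customXmlData")
DOC_RELS = "word/_rels/document.xml.rels"


def add_custom_xml(package_bytes, xml_data):
    """Add a custom XML part and relate it to the main document part."""
    ET.register_namespace("", REL_NS)  # serialize without an ns0: prefix
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        names = src.namelist()
        # Pick the first unused number, following item1.xml, item2.xml, ...
        number = 1
        while f"customXml/item{number}.xml" in names:
            number += 1
        item_name = f"customXml/item{number}.xml"
        if DOC_RELS in names:
            rels = ET.fromstring(src.read(DOC_RELS))
        else:
            rels = ET.Element(f"{{{REL_NS}}}Relationships")
        # Choose a relationship Id that is not already in use.
        used_ids = {rel.get("Id") for rel in rels}
        index = 1
        while f"rId{index}" in used_ids:
            index += 1
        # The target is relative to the /word folder of the source part.
        ET.SubElement(rels, f"{{{REL_NS}}}Relationship", Id=f"rId{index}",
                      Type=CUSTOM_XML_REL, Target="../" + item_name)
        for item in src.infolist():
            if item.filename != DOC_RELS:
                dst.writestr(item.filename, src.read(item.filename))
        dst.writestr(DOC_RELS, ET.tostring(rels, encoding="utf-8",
                                           xml_declaration=True))
        dst.writestr(item_name, xml_data)
    return out.getvalue(), item_name


# Hypothetical stand-in package with one existing relationship.
seed = io.BytesIO()
with zipfile.ZipFile(seed, "w") as package:
    package.writestr("word/document.xml", "<document/>")
    package.writestr(DOC_RELS, f'<Relationships xmlns="{REL_NS}">'
                     '<Relationship Id="rId1" Type="urn:example" '
                     'Target="styles.xml"/></Relationships>')

updated, new_item = add_custom_xml(seed.getvalue(), "<invoice><total>42</total></invoice>")
```

The new part needs no Override entry as long as [Content_Types].xml declares a Default for the xml extension, as in the listing shown earlier.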

Walkthrough: Creating a Word XML Format File

Document.xml is the only required part in the Word XML Format. For information about how to create a minimal document with only this required part, see the section Creating the Document.

To illustrate how document parts, content type items, and relationship items work together, this section walks through the process of building a more elaborate Word XML Format document in Word 2007. This walkthrough helps illustrate how to access and alter document contents using the Word XML Format.

To create a Word 2007 document that contains content type and relationship items, you need to create a root folder that contains a specific folder and file structure, as shown in Figure 4.

Figure 4. Folder and file structure for a Word 2007 document

After you create all folders and files, the following sections walk you through adding the required XML code to each document part.

Creating the Document Properties

First, you need to create two XML files for the document properties:

  1. Create a folder and name it root.

  2. Create a folder inside the folder root and name it docProps.

  3. Open Notepad or any other XML editor.

  4. Copy and paste the following code into a new file and save it as app.xml inside the docProps folder:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
    <Properties xmlns=
       "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
       xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
         <Template>Normal.dotm</Template> 
         <TotalTime>1</TotalTime> 
         <Pages>1</Pages> 
         <Words>3</Words> 
         <Characters>23</Characters> 
         <Application>Microsoft Office Word</Application> 
         <DocSecurity>0</DocSecurity> 
         <Lines>1</Lines> 
         <Paragraphs>1</Paragraphs> 
         <ScaleCrop>false</ScaleCrop> 
         <Company>MS</Company> 
         <LinksUpToDate>false</LinksUpToDate> 
         <CharactersWithSpaces>25</CharactersWithSpaces> 
         <SharedDoc>false</SharedDoc> 
         <HyperlinksChanged>false</HyperlinksChanged> 
         <AppVersion>12.0000</AppVersion> 
    </Properties> 
    
  5. Open Notepad or any other XML editor.

  6. Copy and paste the following code into a new file and save it as core.xml inside the docProps folder:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
    <cp:coreProperties xmlns:cp=
       "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
       xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" 
       xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:xsi=
       "http://www.w3.org/2001/XMLSchema-instance"> 
       <dc:title></dc:title> 
       <dc:subject></dc:subject> 
       <dc:creator>Your name</dc:creator> 
       <cp:keywords></cp:keywords> 
       <dc:description></dc:description> 
       <cp:lastModifiedBy>Your name</cp:lastModifiedBy> 
       <cp:revision>2</cp:revision> 
       <dcterms:created xsi:type="dcterms:W3CDTF">2006-05-03T01:13:00Z</dcterms:created> 
       <dcterms:modified xsi:type="dcterms:W3CDTF">2006-05-03T01:14:00Z</dcterms:modified> 
    </cp:coreProperties> 
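The created and modified values in core.xml use the W3CDTF date format shown above. If you generate this file from code instead of typing the timestamps, a small helper along these lines can produce matching values. This is a sketch only; the function name w3cdtf_utc is hypothetical, and it assumes you always want UTC timestamps with a trailing Z, as in the sample.

```python
from datetime import datetime, timezone


def w3cdtf_utc(moment):
    """Format a timezone-aware datetime as a UTC W3CDTF string.

    The output has the same shape as the sample values in core.xml,
    for example 2006-05-03T01:13:00Z.
    """
    if moment.tzinfo is None:
        # A naive datetime has no defined offset, so refuse to guess.
        raise ValueError("moment must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def now_w3cdtf():
    """Return the current time as a UTC W3CDTF string."""
    return w3cdtf_utc(datetime.now(timezone.utc))
```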
    

Creating the Document

Next, you need to create an XML file for the document part. This is the only required part in the new Word XML Format.

  1. If you have not already done so, create a folder and name it root.

  2. Create a folder inside the root folder and name it word.

  3. Open Notepad or any other XML editor.

  4. Copy and paste the following code into a new file and save it as document.xml inside the word folder:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
    <w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
       xmlns:o="urn:schemas-microsoft-com:office:office"
       xmlns:o12="http://schemas.microsoft.com/office/2004/7/core"
       xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
       xmlns:m="http://schemas.microsoft.com/office/omml/2004/12/core"
       xmlns:v="urn:schemas-microsoft-com:vml"
       xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/3/wordprocessingDrawing"
       xmlns:w10="urn:schemas-microsoft-com:office:word"
       xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/3/main">
       <w:body> 
         <w:p> 
           <w:r w:rsidR="002847EC"> 
            <w:t>Word 2007 rocks my world!</w:t> 
           </w:r> 
         </w:p> 
       </w:body> 
    </w:document> 
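To show how the paragraph (w:p), run (w:r), and text (w:t) elements nest, the following Python sketch pulls the text of each paragraph out of a document part. It is an illustration rather than a complete reader: the function name paragraph_texts is hypothetical, the namespace is taken from the root element instead of being hard-coded, and paragraphs nested inside other paragraphs (for example, in text boxes) would be reported more than once.

```python
import xml.etree.ElementTree as ET


def paragraph_texts(document_xml):
    """Return the text of each paragraph in the text of a document part."""
    root = ET.fromstring(document_xml)
    # ElementTree writes qualified names as "{namespace}local". Reuse the
    # namespace of the root w:document element for w:p and w:t.
    if root.tag.startswith("{"):
        prefix = "{" + root.tag[1:].split("}", 1)[0] + "}"
    else:
        prefix = ""
    paragraphs = []
    for para in root.iter(prefix + "p"):
        # Join the text of every w:t element inside this paragraph.
        pieces = [t.text or "" for t in para.iter(prefix + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs
```

Applied to the document.xml file shown above, the function would be expected to return a list containing the single sentence in the w:t element.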
    

Creating a Relationship

Next, you need to create a relationship to this part. This relationship is recorded in the root _rels folder, which means that its source is the package itself rather than an individual part. To create the relationship:

  1. Create a folder inside the folder root and name it _rels.

  2. Open Notepad or any other XML editor.

  3. Copy and paste the following code into a new file and save it as .rels inside the _rels folder:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
    <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
       <Relationship Id="rId3" Type=
       "http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties"
       Target="docProps/app.xml"/>
       <Relationship Id="rId2" Type=
       "http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"
       Target="docProps/core.xml"/>
       <Relationship Id="rId1" Type=
       "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
       Target="word/document.xml"/>
    </Relationships>
    </Relationships> 
    
  4. Notice that this XML creates a relationship of type officeDocument with ID rId1 to the document.xml file in the folder named word.
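Because each relationship names a target by path, a mistyped file or folder name is an easy mistake to make when building a package by hand. The following Python sketch, which assumes the folder layout used in this walkthrough, lists any targets in the root relationship file that do not exist on disk. The function name missing_targets is hypothetical.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def missing_targets(root_folder):
    """Return Target values in root_folder/_rels/.rels that have no file."""
    root_folder = Path(root_folder)
    tree = ET.parse(root_folder / "_rels" / ".rels")
    missing = []
    for rel in tree.getroot():
        # Match on the local name so the check does not depend on the
        # exact namespace URI (a simplification for this sketch).
        if rel.tag.rsplit("}", 1)[-1] != "Relationship":
            continue
        # External targets point outside the package, so skip them.
        if rel.get("TargetMode") == "External":
            continue
        target = rel.get("Target", "")
        # Targets in the root .rels file resolve against the package root.
        if not (root_folder / target.lstrip("/")).is_file():
            missing.append(target)
    return missing
```

An empty list means that every internal target named in .rels was found under the root folder.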

Defining the Content Type

Next, you need to define the content type of this file.

  1. Note that the structure of a content type definition file looks like the following:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
    <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
       <Default Extension="rels" ContentType=
       "application/vnd.openxmlformats-package.relationships+xml"/> 
       <Default Extension="xml" ContentType="application/xml"/> 
       <Override PartName="/word/document.xml" ContentType=
       "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/> 
       <Override PartName="/word/styles.xml" ContentType=
       "application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/> 
       <Override PartName="/docProps/app.xml" ContentType=
       "application/vnd.openxmlformats-officedocument.extended-properties+xml"/> 
       <Override PartName="/word/settings.xml" ContentType=
       "application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/> 
       <Override PartName="/word/theme/theme1.xml" ContentType=
       "application/vnd.openxmlformats-officedocument.theme+xml"/> 
       <Override PartName="/word/fontTable.xml" ContentType=
       "application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/> 
       <Override PartName="/word/webSettings.xml" ContentType=
       "application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/> 
       <Override PartName="/docProps/core.xml" ContentType=
       "application/vnd.openxmlformats-package.core-properties+xml"/> 
    </Types> 
    
  2. Open Notepad or any other XML editor.

  3. Copy and paste the above code into a new file and save it as [Content_Types].xml inside the root folder.

    Note   This reserved file name is used by the Open Packaging Conventions to define the content types of all files in the package.
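The content type file combines two kinds of entries: Default elements, which apply to every part with a given file extension, and Override elements, which apply to one specific part and take precedence. The following Python sketch shows one way that lookup could be written. The function name content_type_for is hypothetical, and comparing names without regard to case is a simplification for this example.

```python
import xml.etree.ElementTree as ET


def content_type_for(part_name, content_types_xml):
    """Return the content type for a part name such as /word/document.xml.

    An Override entry for the exact part name wins. Otherwise the Default
    entry for the file extension is used. Returns None if neither matches.
    """
    root = ET.fromstring(content_types_xml)
    defaults = {}
    for elem in root:
        kind = elem.tag.rsplit("}", 1)[-1]  # local name without namespace
        if kind == "Override":
            if elem.get("PartName", "").lower() == part_name.lower():
                return elem.get("ContentType")
        elif kind == "Default":
            defaults[elem.get("Extension", "").lower()] = elem.get("ContentType")
    # No Override matched, so fall back to the extension default.
    extension = part_name.rsplit(".", 1)[-1].lower() if "." in part_name else ""
    return defaults.get(extension)
```

With the content type file shown above, /word/document.xml would resolve through its Override entry, while /_rels/.rels would resolve through the Default entry for the rels extension.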

Creating the Package

Finally, you can put these files into a ZIP package to create a valid Word 2007 document:

  1. Using any ZIP utility, add the contents of the root folder to a ZIP archive: the docProps folder, the word folder, the _rels folder, and [Content_Types].xml.

    IMPORTANT   Do not add the root folder itself to the ZIP archive, or Word 2007 reports an internal error when it opens the file. The subfolders and [Content_Types].xml must sit at the top level of the archive.

  2. Save the archive as simpledocument.docx.
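The packaging step can also be scripted. The following Python sketch uses the standard zipfile module and stores each file under a path relative to the root folder, so that [Content_Types].xml and the subfolders land at the top level of the archive as the IMPORTANT note requires. The function name build_package is hypothetical.

```python
import zipfile
from pathlib import Path


def build_package(root_folder, output_path):
    """Zip the contents of root_folder (not the folder itself).

    Returns the list of entry names written to the archive.
    """
    root_folder = Path(root_folder)
    output_path = Path(output_path)
    # Collect files before creating the archive, and skip the archive
    # itself in case it is saved inside root_folder.
    files = sorted(
        p for p in root_folder.rglob("*")
        if p.is_file() and p.resolve() != output_path.resolve()
    )
    names = []
    with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Entry names are relative to root_folder and use forward slashes.
            name = path.relative_to(root_folder).as_posix()
            archive.write(path, name)
            names.append(name)
    return names
```

For example, build_package("root", "simpledocument.docx") writes entries such as [Content_Types].xml and word/document.xml, with no leading root/ prefix.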

Now, you can open this file in Word 2007 and see the contents of the package:

Figure 5. Simpledocument.docx in Word 2007

Conclusion

When compared to the binary file format used in previous versions of Word, the new Word XML Format in Word 2007 offers several benefits. The compression offered by the ZIP container results in smaller file sizes. The files are also more robust: because the document elements are compartmentalized into separate parts, the file can still open even if one part is damaged.

It is also easier to change, add, or delete data in a Word 2007 file programmatically or manually. The file is easily accessible using the classes in the Microsoft WinFX System.IO.Packaging namespace. You can modify documents on a server with only a few lines of code. You can readily access and manipulate custom XML data from its own separate parts. You can even use events to trigger the change of XML data. For example, you can map a content control to an XML element containing a stock quote and then retrieve the most recent quote programmatically each time the document opens, thereby ensuring that the user always sees the current price.

The possibilities and ease with which you can program against the new Word XML Format are impressive and mark a significant advancement in Microsoft Office.

Acknowledgments

Thanks to Frank Rice, Mark Iverson, and Tristan Davis for their contributions to this article.