Chapter 22: Office Open XML Essentials
This partial chapter is an excerpt from Advanced Microsoft Office Documents 2007 Edition Inside Out, by Stephanie Krieger from Microsoft Press (ISBN 13: 9780735622852 copyright Stephanie Krieger 2007, all rights reserved). No part of these chapters may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, electrostatic, mechanical, photocopying, recording, or otherwise—without the prior written permission of the publisher, except in the case of brief quotations embodied in critical articles or reviews.
In my favorite novel, Alexandre Dumas' The Count of Monte Cristo, the imprisoned Abbe Faria wrote a book without access to paper or writing implements. A 19th century genius resourceful enough to turn MacGyver green with envy, the good Abbe fashioned a pen out of a fishbone, ink out of soot and wine, and 12 rolls of parchment from two shirts. Not bad for an old man who was locked away in a dungeon.
In previous versions of Microsoft Office, the idea of editing a document without first opening the program in which it's created is much like writing a book in a 19th century dungeon. Without the know-how of an Abbe Faria (or, in this case, a software engineer of equal talent), you're probably out of luck.
Well, thanks to the ingenuity of some talented software engineers, you no longer need to be a fictional genius (or hold an advanced computer science degree) to understand every bit of a document's structure well enough even to create one from scratch (if you're so inclined). Though you never have to know a thing about the XML behind your documents to use the 2007 Microsoft Office System, the benefits of getting to know the Office Open XML Formats can be great. Using the XML content for these new file formats, advanced Microsoft Office users can see and understand literally everything that goes into a 2007 Microsoft Office document.
As discussed at several points throughout this book, the transparency of the new file formats can save time, add flexibility, improve integration with external content, and simplify essential tasks such as protecting the private content in your documents or troubleshooting document problems. But my favorite thing about these new formats is simply that you don't have to be a programmer to reap many of these benefits.
It's important to reinforce that this is not an introduction to the XML markup language itself, but to the Office Open XML Formats. That said, in this primer, you'll learn to understand the structure of an Office Open XML Format document and how to edit documents directly in their XML. You will also learn the basics of how to customize the Ribbon and how to create custom XML for binding data to document content.
Though this primer is written for Microsoft Office users and not for programmers, you might have noticed that I've specified advanced users. That's because incorrectly editing a document's XML can break a document faster than you can blink.
That statement isn't meant to scare you away. If you're an advanced user, learning to edit your documents' XML can be easy, and you can just as easily learn to quickly fix anything you may inadvertently break. Rather, I mention this primarily for the trainers or tech support professionals among you who might consider sending basic or intermediate users into a document's XML.
You'll gain tremendous power and flexibility by being able to edit a document's XML, but please don't treat it as just another method for accomplishing document tasks, such as one more option you'd teach in a Microsoft Word course for editing the definition of a paragraph style. This is an avenue for those with the skill to take document production and troubleshooting to a new level.
Consider this: If a document were a car, then the document's XML would be the tools of an auto mechanic. Before you can begin to understand what's going on under the hood, you need to know how to drive. That said, experienced drivers are going to have a great time with the tools in this chapter.
XML Basics for Reading Your Documents
XML is a language used to contain and describe data. In the case of Office Open XML, the data is your document content, and the description includes the settings required for that document to function in the applicable program as well as the settings you apply to the document.
Before you begin to explore a document's XML, the subsections that follow provide a bit of background and basics to help you prepare for the task.
Reading a Markup Language
XML is a markup language. Just as you mark up a document while reviewing it—with comments, corrections, or margin notes—a markup language marks up data with descriptive information about that data.
If you've ever looked at the HTML source code of a Web page, you already have some experience with the type of language you'll see throughout this primer. However, instead of paired formatting codes wrapped around text that you see in HTML (such as <b>text</b> to turn bold formatting on and then off), the Office Open XML Formats use paired tags nested in a hierarchy that compartmentalizes, organizes, and defines everything you need to know about your document.
The following example shows the word text along with its formatting definition. This word is part of a paragraph but is separated out in the source code (the markup) because it contains unique formatting. The bullets that follow the XML code sample explain in detail how to read this sample.
<w:r>
  <w:rPr>
    <w:b />
  </w:rPr>
  <w:t>text</w:t>
</w:r>
The w: that begins each line indicates that this information is describing an Office Word 2007 document. You will see different prefixes in your Microsoft Office Excel 2007 and Microsoft Office PowerPoint 2007 documents. Also notice that each tag is surrounded by angle brackets (<>).
As with HTML source code, XML tags used to describe content are usually paired, and the second of the pair (the end tag) begins with a slash character.
The section of code shown in the preceding sample is known as a run, noted by the w:r that introduces the first line of code. A run is a region of document content that contains similar properties.
To complete the structure, the entire content of the paragraph to which the word text belongs is stored between the two ends of a higher-level paired tag, not shown here, that indicates the start and end of the paragraph (<w:p> and </w:p>). The collection of paragraphs (and any other content) in the body of the document is in turn positioned within another paired tag (<w:body> and </w:body>).
The second and fourth lines in the sample make up a paired tag containing the formatting for the specified text. Notice that between those lines, the third line (<w:b />) simply indicates that the specified text is bold.
Because formatting information in Office Open XML is stored in a structure that defines where the formatting is to be applied, the specific formatting itself doesn't need a paired tag. If the text for this sample were also italicized, for example, the tag <w:i /> would appear on its own line, also between the lines of the same paired tag that contains the bold statement. Also notice that, because the bold (or italicized) statements stand on their own, they include a slash at the end of the single tag to indicate that there is no end tag for this statement. You'll see the slash at the end of other tags throughout this primer, wherever the item is not paired.
The specified paragraph text appears on the fifth line, between a pair of tags (<w:t> and </w:t>) that indicate it's the text being described.
The last line in the preceding example is the end tag that indicates the end of the description for this specified text.
If the preceding example seems to be quite a lot of work for one word, don't lose heart. It's just an example of how you see Word formatting applied to text in the XML markup, used here to demonstrate how code in the Office Open XML Formats is spelled out. Though it also serves to demonstrate why working in the XML wouldn't be considered an equal alternative to the built-in program features for many document editing needs, that's not the reason for this example. Understanding how to read XML structure will help you work more easily when you begin to use a document's XML in ways that can simplify your work and expand the possibilities.
Don't worry about trying to memorize any specific tags used in the preceding example. The important thing to take away from this is the general concept of how the XML markup is structured. Everything in XML is organized and spelled out, like driving directions that take no turn for granted. So, though the example given might seem like a lot of code for very little content, the fact that it's organized explicitly is the very thing that will make the tasks throughout this primer easy to understand even to those who are new to XML.
If you look at the markup for one of your own documents, you may see code similar to the preceding example along with additional attributes labeled w:rsidR and w:rsidRPr, each followed by an eight-character hexadecimal value. Those attributes and their values are a result of the feature Store Random Number To Improve Combine Accuracy, which you can find on the Privacy Options tab of the Trust Center. (See Chapter 23 of this book, "Using VBA and XML Together to Create Add-Ins," for more on using the Trust Center.)
Unless you intend to use the Combine feature (available from the Compare options on the Review tab) with a particular document, there's no benefit to enabling this option (but it is on by default). For the sake of simplicity, since these attributes are not essential to your documents, they're not included in any XML samples throughout this chapter.
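If you're curious how a program reads the same structure, the run described earlier in this section can be walked with a few lines of Python. This is only a minimal sketch: the function name describe_runs is made up for illustration, and the xmlns:w declaration is added to the fragment so that it can be parsed on its own (in a real document, that declaration appears once near the top of document.xml).

```python
import xml.etree.ElementTree as ET

# Namespace that the w: prefix stands for in Word document parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The sample run, wrapped in a paragraph tag so the fragment is complete.
PARAGRAPH_XML = (
    f'<w:p xmlns:w="{W_NS}">'
    "<w:r><w:rPr><w:b /></w:rPr><w:t>text</w:t></w:r>"
    "</w:p>"
)


def describe_runs(paragraph_xml):
    """Return a (text, is_bold, is_italic) tuple for each run in a paragraph.

    Simplification: a formatting tag counts as "on" whenever it is present.
    """
    ns = {"w": W_NS}
    paragraph = ET.fromstring(paragraph_xml)
    results = []
    for run in paragraph.findall("w:r", ns):
        # The text of a run lives between its <w:t> tags.
        text = "".join(t.text or "" for t in run.findall("w:t", ns))
        # Formatting sits inside the paired <w:rPr> tag, if there is one.
        props = run.find("w:rPr", ns)
        is_bold = props is not None and props.find("w:b", ns) is not None
        is_italic = props is not None and props.find("w:i", ns) is not None
        results.append((text, is_bold, is_italic))
    return results


print(describe_runs(PARAGRAPH_XML))  # [('text', True, False)]
```

The point here is the same as in the prose: the text and its formatting are stored as separate, explicitly labeled pieces inside the run.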
Understanding Key Terms
I'll introduce terms as they arise for each task, but there are a few terms that can be useful to note up front.
The Office Open XML Formats are actually compressed folders containing a set of files that work together. ZIP technology (the .zip file extension) is the method used to compress the files into a single unit, and the set of files that make up an Office Open XML Format document is referred to as the ZIP package.
Each file within the package is referred to as a document part.
When you read about XML, you often come across the word schema. An XML schema is a set of standards and rules that define a given XML structure. For example, multiple schemas are available for defining different components of Office Open XML, and you'll see reference to some of these in the document parts used for the tasks throughout this chapter. Anyone can freely use the schemas for the Office Open XML formats. Developers can also create their own custom schemas for custom document solutions. (Note, however, that creating schemas is an advanced XML skill that is beyond the scope of this chapter.)
XML Editing Options
Most professional developers use Microsoft Visual Studio for editing XML, but you certainly don't need to do that. You can use Notepad for the same purpose, or any of a wide range of programs from Microsoft Office SharePoint Designer 2007 to a number of freeware, shareware, and retail XML editors.
Many people who don't need a professional development platform for their work will use a freeware or shareware XML editor to see the XML hierarchy in a tree structure that's easy to read. When you edit XML in Notepad, it typically looks like running text with no manual line breaks.
For those who don't want to install another program for this purpose, you can use Microsoft Internet Explorer to view the XML in a hierarchical tree structure and easily find what you need, and then use Notepad to edit the XML. This is the approach I use for the examples throughout this primer.
You can also download XML Notepad 2007, a free XML editing tool from Microsoft. XML Notepad provides both an editor and a viewer, along with features such as drag-and-drop editing and error checking. However, using the editor in XML Notepad requires some knowledge of XML language structure. So, for those who are seeing XML for the first time in this chapter, start with the Microsoft Windows Notepad utility and consider moving up to XML Notepad once you get your bearings, if you find yourself yearning for a more structured editing environment.
That said, even if you're not using XML Notepad 2007 regularly to edit your code, it can be a handy tool for understanding the structure of your code and troubleshooting syntax errors, as discussed later in this chapter. So, you might want to download it sooner rather than later.
You don't need Microsoft Visual Studio 2008 for this purpose, but use it if you have access to it. Visual Studio 2008 knows the Office Open XML schemas, so you get timesaving benefits when you edit document parts using that application, including IntelliSense (AutoComplete) lists and automatic syntax checking.
When you open an XML file in Internet Explorer, you're likely to see a bar across the top of the screen indicating that active content was disabled. Right-click that bar and activate content to be able to expand and collapse sections of your code by using the minus signs you see beside each level of code that contains sublevels. For example, here's what the code shown earlier looks like when viewed in Internet Explorer.
The same text in Notepad looks like this:
Troubleshooting: The document won't open after I edit an XML file, but I know my code is correct.
Remember that a small syntax error (such as leaving off one of the angle brackets around a tag) in one XML file within a document can cause that document to be unreadable. However, if you know that the markup you typed is correct, there may be another reason that's just as easy to resolve. Some XML editors that display the XML markup in an easily readable tree structure may add formatting marks (such as tabs or line breaks) when you add code to that XML structure. When this happens, these formatting marks can be interpreted as a syntax error (just like a missing bracket) and cause the document to which that XML file belongs to become unreadable in its native program. If you don't know how to recognize unwanted formatting marks in your XML editor or if the file won't open in your XML editor, see the Inside Out tip titled "Using XML Notepad and Word to help find syntax errors" elsewhere in this book for steps to help you locate the error.
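For readers comfortable with a little scripting, a quick way to hunt for a missing bracket is to ask an XML parser where it gives up. The following is a minimal sketch using Python's standard library; the function name find_syntax_error is hypothetical. It checks only that the markup is well formed, not that the Office programs will accept it.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_syntax_error(xml_text):
    """Return None if xml_text is well formed, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        # ParseError.position holds the line and column where parsing failed.
        line, column = error.position
        return line, column, str(error)
    return None


good = f'<w:r xmlns:w="{W_NS}"><w:t>text</w:t></w:r>'
# Same markup with the closing angle bracket of </w:t> left off.
bad = f'<w:r xmlns:w="{W_NS}"><w:t>text</w:t</w:r>'

print(find_syntax_error(good))  # None
print(find_syntax_error(bad))   # a (line, column, message) tuple
```

The reported line and column give you a place to start looking in Notepad, which is especially helpful when the part is one long line of running text.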
Getting to Know the Office Open XML Formats
This section will show you how to access the ZIP package for an Office Open XML Format document and how to begin to make sense of what you find there. For the best results, I suggest that you take each subsection that follows step by step and be sure you understand and feel comfortable with the content before continuing onto the next.
Breaking Into Your Document
Because each of your 2007 Microsoft Office System documents is actually a ZIP package in disguise, you can just change the file extension to .zip to access all of the files in the package. There are a few ways to go about this.
Append the .zip file extension to the existing file name. To do this in Windows Explorer, or on the Windows Desktop, just click to select the file and then click again on the file name (this is slower than a double-click) to enter editing mode for the file name. For the same result, you can also press F2 once you select the file. Leave the existing file name and extension intact and just add .zip, so that you can open the package in Windows Explorer to see its content.
When you change the file extension in Windows Explorer or from the Windows Desktop, you'll see a warning that changing the file extension may make the file unusable. Just disregard this message and click Yes to confirm that you want to continue. (However, to protect your files, it's a good idea to save a copy of the document with its original file extension before appending the .zip extension or beginning to make changes in the XML.)
Renaming the file to a .zip extension is easier to do if you are viewing file extensions. If you don't see the extension for your Office Open XML Format file (such as .docx), change your setting in Windows Explorer to view all file extensions. To do this in Windows Vista, in any Windows Explorer window, click the Organize button and then click Folder And Search Options. On the View tab, turn off the option Hide Extensions Of Known File Types and then click OK. To find the same option when working in Windows XP, in Windows Explorer, on the Tools menu, click Folder Options and then click View.
You can save a copy of your file with the .zip extension, while it's open in its source program, to bypass the step of changing the extension later. In the Save As dialog box, type the entire file name followed by .zip inside quotation marks. For example, to save a file named sample.docx as sample.zip, type "sample.zip" in the File Name box of the Save As dialog box. The file is still saved in whatever format is listed (so you still need to choose a macro-free or macro-enabled file format, for example, as needed), just as if you saved it first and appended the .zip extension later. The only difference is that the file's ZIP package is immediately available to you without taking an additional step after you close the file.
You might also want to use a tool that enables you to edit the package without ever changing the extension from its original Office Open XML Format extension. One open source tool that I like to use for this purpose is 7-Zip.
When you're editing the files in the ZIP package, you might not want to spend the time switching back and forth between the Office Open XML file extension (such as .docx) and the .zip extension. Well, you don't have to!
From the Open file dialog box in Word 2007, Excel 2007, or PowerPoint 2007, you can open documents that belong to the applicable program even when they're using the .zip file extension. To see your ZIP package file, just select All Files from the file type drop-down list beside the File Name text box and then select and open the file as you would when using its original extension. There's nothing else to it. Word 2007, Excel 2007, and PowerPoint 2007 know that the Office Open XML Formats are ZIP packages and read the XML within those packages whether the file is saved using .zip or a file extension that belongs to the program.
Note that you can also open the ZIP package in the appropriate program through the Open With options available when you right-click the ZIP package on the Windows Desktop or in Windows Explorer. If you do this, just be careful not to accidentally set the applicable program as the default for opening this file type, or you'll add an extra step for yourself every time you want to access the document parts in the ZIP package.
However, for ease of use as well as for sharing documents with Microsoft Office users of all experience levels, it's a good idea to make sure the file extension is changed back to its original state once you've finished editing the files in the ZIP package.
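The idea that the file extension is only a label can be confirmed from a script. The sketch below, in Python, is illustrative only: copy_as_zip is a made-up helper, and the sample.docx it works on is a stand-in package with placeholder content created on the spot (Word would not open it), so that the example runs without a real document.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def copy_as_zip(document_path):
    """Copy a document beside itself with .zip appended; return the new path.

    Working on a copy leaves the original file and extension untouched.
    """
    document_path = Path(document_path)
    zip_path = document_path.with_name(document_path.name + ".zip")
    shutil.copy2(document_path, zip_path)
    return zip_path


# Build a stand-in package so the example does not need a real document.
work_dir = Path(tempfile.mkdtemp())
sample = work_dir / "sample.docx"
with zipfile.ZipFile(sample, "w") as package:
    package.writestr("word/document.xml", "<placeholder />")

zip_copy = copy_as_zip(sample)
print(zip_copy.name)                 # sample.docx.zip
print(zipfile.is_zipfile(sample))    # True
print(zipfile.is_zipfile(zip_copy))  # True
```

Both files pass the ZIP check because nothing about the content changes when the name does, which is the same reason the Office programs can open the package under either extension.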
The Office Open XML File Structure
Once you change the file name to have the .zip extension, open the file in Windows Explorer. The example that follows walks you through the ZIP package of a simple Word document, originally saved with the .docx extension.
When you first view the ZIP package for a Word document in Windows Explorer, it will look something like this.
Note that, at the top level of the ZIP package that you see in the preceding example, Excel 2007 and PowerPoint 2007 files look very similar except that the folder named word in the example is named xl or ppt, respectively, for the applicable program.
The docProps folder is exactly what it sounds like—it contains the files for the document properties and application properties, ranging from author name to word count and software version.
The _rels folder contains a file named .rels, which defines the top-level relationships between the folders in the package. Note that additional relationship files may exist, depending on the document content, for files within a specific folder of the package (explained later in this section).
The relationship files are among the most important in the package because, without them, the various document parts in the package don't know how to work together.
The file [Content_Types].xml also exists at the top level of every document's ZIP package. This file identifies the content types included in the document. For example, in a Word 2007 document, this list typically includes such things as the main document, the fonts, styles, Theme, document properties, and application properties. Files with additional content types, such as diagrams or other graphics, will have additional content types identified.
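To get the same top-level view without Windows Explorer, you could list the package contents from a script. Here is a minimal Python sketch. The part names are typical of a simple Word document, but the package itself is a stand-in built in memory with placeholder content, and the helper names are invented for this example.

```python
import io
import zipfile
from collections import defaultdict

# Part names typical of a simple Word package (contents are placeholders).
PART_NAMES = [
    "[Content_Types].xml",
    "_rels/.rels",
    "docProps/app.xml",
    "docProps/core.xml",
    "word/document.xml",
    "word/styles.xml",
    "word/_rels/document.xml.rels",
]


def build_sample_package(part_names):
    """Return an in-memory ZIP holding a placeholder file for each name."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name in part_names:
            package.writestr(name, "<placeholder />")
    buffer.seek(0)
    return buffer


def parts_by_folder(package_file):
    """Group part names by top-level folder ('' means the package root)."""
    groups = defaultdict(list)
    with zipfile.ZipFile(package_file) as package:
        for name in package.namelist():
            folder, _, rest = name.partition("/")
            if rest:
                groups[folder].append(rest)
            else:
                groups[""].append(name)
    return dict(groups)


groups = parts_by_folder(build_sample_package(PART_NAMES))
for folder in sorted(groups):
    print(folder or "(top level)", groups[folder])
```

To try it on one of your own files, pass the path to a copy of the document instead of the in-memory stand-in.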
Exploring a bit further, when you open the folder named word, you see something similar to the following image.
A new Word document contains XML files for the fonts, styles, settings (such as the saved zoom setting and default tab stops), and Web settings, whether or not formatting related to these items has been applied in the document. If headers, footers, footnotes, graphics, comments, or other content types have been added, each of them will have its own XML document part as well.
In the ZIP packages for Excel 2007 and PowerPoint 2007 files, you'll see a similar organization, with XML document parts for file components (such as styles.xml in Excel 2007 or tableStyles.xml in PowerPoint). Additionally, the xl folder in an Excel ZIP package contains a worksheets folder by default, because there is a separate XML document part for each sheet in the workbook. The ppt folder in a PowerPoint ZIP package also contains folders named slides, slideLayouts, and slideMasters, by default.
In addition to the XML document parts you see in the preceding image, notice the theme folder—which exists in the program-specific folder (word, xl, or ppt) for Word 2007, Excel 2007, and PowerPoint 2007 ZIP packages. The file contained in this theme folder contains all document Theme settings applied in the document. It is because of this file that you're able to share custom Themes by sharing documents, using the Browse For Themes feature at the bottom of each Themes gallery.
The _rels folder inside the program-specific folder defines the relationships between the parts inside the program-specific folder. The relationship file contained in this _rels folder is called document.xml.rels for Word documents, presentation.xml.rels for PowerPoint 2007 documents, and workbook.xml.rels for Excel 2007 documents.
Depending on the content in a given folder, its _rels folder might contain more than one file. For example, if a header exists in a Word document, the word folder contains a header part (typically named header1.xml), and if that header has relationships of its own, such as a picture, the _rels folder contains a file named header1.xml.rels.
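The naming pattern for relationship files is regular enough to express in a couple of lines. This Python sketch assumes only the convention described above (a _rels folder beside the part, with .rels appended to the part's file name); the function name rels_path_for is hypothetical.

```python
import posixpath


def rels_path_for(part_name):
    """Return the package path of the relationship file for a given part.

    Paths inside a ZIP package use forward slashes, so posixpath is used
    regardless of the operating system. An empty part name stands for the
    package itself.
    """
    folder, file_name = posixpath.split(part_name)
    return posixpath.join(folder, "_rels", file_name + ".rels")


print(rels_path_for("word/document.xml"))     # word/_rels/document.xml.rels
print(rels_path_for("ppt/presentation.xml"))  # ppt/_rels/presentation.xml.rels
print(rels_path_for("xl/workbook.xml"))       # xl/_rels/workbook.xml.rels
print(rels_path_for(""))                      # _rels/.rels
```

Notice that the last call produces the top-level .rels file: it follows the same pattern, with the package itself standing in for the part.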
Content in your document from other sources (such as embedded objects, media files, or macros) is stored either in its original format (as is the case for picture files) or as a binary file (.bin file extension). Because of this, you can save time on many tasks related to working with media files (such as pictures), as discussed in the section "Editing and Managing Documents Through XML," elsewhere in this book.
As mentioned at the beginning of this section, the ZIP package shown in the two preceding images is for a .docx file. Remember that the x at the end of the file extension indicates that it's a macro-free file format. If this were, instead, the package for a .docm file, you would also see a file named vbaData.xml and one named vbaProject.bin.
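Because the macro project is stored as its own file, you can tell whether a package carries one just by looking at the list of names. The following Python sketch shows the idea; has_vba_project and the two in-memory stand-in packages are invented for illustration and contain placeholder content only.

```python
import io
import zipfile


def build_package(part_names):
    """Return an in-memory ZIP with a placeholder file for each name."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name in part_names:
            package.writestr(name, b"placeholder")
    buffer.seek(0)
    return buffer


def has_vba_project(package_file):
    """Return True if any file in the package is named vbaProject.bin."""
    with zipfile.ZipFile(package_file) as package:
        return any(
            name.rsplit("/", 1)[-1] == "vbaProject.bin"
            for name in package.namelist()
        )


macro_free = build_package(["word/document.xml", "word/styles.xml"])
macro_enabled = build_package(
    ["word/document.xml", "word/vbaData.xml", "word/vbaProject.bin"]
)

print(has_vba_project(macro_free))     # False
print(has_vba_project(macro_enabled))  # True
```

This checks only for the presence of the file; it says nothing about what the macros do.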
If you return to the top level of the ZIP package and then open the docProps folder, the following is what you'll see.
By default, this folder contains the files app.xml (for application properties such as word count and program version) and core.xml (for document properties such as the author and subject shown in the Document Properties summary information). Additionally, if you use the options to save a preview picture or a thumbnail for your document, you see a thumbnail image file in the docProps folder. For Word 2007 and Excel 2007, this is a .wmf file, and for PowerPoint 2007 it is a .jpeg file.
If you're running the 2007 Microsoft Office system on Windows Vista, you'll find an option in the Save As dialog box in Word 2007 or Excel 2007 to save a thumbnail image of your document. In PowerPoint 2007, or in all three programs when running Windows XP, you'll see the option Save Preview Picture in the Document Properties dialog box.
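The summary information stored in core.xml is plain XML, so it can be read like any other part. Below is a minimal Python sketch. The core.xml content shown is a shortened, made-up sample (the title, subject, and author values are placeholders), and read_core_properties is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Shortened sample of a core.xml part; the values are placeholders.
CORE_XML = f"""
<cp:coreProperties xmlns:cp="{NS['cp']}" xmlns:dc="{NS['dc']}">
  <dc:title>Sample Title</dc:title>
  <dc:subject>Sample Subject</dc:subject>
  <dc:creator>Sample Author</dc:creator>
</cp:coreProperties>
"""


def read_core_properties(core_xml):
    """Return the title, subject, and author found in a core.xml part."""
    root = ET.fromstring(core_xml)
    properties = {}
    for label, tag in (
        ("title", "dc:title"),
        ("subject", "dc:subject"),
        ("author", "dc:creator"),
    ):
        element = root.find(tag, NS)
        properties[label] = element.text if element is not None else None
    return properties


print(read_core_properties(CORE_XML))
```

A property that is missing from the part simply comes back as None, which makes this a quick way to confirm what personal information a document does or does not carry.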
Taking a Closer Look at Key Document Parts
Let's take a look at the XML contained in a few of the essential document parts, to help accustom you to reading this file content.
The image you see below is the [Content_Types].xml file for the sample ZIP package shown under the preceding heading, as seen in Internet Explorer.
The first line that you see in this or any XML file in an Office Open XML Format ZIP package will look very much like the first line in the preceding image. This line, known as the XML declaration, simply identifies the XML version and character encoding being used.
Notice that the second line, which begins <Types…, is the first half of a paired tag for which the end tag is at the bottom of this document. All other lines in this file are the definitions of the content types in this document.
On the second line, inside the Types tag, you see xmlns followed by a URL. The reference xmlns refers to an XML namespace, which is a required component in several document parts. Technical though this term might sound, a namespace is nothing more than a way to uniquely identify a specified item. The reason for this is that there can be no ambiguous names in the ZIP package (that is, the same name can't be used to refer to more than one item). So, the namespace essentially attaches itself to the content it identifies to become part of that content's name.
It's standard to use a Web address as the namespace, but note that the file doesn't attempt to read any data from the specified address. In fact, if you try to access some of the URLs you see in the files of an Office Open XML ZIP package, you'll find that some are not even valid addresses. Typically, the address in a namespace identifies the location of the source schema or other definitions used to define the structure of the items assigned to that namespace, and the Web page associated with that address may actually contain those definitions. But, any URL can be used as a namespace—the address itself is actually irrelevant to the code.
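You can watch a namespace become part of a name by parsing a tag and printing what the parser calls it. This small Python sketch is illustrative only; the second namespace (example.com) is invented to show the contrast, and the ct prefix is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

CONTENT_TYPES_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# The same tag written with a default namespace and with a prefix.
xml_a = f'<Types xmlns="{CONTENT_TYPES_NS}" />'
xml_b = f'<ct:Types xmlns:ct="{CONTENT_TYPES_NS}" />'
# A tag with the same short name but a different (made-up) namespace.
xml_c = '<Types xmlns="http://example.com/some-other-vocabulary" />'

tag_a = ET.fromstring(xml_a).tag
tag_b = ET.fromstring(xml_b).tag
tag_c = ET.fromstring(xml_c).tag

# The parser reports the full name as {namespace}shortname.
print(tag_a)
print(tag_a == tag_b)  # True: same namespace, so the same name
print(tag_a == tag_c)  # False: same short name, different namespace
```

Nothing in this example touches the network. The address is compared as text only, which is why it doesn't matter whether a page actually exists there.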
For the lines between the paired Types tags, notice that each defines one of the document parts you saw in the images of this sample ZIP package, under the preceding heading.
The first three lines in that group define the three file extensions included in this particular package, rels (the relationship files), wmf (the Windows metafile picture used for the document thumbnail), and xml.
The remaining lines in that group, each beginning with Override PartName, define the content type for each of the XML document parts that you saw in the word and docProps folders for this ZIP package. Take a look at just the first of the Override PartName lines, shown below. This one is for the main document content: the file document.xml that resides in the word folder.
Notice that the definition of the Override PartName that appears in quotation marks is actually the path to the specified file within the ZIP package. The content type definition that appears in quotation marks as the second half of that line of code is a reference to the content type definition defined in the applicable schema.
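To see how the extension entries and the Override PartName entries work together, here is a minimal Python sketch that looks up the content type for a given part. The [Content_Types].xml text is a shortened sample rather than the file from the sample package, and content_type_for is a hypothetical helper. The lookup order it uses (a specific override first, then the default for the file extension) is how the two kinds of entries are meant to relate.

```python
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Shortened sample of a [Content_Types].xml file.
CONTENT_TYPES_XML = f"""
<Types xmlns="{CT_NS}">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="wmf" ContentType="image/x-wmf"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>
"""


def content_type_for(part_name, content_types_xml):
    """Return the content type declared for a part, or None if none applies."""
    ns = {"ct": CT_NS}
    root = ET.fromstring(content_types_xml)
    # An Override names one specific part by its path within the package.
    for override in root.findall("ct:Override", ns):
        if override.get("PartName") == part_name:
            return override.get("ContentType")
    # Otherwise fall back to the Default entry for the file extension.
    extension = part_name.rsplit(".", 1)[-1].lower()
    for default in root.findall("ct:Default", ns):
        if default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None


print(content_type_for("/word/document.xml", CONTENT_TYPES_XML))
print(content_type_for("/_rels/.rels", CONTENT_TYPES_XML))
```

The first call is answered by the Override line and the second by the rels extension entry.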
The following image shows you the content of the .rels file in the top-level _rels folder shown earlier for the sample ZIP package.
The .rels file should open without issue in Internet Explorer. But, if this doesn't work for you, append the .xml file extension to a copy of the .rels file, just for viewing purposes. Also note that, when in the ZIP package, files will only open in their default assigned program. To be able to open a document part in both Internet Explorer and Notepad, as needed, copy the file out of the ZIP package. Then, right-click the file and point to Open With to select the program you need.
Notice that, although the content of the .rels file is very different from the content of the [Content_Types].xml file, the concept of the structure is the same. That is, the first line defines the XML standard being used, and the second line opens the paired tag that stores the core file content and specifies a namespace for the content that appears between the lines of the paired tag.
Take a look at one of the relationship definitions from the .rels file—the one for the main document.xml document part. Notice that each relationship contains three parts—the ID, the Type, and the Target.
An ID is typically named rId#. This structure is not required, however, so you might occasionally see relationships with different IDs.
The Type uses a type defined in the applicable schema, which appears as a Web address. As with an XML namespace, the document doesn't need to read data from that address. However, in this case, the Type is a specified element of the applicable schema and does need to be a relationship type recognized by the Office Open XML structure.
The Target, as you likely recognize, is the address within the package, where the referenced file appears. When you create a relationship yourself, it's essential that this be correct, because the relationship will do no good if it can't find the specified file.
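Since a relationship is only useful when its Target points at a file that really exists, a quick script can cross-check the two. The Python sketch below is a rough illustration: the .rels content is a typical top-level sample, the list of part names is deliberately missing one file, and both helper names are made up.

```python
import xml.etree.ElementTree as ET

NS = {"r": "http://schemas.openxmlformats.org/package/2006/relationships"}
OFFICE_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PACKAGE_REL = "http://schemas.openxmlformats.org/package/2006/relationships"

# Typical content of the top-level .rels file for a Word document.
RELS_XML = f"""
<Relationships xmlns="{NS['r']}">
  <Relationship Id="rId1" Type="{OFFICE_REL}/officeDocument" Target="word/document.xml"/>
  <Relationship Id="rId2" Type="{PACKAGE_REL}/metadata/core-properties" Target="docProps/core.xml"/>
  <Relationship Id="rId3" Type="{OFFICE_REL}/extended-properties" Target="docProps/app.xml"/>
</Relationships>
"""


def read_relationships(rels_xml):
    """Return the Id, Type, and Target of each relationship as a dict."""
    root = ET.fromstring(rels_xml)
    return [
        {"Id": rel.get("Id"), "Type": rel.get("Type"), "Target": rel.get("Target")}
        for rel in root.findall("r:Relationship", NS)
    ]


def missing_targets(relationships, part_names):
    """Return the targets that do not match any file name in the package."""
    return [rel["Target"] for rel in relationships if rel["Target"] not in part_names]


relationships = read_relationships(RELS_XML)
# Pretend app.xml was accidentally deleted from the package.
parts_in_package = {"word/document.xml", "docProps/core.xml"}

print([rel["Id"] for rel in relationships])              # ['rId1', 'rId2', 'rId3']
print(missing_targets(relationships, parts_in_package))  # ['docProps/app.xml']
```

This simple comparison works for the top-level .rels file, where targets are written relative to the package root. Relationship files deeper in the package use targets relative to their own folder, so a fuller check would need to account for that.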
Depending on the content in your files, you might run across defined relationships in your .rels files that aren't used to specify files in the ZIP package and therefore might take on a slightly different structure for the relationship target. For example, notice the following relationship from a document.xml.rels file for a document that contains a hyperlink to the Microsoft home page.
Though the relationship ID and Type have the same structure as a relationship to a document part, notice that the target in this case is to an external hyperlink instead of a file in the package.
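In the markup, an external relationship of this kind is flagged with an extra attribute, TargetMode="External". As a minimal Python sketch, the following separates external targets from those that point to files inside the package. The two relationships in the sample are illustrative, and external_targets is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {"r": "http://schemas.openxmlformats.org/package/2006/relationships"}
OFFICE_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Sample document.xml.rels content: one internal part, one external hyperlink.
DOCUMENT_RELS_XML = f"""
<Relationships xmlns="{NS['r']}">
  <Relationship Id="rId1" Type="{OFFICE_REL}/styles" Target="styles.xml"/>
  <Relationship Id="rId2" Type="{OFFICE_REL}/hyperlink" Target="http://www.microsoft.com/" TargetMode="External"/>
</Relationships>
"""


def external_targets(rels_xml):
    """Return the targets of relationships marked TargetMode="External"."""
    root = ET.fromstring(rels_xml)
    return [
        rel.get("Target")
        for rel in root.findall("r:Relationship", NS)
        if rel.get("TargetMode") == "External"
    ]


print(external_targets(DOCUMENT_RELS_XML))  # ['http://www.microsoft.com/']
```

A listing like this can be a handy way to review every outside address a document links to without opening the document itself.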
When you open a file in its originating program (Word 2007, Excel 2007, or PowerPoint 2007), keep in mind that the .rels files are the first place the program looks to know how to put the pieces together for the purpose of opening that file.
To continue reading this chapter, see Advanced Microsoft Office Documents 2007 Edition Inside Out.
About the Author
Stephanie Krieger is a Microsoft Office System MVP and the author of two books, Advanced Microsoft Office Documents 2007 Edition Inside Out and Microsoft Office Document Designer. As a professional document consultant, Stephanie helps many global companies develop enterprise solutions for Microsoft Office on both platforms. She also frequently writes, presents, and creates content for Microsoft. You can reach Stephanie through her blog, arouet.net.