New default XML formats in the next version of Office

I’m Brian Jones, a program manager on the Word team. I’ve been at Microsoft for about 6 years, and have been working on XML support in Word and across Office for a good percentage of that time. I thought I’d set up this blog to talk with people about what we’re doing in the next version of Office around XML. When we first started talking about Office 2003 and the features we were going to provide around XML, there were a lot of misinterpretations. It was frustrating not having an easy way to answer questions, provide insight, and clear up any misunderstandings. I didn’t want to make the same mistake again, so I told everyone that I wanted to start blogging as soon as we announced the new "Microsoft Office Open XML Formats" (still getting used to the official name). The PR folks said they thought it would be OK, and they even decided to post some links to this site from the different marketing materials being released, which is pretty cool.

I’ve been waiting a long time for this day, and it’s awesome that I’m able to talk about this so early in the product cycle. I made a post last week talking about Office 2003 XML, but that was just more of a test to see how this whole blog thing works. The real reason for setting up this blog was to talk about the new default XML formats in the next version of Office (although I’m sure I’ll spend a good amount of time talking about 2003 as well).

I’m hoping that people will have tons of comments and questions because I’m eager to spend time discussing this topic (I already do with the people I work with so why not branch out a bit). I’d like to find out what kinds of questions people have, and what kind of additional information or tools you’d like to see. The whole point of these new formats is for them to be open to anyone to work with, so I want to make sure we make it as easy as possible.

If you haven’t already read the press release, it’s probably worthwhile since it gives a good overview of everything that’s happening. It is a press release though, so you’ll have to deal with it coming more from a marketing angle. You should be able to find it on the PressPass site.

I didn’t want to make this first post too long, but I do want to go into some of the things I think are the most important to understand about these new formats. I’ll definitely spend more time in future posts digging deeper on these different topics, as well as going into the goals behind the formats.

Open XML Formats Overview

To summarize really quickly what’s going on, there will be new XML formats for Word, Excel, and PowerPoint in the next version of Office, and they will be the default for each. Without getting too technical, here are some basic points I think are important:

  1. Open Format: These formats use XML and ZIP, and they will be fully documented. Anyone will be able to get the full specs on the formats and there will be a royalty free license for anyone that wants to work with the files.
  2. Compressed: Files saved in these new XML formats are less than 50% of the size of the equivalent file saved in the binary formats. This is because we take all of the XML parts that make up any given file, and then we ZIP them. We chose ZIP because it’s already widely in use today and we wanted these files to be easy to work with. (ZIP is a great container format. Of course I’m not the only one who thinks so… a number of other applications use ZIP for their files too.)
  3. Robust: Between the use of XML, ZIP, and good documentation, the files get a lot more robust. Because we compartmentalize each file into multiple parts within the ZIP, damage is more likely to be limited to individual parts and a lot less likely to corrupt the entire file. The files are also a lot easier to work with, so it’s less likely that people working on the files outside of Office will cause corruptions.
  4. Backward compatible: There will be updates to Office 2000, XP, and 2003 that will allow those versions to read and write this new format. You don’t have to use the new version of Office to take advantage of these formats. (I think this is really cool. I was a big proponent of doing this work.)
  5. Binary Format support: You can still use the current binary formats with the new version of Office. In fact, people can easily change to use the binary formats as the default if that’s what they’d rather do.
  6. New Extensions: The new formats will use new extensions (.docx, .pptx, .xlsx) so you can tell which format a file is in, but to the average end user they’ll still just behave like any other Office file. Double-click and it opens in the right application.
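
To make the "XML parts inside a ZIP" idea from points 1 through 3 a bit more concrete, here’s a rough Python sketch. It is not the actual Office format: the part names (document.xml, styles.xml) and their contents are made up purely for illustration. All it shows is that several XML parts can be zipped into one package and then inspected with ordinary ZIP tools.

```python
import io
import zipfile

# Hypothetical part names and contents, invented for illustration only;
# the real formats define their own parts and schemas.
PARTS = {
    "document.xml": "<doc>" + "<p>Hello, world.</p>" * 200 + "</doc>",
    "styles.xml": "<styles><style name='Normal'/></styles>",
}


def build_package(parts):
    """Zip each XML part into one in-memory package and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as package:
        for name, xml in parts.items():
            package.writestr(name, xml)
    return buffer.getvalue()


def list_parts(package_bytes):
    """Return (part name, uncompressed size, compressed size) for each part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return [
            (info.filename, info.file_size, info.compress_size)
            for info in package.infolist()
        ]


package_bytes = build_package(PARTS)
for name, size, packed in list_parts(package_bytes):
    print(f"{name}: {size} bytes of XML, {packed} bytes in the ZIP")
```

How much smaller the zipped parts come out depends entirely on the content, so don’t read the numbers from this toy example as a measurement of the real formats.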

I’ll definitely go into a lot more detail on these different points in future posts. Just to summarize though, I’m really happy with these new formats so far. Microsoft will build a lot of functionality around these formats for years to come, but I also hope other people outside of Microsoft will take advantage of them, since anyone who wants to can. You can look inside the files, make modifications, generate new files, add content, remove content, or any number of other things that people would want to do with an Office file.
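
As a rough idea of what "make modifications" could look like for any ZIP-plus-XML package, here’s a small Python sketch that opens a package, changes one XML part, and writes out a new package with the other parts untouched. Everything specific in it is an assumption for illustration: the part names, the `<doc>` and `<p>` elements, and the helper names `make_package`, `append_paragraph`, and `read_part` are all hypothetical and are not taken from the real specs.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def make_package(parts):
    """Build an in-memory ZIP package from a {part name: XML text} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, xml in parts.items():
            package.writestr(name, xml)
    return buffer.getvalue()


def append_paragraph(package_bytes, part_name, text):
    """Return a new package with a <p> element appended to one XML part."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for info in source.infolist():
            data = source.read(info.filename)
            if info.filename == part_name:
                # Parse only the part we care about; the rest is copied as-is.
                root = ET.fromstring(data)
                ET.SubElement(root, "p").text = text
                data = ET.tostring(root, encoding="unicode")
            target.writestr(info.filename, data)
    return output.getvalue()


def read_part(package_bytes, part_name):
    """Return the text of a single part from the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.read(part_name).decode("utf-8")


# Made-up package with two made-up parts.
original = make_package({
    "document.xml": "<doc><p>First</p></doc>",
    "styles.xml": "<styles/>",
})
updated = append_paragraph(original, "document.xml", "Second")
print(read_part(updated, "document.xml"))
```

The point of the sketch is just that nothing beyond a ZIP library and an XML parser is involved; a real tool would follow the published documentation for the part names and schemas.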

If you want some more information in a more official form, there are two whitepapers available. Here’s a brief overview of each one:



The Microsoft Office Open XML Formats: New File Formats for "Office 12"

This first whitepaper is a general overview of the file format, and is targeted at multiple audiences. It starts off with an introduction about what’s going on and also briefly touches on the history of the current binary formats and how we got to where we are today.


The Microsoft Office Open XML Formats: Preview for Developers

This paper talks more about the architecture of the formats and is targeted at developers. This paper has a similar introduction to the first (but from a slightly different angle). The last 7 or so pages of the paper go into solutions and what people can do with these files. It’s a great way to start thinking about the possibilities, and what types of things you can probably expect to see built on top of the format.


OK, that’s enough for now. Sorry this was such a long post, but I didn’t have time to make it shorter (I think that was Twain or Pascal?). I’m going to get some sleep, and then see what things people are curious to know more about. Talk to you all tomorrow.