The Easy Way to Assemble Multiple Word Documents
One of the most common requests we hear about word processing documents is for a way to merge multiple documents into a single document. Today, I am going to show you how to leverage altChunks and version 2 of the Open XML SDK to easily create a robust document assembly solution in fewer than thirty lines of code.
Scenario – Document Assembly
Imagine a scenario where I'm a developer for a book publishing company that specializes in educational books. In my company, we typically have one or more authors write content for a specific chapter within a given book. Each of these chapters is written as a separate document. In this case, my company wants to write a book on the solar system, where the book is divided into chapters that correspond to unique components of the solar system, like the different planets and the sun. My company has asked me to write a solution that will merge all these documents, each representing a specific chapter, into one final document or book.
Solution
Before I get into the details of my solution I want to talk about the two different approaches I can take to solve this problem:
- Use altChunks to merge documents together
- Manually merge documents together
By far, the first option of using altChunks is the easiest method for merging multiple documents together. I think of altChunks as the "easy button" when it comes to importing external files into a document. Not only can altChunks import other WordprocessingML documents, but they can also import HTML, XML, RTF, or plain text.
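To make the non-Word case concrete, here is a minimal sketch that imports an HTML fragment through an altChunk. The class name HtmlChunkSketch, the method AppendHtml, and the chunkId parameter are placeholders of mine and are not part of the downloadable solution. The sketch is written against the released Open XML SDK rather than the CTP, and it assumes the HTML is plain ASCII or declares its own character set.

```csharp
using System.IO;
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class HtmlChunkSketch
{
    // Adds an HTML fragment to the end of an existing .docx as an altChunk.
    // chunkId is a hypothetical relationship ID; it must be unique within the main part.
    public static void AppendHtml(string docxPath, string html, string chunkId)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, true))
        {
            MainDocumentPart mainPart = doc.MainDocumentPart;

            // Same call as for WordprocessingML, only the part type differs
            AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
                AlternativeFormatImportPartType.Html, chunkId);

            using (MemoryStream stream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
            {
                chunk.FeedData(stream);
            }

            AltChunk altChunk = new AltChunk();
            altChunk.Id = chunkId;

            // Keep the body-level section properties (if any) as the last child
            Body body = mainPart.Document.Body;
            SectionProperties sectPr = body.GetFirstChild<SectionProperties>();
            if (sectPr != null)
            {
                body.InsertBefore(altChunk, sectPr);
            }
            else
            {
                body.AppendChild(altChunk);
            }

            mainPart.Document.Save();
        }
    }
}
```

A call might look like HtmlChunkSketch.AppendHtml("Book.docx", "<html><body><p>Hello</p></body></html>", "htmlChunk1"), where the file name and ID are again only examples.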
Manually merging multiple documents together is feasible, but requires you to handle a number of issues. For example, you will need to manually merge and deal with conflicts related to styles, bullets and numbering, comments, headers and footers, etc. Perhaps sometime in the future I will write a series of posts talking about how to merge documents manually.
Eric White has already written a blog post on how to use altChunks for document assembly using version 1 of the Open XML SDK. My post will talk about using version 2 of the SDK.
If you just want to jump straight into the code, feel free to download this solution here.
Step 1 – Create a Template
Those of you who have read my previous posts will know that setting up the right template is the first, and probably the most important, step in creating an Open XML format solution. This scenario is no exception.
The best way to accomplish this scenario is to create a template that represents the final look of the book I want to create. In this template I will merge a specific chapter in a specific location within the template. I can accomplish this task by taking advantage of content controls. Content controls provide an easy mechanism for specifying semantic regions within a document. In other words, content controls allow me to uniquely identify a specific region within a document.
In this case, I am going to add content controls within my template document that have the name of the chapter I want to add at that location. For example, as shown in the screenshot below, I have a content control that has the name "Earth." This name indicates that the chapter titled "Earth" needs to be merged in this location of the template.
Step 2 – Find Specific Content Controls
Now that I have set up the template, I need to programmatically locate content controls based on the alias, or name, of the content control, which represents the title of the chapter I want to merge. This task is pretty easy with version 2 of the SDK. Once I open a word processing document, I can find all content controls, represented as SdtBlock, whose alias value appears in the name of the source file I want to merge into the template, using the following code:
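using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

void MergeSourceDocument(string sourceFile, string destinationFile)
{
    // Open the template for editing (the second argument makes it editable)
    using (WordprocessingDocument myDoc =
        WordprocessingDocument.Open(destinationFile, true))
    {
        MainDocumentPart mainPart = myDoc.MainDocumentPart;

        // Find content controls that have the name of the source file
        // as an Alias value
        List<SdtBlock> sdtList = mainPart.Document.Descendants<SdtBlock>()
            .Where(s => sourceFile.Contains(
                s.SdtProperties.GetFirstChild<Alias>().Val.Value))
            .ToList();
        ...
    }
}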
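The lookup above assumes that every content control in the template has an alias, and Contains also matches partial names (an alias of "Mars" would match a source file named "Marshall.docx"). A more defensive variant might look like the sketch below. FindByChapter and ContentControlLookup are hypothetical names of mine. The sketch uses SdtAlias, which is the name the released SDK gives to the w:alias class that the CTP code in this post calls Alias, and it assumes the alias should equal the file name without its extension.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Wordprocessing;

static class ContentControlLookup
{
    // Returns the block-level content controls whose alias equals the
    // source file's name without extension (case-insensitive).
    public static List<SdtBlock> FindByChapter(Document document, string sourceFile)
    {
        string chapter = Path.GetFileNameWithoutExtension(sourceFile);

        return document.Descendants<SdtBlock>()
            .Where(sdt =>
            {
                // Content controls without properties or without an alias are skipped
                SdtAlias alias = sdt.SdtProperties?.GetFirstChild<SdtAlias>();
                string name = alias?.Val?.Value;
                return name != null
                    && string.Equals(name, chapter, StringComparison.OrdinalIgnoreCase);
            })
            .ToList();
    }
}
```

Exact matching is a design choice, not a requirement of altChunks. If chapter files carry prefixes such as "03 - Earth.docx", the substring match from the post may be the better fit.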
Step 3 – Add altChunk and Swap Out Content Control
I now need to swap out the found content control with the actual document I want to merge using altChunks. Merging documents using altChunks is pretty easy and consists of the following tasks:
- Add an altChunk part to the package
- Feed data from the intended merged document into the altChunk part
- Add an altChunk reference in the main document part
The following code accomplishes those three tasks as well as the task to swap out the content control for the altChunk:
if (sdtList.Count != 0)
{
    // id is a counter defined outside this snippet
    string altChunkId = "AltChunkId" + id;
    id++;

    // Add an altChunk part to the package and feed it the source document
    AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
        AlternativeFormatImportPartType.WordprocessingML, altChunkId);
    using (FileStream source = File.Open(sourceFile, FileMode.Open))
    {
        chunk.FeedData(source);
    }

    // Replace content control with altChunk information
    foreach (SdtBlock sdt in sdtList)
    {
        // Create one element per content control, since an element
        // can only have a single parent
        AltChunk altChunk = new AltChunk();
        altChunk.Id = altChunkId;

        OpenXmlElement parent = sdt.Parent;
        parent.InsertAfter(altChunk, sdt);
        sdt.Remove();
    }
    ...
}
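To build the whole book, the merge method is called once per chapter file. One way to drive that, sketched below, is to copy the template first so the original stays reusable, then merge each chapter into the copy. BookAssembler and Assemble are hypothetical names of mine, and the merge step is passed in as a delegate so the sketch does not depend on any particular SDK version.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class BookAssembler
{
    // Copies the template to outputPath, then calls mergeChapter(sourceFile, outputPath)
    // once for every chapter file, in the order given.
    public static void Assemble(
        string templatePath,
        string outputPath,
        IEnumerable<string> chapterFiles,
        Action<string, string> mergeChapter)
    {
        // Work on a copy so the template itself is never modified
        File.Copy(templatePath, outputPath, true);

        foreach (string chapterFile in chapterFiles)
        {
            mergeChapter(chapterFile, outputPath);
        }
    }
}
```

With the method from this post, a call could look like BookAssembler.Assemble("Template.docx", "SolarSystem.docx", Directory.GetFiles("Chapters", "*.docx"), MergeSourceDocument), where the file and folder names are only examples.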
End Result
Putting everything together and running my code, I will end up with a solar system book that is broken down into chapters representing unique components of the solar system. Using altChunks automatically ensures the following:
- Final document has consistent styles applied
- Images, comments, tracked changes, etc. are all included as part of my merged document
- Bullets and numbering just work
Here is a screenshot of the final solar system document:
[updated 1/9 due to bug in code - SdtProperties.GetFirstChild<Alias>() is the correct syntax]
Zeyad Rajabi
Comments
Anonymous
December 08, 2008
Zeyad Rajabi has written an interesting post on merging documents using altChunk. He shows how to assemble
Anonymous
December 08, 2008
I would like to know how to distinguish the current format from the new format version. What element in the OOXML package will show that it is an ISO/IEC 29500:2008 conforming document?
Anonymous
December 09, 2008
If you've been at one of the recent DII workshops, you may recall that some of us from Microsoft have
Anonymous
December 10, 2008
Hi. Thank you for this new sample. But I've got one question that I can't find an answer to: as this SDK V2.0 is a CTP, can I use it as a referenced assembly in a commercial product? I'm looking for the license text for the CTP on any Microsoft page, but I have found nothing. Thank you
Anonymous
December 11, 2008
Hi Gerald, Unfortunately, version 2 of the SDK is still a CTP, which means you cannot reference this assembly in a commercial product. We are aiming to get this SDK final by the time O14 is released. In the meantime, you can still accomplish all your scenarios with version 1 of the SDK.
Zeyad Rajabi
Anonymous
December 11, 2008
Among the technical posts not to be missed: How to assemble Word 2007 documents (use
Anonymous
December 11, 2008
Hm, interesting post, but it seems overkill for simple book assembly. Indeed, it doesn't look like the method would save any labor over other techniques, as the content controls still have to be created manually, in addition to the not insubstantial work of cobbling the snippets of code together into a functioning applet. Wouldn't it be simpler to use INCLUDETEXT fields? They do the same things as the combination of content controls above, and they don't require any coding or compilation. (Admittedly, as I've pointed out before, they are anything but robust, but maybe Office 14 will change that? Maybe fields will be merged into smart content controls?) Still, I can see how the technique could be very useful where large numbers of complex documents are electronically assembled (without human input), e.g. government or corporate reports.
Anonymous
December 11, 2008
Hi Zeyad, and thank you for your answer. You're right: I can use the first version of the SDK, even if this second version seems much more comfortable to use. I think that I can deal with this problem, but I'm still waiting for a final release of the SDK V2. O14 should be delivered in the next 6 months, shouldn't it?
Anonymous
December 12, 2008
@Francis The content controls are simply added to provide semantic structure to the template document. They are not necessary to accomplish the scenario of using altChunks. The IncludeText field is another approach for this scenario (thanks for the suggestion). That being said, there are some fundamental differences between using this field code and using altChunks. altChunks allow the document to actually be merged into the template document, while IncludeText links to the content of the file. Another difference is that IncludeText results in field code UI within Word, which may be overwhelming to the user (depending on how much content was merged).
@Gerald It's still too early to talk about RTM dates for O14. Sorry.
Zeyad Rajabi
Anonymous
December 12, 2008
Hi Brian, I have been reading your posts regarding Office Extensibility. I am presenting a situation in response to your following quote: "Again, if you have any specific scenarios or solutions you would like me to address in future posts please let me know" A common situation is to prepare invoices, etc. from information in the database. Clients usually have extremely customized invoice formats, but the data to be filled in is basically the same. I was trying to create a Word document with tokens of the form [$TokenName$] in it, and to replace the tokens with actual text programmatically. However, it was not as easy as I thought. Word 2007 splits up the token into multiple parts, depending upon the regions to be checked for spelling. That makes the scenario almost impossible. Can you suggest a solution? Also, it would be helpful to be able to add rows to an existing table based on a row template. In the same invoice scenario, we leave a single row for goods being delivered, and replicate that row, replacing the tokens, for all goods in the invoice.
Anonymous
December 16, 2008
Great suggestion, Rahul. I will add this scenario to the queue of planned posts. By the way, it would be much easier if you were to leverage content controls instead of tokens. More information to come soon.
Zeyad Rajabi
Anonymous
December 17, 2008
Hi Zeyad. I've read a lot of information about content controls, but I found no answer to my main problem: can I use content controls outside of Word 2007? In other words, can I use the content controls, modify the Custom XML values by code, and allow users who don't have Word 2007 (but only Word 2003, for example) to get the updated docx? I don't think that all users have Word 2007 installed on their computers, and I wonder if another way is available to use these controls. Thank you
Anonymous
December 17, 2008
Gerald, Good question. No, Word 2003 and prior do not support content controls. That being said, I mentioned content controls to you as more of a tool you can utilize in your solution rather than an end user feature.
Zeyad Rajabi
Anonymous
December 19, 2008
Brian Jones, you big hunk of techno-man you! You can share your post with me any time!
Anonymous
January 05, 2009
Happy New Year! I hope everyone had a good holiday. For my first post of the New Year I want to talk
Anonymous
January 19, 2009
First off I want to thank everyone for leaving comments suggesting future posts about the Open XML SDK.
Anonymous
March 05, 2009
I just want to let you guys know we are working on some server issues here, which is why some of the
Anonymous
March 13, 2009
There have been several requests made by people asking how to import a chart from one document type to
Anonymous
March 26, 2009
While I finish up another blog solution, this time on importing a table from Word into Excel, I thought
Anonymous
April 21, 2009
DocumentBuilder is an example class that’s part of the PowerTools for Open XML project that enables you