Transforming Word Documents into the XSL-FO Format
Alexei Gagarinov
RenderX
Mark Iverson
Microsoft Corporation
February 2005
Applies to:
Microsoft Office Word 2003
Microsoft Office 2003 Editions
Summary: Learn how to transform Word documents into the XSL Formatting Objects (XSL-FO) format. From the XSL-FO format, you can convert documents into formats such as Adobe Portable Document Format (PDF) and Hypertext Markup Language (HTML). (7 printed pages)
Download OfficeWordWordMLtoXSL-FOSample.exe.
Contents
Introduction to WordprocessingML and XSL-FO Formatting
Creating an XSL-FO Document from Word
Word Features Supported by the XSL-FO Style Sheet
Limitations to Transforming Documents from Word into the XSL-FO Format
Examining the XSL-FO Style Sheet
Extended Style Sheet for Transforming Word Documents to an XSL-FO Format
Conclusion
Additional Resources
Introduction to WordprocessingML and XSL-FO Formatting
Microsoft made customizing Microsoft Office Word documents much easier and simpler when it introduced a new XML file format called WordprocessingML with Microsoft Office Word 2003. WordprocessingML is an XML representation, or schema, of the Word document file format. The flexibility of XML allows you to export data seamlessly from a Word document using standard XML representations of objects in a document. The new WordprocessingML schema also provides easy access to contents of Word documents without programming efforts or knowledge of the internal format of a Word document.
XSL Formatting Objects
In 2001, before the advent of WordprocessingML, the W3C endorsed an XML formatting language known as XSL Formatting Objects (XSL-FO). XSL-FO is synonymous with eXtensible Stylesheet Language (XSL), one of three recommendations by the W3C's XSL working group. The three recommendations are:
- XSL Transformations (XSLT), for transforming XML files
- XML Path Language (XPath), for defining parts of an XML document
- Extensible Stylesheet Language (XSL), or XSL-FO, for XML formatting information
XSL-FO is an intermediate form that results from applying an XSLT style sheet to an XML structured document. The XSL-FO form describes how pages appear when presented to a reader, such as a Web browser. Currently, there are no readers that directly interpret an XSL-FO document. To interpret an XSL-FO document, you must run it through a formatter, along with other data, such as graphics and font metrics, to create a final displayable or printable file. Possible formats for the resulting file include Adobe's Portable Document Format (PDF) and Hypertext Markup Language (HTML).
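To make the shape of such a document concrete, the following is a minimal sketch that builds a one-page XSL-FO document with Python's standard library. It is an illustration only and is not taken from the Word2FO.xsl style sheet. The master name "page", the page size, and the margins are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Return the namespace-qualified name for an XSL-FO element."""
    return f"{{{FO_NS}}}{tag}"


def build_minimal_fo(text):
    """Build a one-page XSL-FO document containing a single block of text."""
    root = ET.Element(fo("root"))

    # Page geometry: one page master with a body region.
    # The master name and dimensions are illustrative assumptions.
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "page",
        "page-width": "8.5in",
        "page-height": "11in",
        "margin-top": "1in",
        "margin-bottom": "1in",
        "margin-left": "1in",
        "margin-right": "1in",
    })
    ET.SubElement(page, fo("region-body"))

    # Content: a page sequence that flows text into the body region.
    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "page"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})
    block = ET.SubElement(flow, fo("block"), {"font-family": "serif", "font-size": "12pt"})
    block.text = text
    return root


document = build_minimal_fo("Hello from XSL-FO")
fo_markup = ET.tostring(document, encoding="unicode")
print(fo_markup)
```

The printed markup is what a formatter would take as input. Nothing in it is displayable until a formatter lays out the pages.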
When compared to Cascading Style Sheets (CSS), XSL-FO provides a more sophisticated visual layout model. You can use CSS to apply specific style elements to an XML or HTML document. By contrast, XSL-FO is a language for describing a complete document. It includes everything needed to paginate and format a document. Some of the formatting supported by XSL-FO, but not by CSS, includes right-to-left and top-to-bottom text, footnotes, margin notes, page numbers in cross-references, and more. Note that while CSS is primarily intended for use on the Web, XSL-FO is designed for broader use. As an example, you could use an XSL-FO document to lay out an XML document as a printed book. You could write a completely separate XSLT style sheet to transform the same XML document into HTML.
Using XSL-FO and WordprocessingML to Transform Documents
The advent of WordprocessingML created a powerful use for XSL-FO. You can use WordprocessingML and XSL-FO to transform Word documents into other viewable or printable industry adopted formats, such as PDF or HTML. The promise of XSL-FO as an intermediate format led to the creation of several tools that facilitate the transformation of XSL-FO documents into PDF, HTML, and other types of files. This article specifically focuses on various style sheets created explicitly for converting WordprocessingML files into XSL-FO documents.
Creating an XSL-FO Document from Word
The style sheet presented with this article is designed for using WordprocessingML to convert documents to the XSL-FO format. There are two simple ways to obtain an XSL-FO document directly from Word using the style sheet.
- Save a Word document as XML by using the Save As command in Word and then use a third-party parser to convert it into an XSL-FO document.
- Apply a transformation when using the Save As command in Word. To do this:
In the Save As dialog, check the Apply transform box.
Note Ensure that the Save data only checkbox is cleared. If Save data only is checked when you apply the transform, then Word discards the formatting stored in the document.
Click Transform...
In the Choose an XML Transform dialog, browse to the Word2FO.xsl style sheet and click Open.
Each of these two methods results in a file with an XML extension. While you may expect something like a .FO extension, the XML extension is intentional. The Word2FO.xsl style sheet creates an XSL-FO document regardless of the resulting extension.
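Because the file extension does not distinguish the output from the input, one way to tell the two apart is to inspect the root element. The following sketch assumes the standard XSL-FO namespace and the Word 2003 WordprocessingML namespace. The function name classify_xml and the sample strings are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Root elements in Clark notation ({namespace}local-name).
FO_ROOT = "{http://www.w3.org/1999/XSL/Format}root"
WORDML_ROOT = "{http://schemas.microsoft.com/office/word/2003/wordml}wordDocument"


def classify_xml(markup):
    """Classify XML markup by its root element rather than by file extension."""
    root = ET.fromstring(markup)
    if root.tag == FO_ROOT:
        return "xsl-fo"
    if root.tag == WORDML_ROOT:
        return "wordprocessingml"
    return "other"


# Tiny stand-ins for real files; both would carry an .xml extension on disk.
fo_sample = '<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format"/>'
word_sample = (
    '<w:wordDocument '
    'xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"/>'
)

print(classify_xml(fo_sample))
print(classify_xml(word_sample))
```

To check a saved file, read its contents and pass them to the same function.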
Word Features Supported by the XSL-FO Style Sheet
The following sections describe the Word features supported in the current version of the XSL-FO style sheet. They do not discuss specific implementation details or techniques. For a more extensive examination of style sheet design and workings, view the transforms in a text editor and read the comments included in the style sheet.
Word Styles
Styles are the recommended tool for altering the appearance of your document. Properly designed styles make your document consistent, simplify maintenance, and provide better results when you transfer data to other Microsoft Office applications or convert the document into XSL-FO format.
The Word2FO.xsl style sheet supports the use of styles in Word, including paragraph and inline styles, style derivation, and style overrides on specific elements.
Text Formatting
The style sheet provides support for the following text formatting properties:
Font attributes. Family, size, weight, slant, sub/superscripts, letter spacing, and color
Properties for spans of text. Decorations, case transformations, shadowed text
Note Not all decoration styles are supported.
Paragraph Formatting
The style sheet provides support for the following paragraph properties:
Alignment, margins, and indentation
Keeping lines of a paragraph together on a page or in a column
Forcing a page break before a paragraph
Widow and orphan control
Note A widow is the last line of a paragraph printed by itself at the top of a page. An orphan is the first line of a paragraph by itself at the bottom of a page.
Borders, padding, and background
Note Not all border styles are supported.
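As a rough illustration of how paragraph properties of this kind can correspond to XSL-FO properties, the following sketch maps a few WordprocessingML paragraph settings to attributes of fo:block. It does not reproduce the logic of Word2FO.xsl. The choice of source elements (w:jc, w:keepLines, w:pageBreakBefore, w:widowControl), the mapping table, and the value 2 for widows and orphans are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
W = f"{{{W_NS}}}"

# Assumed mapping from Word justification values to XSL-FO text-align values.
JC_TO_TEXT_ALIGN = {"left": "start", "center": "center", "right": "end", "both": "justify"}


def is_on(ppr, name):
    """Return True when an on/off property element is present and not switched off."""
    element = ppr.find(W + name)
    return element is not None and element.get(W + "val", "on") != "off"


def block_attributes(ppr):
    """Map a w:pPr element to a dict of XSL-FO attributes for fo:block."""
    attrs = {}
    jc = ppr.find(W + "jc")
    if jc is not None:
        attrs["text-align"] = JC_TO_TEXT_ALIGN.get(jc.get(W + "val"), "start")
    if is_on(ppr, "keepLines"):
        attrs["keep-together.within-page"] = "always"
    if is_on(ppr, "pageBreakBefore"):
        attrs["break-before"] = "page"
    if is_on(ppr, "widowControl"):
        # Require at least two lines at the top and bottom of a page.
        attrs["widows"] = "2"
        attrs["orphans"] = "2"
    return attrs


sample = f"""
<w:pPr xmlns:w="{W_NS}">
  <w:jc w:val="both"/>
  <w:keepLines/>
  <w:pageBreakBefore w:val="off"/>
  <w:widowControl/>
</w:pPr>
""".strip()

attrs = block_attributes(ET.fromstring(sample))
print(attrs)
```

In this sample the page break is explicitly switched off, so no break-before attribute is produced.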
Lists
The style sheet provides support for numbered, bulleted, and multi-level lists.
Tables
The following tabular element properties are supported:
- Horizontal and vertical spans (merged cells)
- Explicit setting of column widths
- Borders, padding, and background
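The three table properties listed above have direct counterparts in XSL-FO. The following sketch builds an fo:table with explicit column widths, a merged cell, and simple borders and padding. The helper name build_table, the tuple layout for cells, and the border and padding values are assumptions for illustration and are not taken from the style sheet.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Return the namespace-qualified name for an XSL-FO element."""
    return f"{{{FO_NS}}}{tag}"


def build_table(column_widths, rows):
    """Build an fo:table.

    column_widths: list of XSL-FO lengths, one per column.
    rows: list of rows; each row is a list of (text, colspan, rowspan) tuples.
    """
    table = ET.Element(fo("table"), {"table-layout": "fixed"})
    for width in column_widths:
        ET.SubElement(table, fo("table-column"), {"column-width": width})

    body = ET.SubElement(table, fo("table-body"))
    for row in rows:
        fo_row = ET.SubElement(body, fo("table-row"))
        for text, colspan, rowspan in row:
            attrs = {"border": "0.5pt solid black", "padding": "2pt"}
            if colspan > 1:
                attrs["number-columns-spanned"] = str(colspan)
            if rowspan > 1:
                attrs["number-rows-spanned"] = str(rowspan)
            cell = ET.SubElement(fo_row, fo("table-cell"), attrs)
            ET.SubElement(cell, fo("block")).text = text
    return table


table = build_table(
    ["2in", "3in"],
    [
        [("Merged heading", 2, 1)],        # one cell spanning both columns
        [("Left", 1, 1), ("Right", 1, 1)],
    ],
)
print(ET.tostring(table, encoding="unicode"))
```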
Images
The style sheet provides support for inline bitmap graphics, both embedded and linked from an external file.
Note Floating graphics and vector drawings are not supported in the current version of the style sheet.
Footnotes
Currently, the style sheet provides support for only the simplest footnote-numbering scheme.
Note Footnote layout may differ from that of Word, especially in multi-column sections.
Hyperlinks
The style sheet provides support for internal and external hyperlinks.
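In XSL-FO, both kinds of hyperlink are expressed with fo:basic-link, using either the internal-destination or the external-destination property. The following sketch shows the distinction. The helper name basic_link and the convention of marking internal targets with a leading "#" are assumptions for this example, not the approach used by Word2FO.xsl.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Return the namespace-qualified name for an XSL-FO element."""
    return f"{{{FO_NS}}}{tag}"


def basic_link(text, target):
    """Create an fo:basic-link.

    Targets starting with '#' are treated as internal (an id elsewhere in the
    same document); anything else is treated as an external URI.
    """
    if target.startswith("#"):
        attrs = {"internal-destination": target[1:]}
    else:
        attrs = {"external-destination": f"url('{target}')"}
    link = ET.Element(fo("basic-link"), attrs)
    link.text = text
    return link


internal = basic_link("See the Tables section", "#tables")
external = basic_link("W3C", "http://www.w3.org/")

print(ET.tostring(internal, encoding="unicode"))
print(ET.tostring(external, encoding="unicode"))
```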
Page Headers and Footers
The style sheet supports the use of page headers and footers. The implementation is subject to a number of limitations. For more information, see Using Page Headers and Footers.
Sections
The style sheet provides support for sections. Not all Word layouts render correctly in XSL-FO format. Specifically, XSL-FO has no provision for support of continuous section breaks. However, some XSL-FO rendering engines provide extensions to the XSL-FO specification in this area. For example, RenderX includes examples of using RenderX XEP software to support continuous section breaks in Word.
Limitations to Transforming Documents from Word into the XSL-FO Format
Both XSL-FO and WordprocessingML are powerful languages for describing richly formatted content. Their scopes overlap but do not match exactly. Therefore, in order to achieve maximum fidelity in layout patterns it is important to have a clear idea about the capabilities and limitations of transforming documents from one format into the other format. The following sections provide an overview of the differences, and some formatting tips for Word users to follow in order to achieve maximum visual fidelity.
Using Tab Marks to Lay Out Text
Many Office users use the TAB key to alter spacing, create indentations, or create table structures. While this method is acceptable with Word, there is no easy way to reproduce the same behavior with XSL-FO because it lacks the equivalent of a tab mark. Although the Word2FO.xsl style sheet provides some methods to represent tab stops, the resulting output may not look exactly like the original Word document. Users are encouraged to use different formatting techniques instead of tab marks. For example:
- To create first-line indents or negative indents in paragraphs. Replace tab marks by using explicit indentation.
- To create table structures. Replace tab marks with explicit tables.
- To create dotted rules in a table of contents, fill-in form fields, and so on. You can approximate this by applying a specific Underline style to a sequence of spaces, though this lacks the "stretch ability" found with the dotted rules in a table of contents formatted using tab marks.
Using Page Headers and Footers
In Word, the size of header and footer areas can change dynamically. The central region of the page increases or decreases when the contents of a side region changes. The XSL-FO format has no means to express dynamically changing header and footer areas. In the XSL-FO format, side regions must have fixed dimensions, regardless of their actual contents. Consequently, you must reserve the space for headers and footers manually by adjusting page margins.
To adjust page margins, drag the top or bottom ruler up or down so that the contents of the header or footer fit within the allotted space.
Note If there are headers or footers with different sizes within a section, you must break the section into multiple sections. Margins must be the same for each page within a section.
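The fixed-size constraint is visible in the page master itself: the header and footer regions take an explicit extent, and the body region must reserve matching margins so that the regions do not overlap. The following sketch builds such a page master. The function name build_page_master, the page size, and the numeric values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Return the namespace-qualified name for an XSL-FO element."""
    return f"{{{FO_NS}}}{tag}"


def pt(value):
    """Format a number as an XSL-FO length in points, e.g. 36 -> '36pt'."""
    return f"{value:g}pt"


def build_page_master(name, header_extent, footer_extent, page_margin=36):
    """Build an fo:simple-page-master with fixed header and footer regions.

    All lengths are in points. The body margins are set equal to the header
    and footer extents so that body text never overlaps the side regions.
    """
    master = ET.Element(fo("simple-page-master"), {
        "master-name": name,
        "page-width": "8.5in",
        "page-height": "11in",
        "margin-top": pt(page_margin),
        "margin-bottom": pt(page_margin),
        "margin-left": pt(page_margin),
        "margin-right": pt(page_margin),
    })
    # region-body comes first in the content model of simple-page-master.
    ET.SubElement(master, fo("region-body"), {
        "margin-top": pt(header_extent),
        "margin-bottom": pt(footer_extent),
    })
    ET.SubElement(master, fo("region-before"), {"extent": pt(header_extent)})
    ET.SubElement(master, fo("region-after"), {"extent": pt(footer_extent)})
    return master


master = build_page_master("section-1", header_extent=42.5, footer_extent=30)
print(ET.tostring(master, encoding="unicode"))
```

Because the extents are fixed per page master, a section whose headers differ in height needs a separate page master, which mirrors the advice above to split such a section in Word.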
Other Limitations to Transforming Documents to XSL-FO Format
The style sheet does not support the following:
Floating graphics and vector drawings (VML).
Translating complex borders into the XSL-FO format because they are not defined in the XSL-FO specification.
Reproducing adjacent continuous sections with different column counts by using XSL-FO elements only.
Note The third-party rendering engine, RenderX XEP, offers an extension element for this purpose. The Word2FO_ext.xsl style sheet uses this extension element to implement the feature.
Translating custom user schema. You must remove user tags before converting the file.
In addition, the style sheet supports only the "PAGE", "PAGENUM", and "REF" fields.
Examining the XSL-FO Style Sheet
The style sheet uses no extension elements so it remains usable by any XSL-FO rendering engine. The style sheet is designed to process WordprocessingML generated by Word. When the actual markup used by Word differs from the specification, the style sheet favors the real implementation, rather than the theory. For more information about specific implementation choices, see the comments within the style sheet.
Structure of the XSL-FO Style Sheet
The style sheet is divided into several sub-style sheets for ease of maintenance and support. The following are brief descriptions of the files:
- Word2FO.xsl. The main style sheet that defines global constants and processes contents of a Word XML document.
- pageLayout.xsl. Defines physical page layouts and creates page sequences.
- elementStructure.xsl. Defines presentation and structure of the resulting XSL-FO document.
- elementProperties.xsl. Controls translation of properties.
- profile.xsl. Contains parameters to adjust the text layout of the resulting XSL-FO document.
- Word2FO_ext.xsl. Contains RenderX extensions to the XSL-FO format. Intended for use with RenderX XEP formatter.
Global Parameters in the XSL-FO Style Sheet
A number of parameters, defined as global constants in profile.xsl, determine the behavior of the style sheet. Table 1 lists the available global variables and their descriptions. The values are selected to match Word 2003 choices as closely as possible.
Table 1. Description of global variables in the XSL-FO style sheet
Variable | Description |
---|---|
default-width.list-label | Specifies the default width of list labels |
default-font-size | Specifies the default font size of text |
default-font-size.list-label | Specifies the default font size of list labels |
default-font-size.symbol | Specifies the default font size of Symbol font |
default-widows | Specifies the default value for the widows property of paragraphs |
default-orphans | Specifies the default value for the orphans property of paragraphs |
white-space-collapse | Specifies whether to collapse white space characters |
default-header-extent | Specifies the default extent of page headers |
default-footer-extent | Specifies the default extent of page footers |
default-line-height | Specifies the factor for line-spacing calculation |
Extended Style Sheet for Transforming Word Documents to an XSL-FO Format
In addition to the Word2FO.xsl style sheet for the general transformation, RenderX includes an additional style sheet that supports several RenderX extensions to the Extensible Stylesheet Language (XSL) Version 1.0 Specification. You can use this additional style sheet, Word2FO_ext.xsl, in place of the general style sheet to produce an XSL-FO formatted document for further processing with RenderX XEP software. The additional features in this style sheet include:
- Mapping document properties from the Word document into meta information in the output file.
- Support for continuous section breaks as continuous flow sections in the output document. This enables such features as supporting different column layouts on a single page.
- Mapping outline levels on section headings to PDF hierarchical bookmarks.
Conclusion
XSL-FO and WordprocessingML, when used together, create an extremely useful tool for transforming Word documents into other standard formats. XSL-FO makes it possible for an increasing number of systems to work with Word by enabling easy conversion into formats such as PDF or HTML. Check out RenderX or CambridgeDocs XML Conversion and Publishing Technologies if you would like to learn more about companies with products that take advantage of XSL-FO and WordprocessingML.
Additional Resources
The following is a list of additional resources to assist you when developing custom solutions using these style sheets.