Using Annotations to Transform LINQ to XML Trees in an XSLT Style (Improved Approach)

Article
06/23/2008

You can use LINQ to XML to transform XML trees with the same level of power and expressability as with XSLT, and in many cases more than with XSLT.


One of the reasons that XSL is so powerful is that you can write multiple rules to transform a node. The rule that most specifically matches is the one that is applied.

To make this clear, consider the following source document:

<Parent>
<Heading>Heading 1 text</Heading>
<Heading>Heading 2 text</Heading>
</Parent>

We can specify a transform like this:

1 <?xml version='1.0'?>
2 <xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
3 <xsl:template match='/Parent'>
4 <Root>
5 <xsl:apply-templates/>
6 </Root>
7 </xsl:template>
8 <xsl:template match='Heading[1]'>
9 <SpecialHeading>
10 <xsl:value-of select='.'/>
11 </SpecialHeading>
12 </xsl:template>
13 <xsl:template match='Heading'>
14 <H1>
15 <xsl:value-of select='.'/>
16 </H1>
17 </xsl:template>
18 </xsl:stylesheet>

When this stylesheet is applied to the source document, we see:

<Root>
<SpecialHeading>Heading 1 text</SpecialHeading>
<H1>Heading 2 text</H1>
</Root>

The template defined starting on line 8 is the transform that is applied for the first <Heading> element, even though the template defined on line 13 also matches. The rule on line 8 matches more specifically, so it is the one that is applied. This is the power of XSL – you supply transforms to nodes based on a pattern to match. The specificity of the rule is significant. This allows you to write powerful transformations where you first handle exception cases, and then impose rules that handle all other cases in a general way.
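To check this precedence rule from C#, the following sketch runs the same three templates with XslCompiledTransform from System.Xml.Xsl. The class name XsltPrecedenceDemo and its Run method are invented for this illustration and are not part of the attached code. The source document is written without whitespace between elements so that no stray text nodes appear in the result. In XSLT 1.0, a pattern with a predicate such as Heading[1] gets a default priority of 0.5, while a bare name such as Heading gets 0, which is why the first template wins.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

public static class XsltPrecedenceDemo
{
    // Applies the three templates from the stylesheet above and returns the result tree.
    public static XElement Run()
    {
        string stylesheet =
            @"<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
                <xsl:template match='/Parent'>
                  <Root><xsl:apply-templates/></Root>
                </xsl:template>
                <xsl:template match='Heading[1]'>
                  <SpecialHeading><xsl:value-of select='.'/></SpecialHeading>
                </xsl:template>
                <xsl:template match='Heading'>
                  <H1><xsl:value-of select='.'/></H1>
                </xsl:template>
              </xsl:stylesheet>";

        string source =
            "<Parent><Heading>Heading 1 text</Heading><Heading>Heading 2 text</Heading></Parent>";

        var transform = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(stylesheet)))
        {
            transform.Load(styleReader);
        }

        // Write the result straight into a LINQ to XML document.
        var output = new XDocument();
        using (XmlReader input = XmlReader.Create(new StringReader(source)))
        using (XmlWriter writer = output.CreateWriter())
        {
            transform.Transform(input, writer);
        }
        return output.Root;
    }
}
```

Calling XsltPrecedenceDemo.Run() should return a Root element whose children are SpecialHeading followed by H1.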

Another reason that XSL is so powerful is that you can apply a transformation to a specific node, and use the <xsl:apply-templates> element to indicate that child nodes should be transformed per their own rules.

If we have this source document:

And transform it with this stylesheet:

<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
<xsl:template match='/Parent'>
<Root>
<xsl:apply-templates/>
</Root>
</xsl:template>
<xsl:template match='Heading'>
<H1>
<xsl:apply-templates/>
</H1>
</xsl:template>
<xsl:template match='Text'>
<t>
<xsl:value-of select='.'/>
</t>
</xsl:template>
</xsl:stylesheet>

It results in this XML:

We were able to specify separate transforms for the Heading and Text elements, and by using the <xsl:apply-templates> element, the template to transform the heading doesn't have to concern itself with transforms of the child Text element.

Some time ago, I blogged on a technique for using annotations to transform LINQ to XML trees in this same style – the style of XSLT. Ever since that time, I've been mulling over the approach, thinking about how to improve it. This post summarizes and shows my current thoughts about this approach to performing document-centric transformations using LINQ to XML.

The example presented here transforms an Open XML word processing document to XHTML in fewer than 100 lines of code (not counting the infrastructure code that enables the transformation). The transformation that I present converts paragraphs styled as Heading1 and Heading2 to h1 and h2 elements, and it also converts hyperlinks and bolded text. It even includes a rudimentary transformation of a Word table to an XHTML table.

Note: all of the code mentioned in this post is attached to this page.

Here is the word document that I transform:

Here is the rendering of the resulting XHTML:

The code presented here is not a complete, full-fidelity transform. However, it will serve to demonstrate the technique.

Note: I have plans to enhance this code (over time) so that this transformation is more complete. In particular, I plan on enhancing this code so that I can transform a DOCX into XHTML for my blog posts. I'd really like code presented in a blog post to have an automatically inserted "Copy Code" button above each code snippet.

The code presented here has the following features:

It allows for transformation of elements, text nodes, and comment nodes. To transform attributes, you specify a transform for the parent element node.
It allows for deletion of nodes that you don't care about. You can specify that a node not be carried over into the new tree.
It supports "mode", à la XSL transforms. This allows you to define multiple transformations of a single tree, and then perform the transform separately for each mode. You may want to have one transformation that produces a "table of contents" for the transformed document, and another transformation that transforms the main document part. You then assemble both results into a single document that contains both the table of contents and the contents of the document. This is a common technique in XSL transformations.
It contains a model for specifying the equivalent of the <xsl:apply-templates> element. With the approach presented here, you can specify the nodes to transform – the equivalent of specifying the "select" attribute of the <xsl:apply-templates> element. In XSL, you specify the select attribute using an XPath expression; using this approach, you specify the TransformSelect property of the ApplyTransforms class using a LINQ expression.

Document-Centric Transforms

Some XML documents are "document-centric". With such documents, you don't necessarily know the shape of child nodes of an element. For instance, an element that contains text may look like this:

<text>A phrase with <b>bold</b> and <i>italic</i> text.</text>

For any given <text> element, there may be any number of child <b> and <i> elements.

Open XML documents contain document-centric markup. For example, the body of the document can contain any number of paragraphs; tables are siblings to paragraphs; each paragraph can contain any number of formatted text runs; hyperlinks are expressed as sibling elements to text run elements. One of the primary characteristics of document-centric XML is that you do not know exactly which child elements any particular element will have, or in what order they will appear.

If you want to transform nodes in a tree where you don't necessarily know which particular children an element may have, then this approach that uses annotations is an effective approach. This approach allows you to specify the transformation in a minimum amount of code.
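As a small illustration of mixed content, the following sketch lists the child nodes of the <text> element shown earlier, in document order. The class and method names (MixedContentDemo, Describe) are invented for this example; only standard LINQ to XML calls are used.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class MixedContentDemo
{
    // Returns one label per child node of the <text> element, in document order.
    public static List<string> Describe()
    {
        XElement text = XElement.Parse(
            "<text>A phrase with <b>bold</b> and <i>italic</i> text.</text>");

        // Nodes() yields text nodes and child elements interleaved,
        // exactly as they occur in the source.
        return text.Nodes()
            .Select(n => n is XElement e
                ? "element " + e.Name.LocalName
                : "text '" + ((XText)n).Value + "'")
            .ToList();
    }
}
```

The list alternates between text nodes and elements (text, b, text, i, text). Code that transforms such an element cannot assume a fixed sequence of children, which is the situation the annotation approach is meant to handle.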

Overview of the Approach

The summary of the approach is:

First, annotate nodes in the tree with an object of type TransformAnnotation (a type introduced in the code attached to this page). The TransformAnnotation.Replace property contains the new, transformed node. If TransformAnnotation.Replace == null, then the node is removed from the transformed tree. TransformAnnotation.Mode contains a string that specifies the transform mode, analogous to mode in XSL.
Second, your code calls a function that iterates through the entire tree, creating a new tree in which each node is replaced with the node specified in the TransformAnnotation.Replace property. The code presented here implements the iteration and creation of the new tree in an extension method on XNode named Transform. This is a pretty simple method, only about 115 lines long.

In detail, the approach consists of:

Execute one or more LINQ to XML queries that return the set of nodes that you want to transform from one shape to another. For each node in the query, add a new TransformAnnotation object as an annotation to the node. The TransformAnnotation object contains a node that will replace the annotated node in the new, transformed tree.
For convenience, I've defined some extension methods on XNode (TransformRemove, and TransformReplace) that add the appropriate annotation. This results in cleaner code that specifies the rules of the transformation.
The new element (contained in a property of the annotation) can contain new child nodes; it can form a sub-tree with any desired shape.
You can add a "pseudo node" as a child node of the replacement node. This pseudo node tells the transform code to apply further transformations. It serves the same purpose as the <xsl:apply-templates> element in an XSL sequence constructor. To allow inserting the pseudo node into the children of the element, the ApplyTransforms class derives from the XText class. (It is artificial to have this class derive from XText. I would have derived from XNode; however, XNode contains an internal abstract method, CloneNode, which prevents derivation outside of the assembly.) This special node isn't copied into the new tree. Instead, it indicates to the transform code that further transformations should be performed and that their results should be inserted into the new tree at that position. ApplyTransforms contains a property TransformSelect of type IEnumerable<XNode>. If TransformSelect is not null, then the nodes in the TransformSelect collection are transformed and inserted. This allows us to write a query that evaluates to a collection of descendant nodes that should be transformed recursively. Alternatively, if TransformSelect is null, then the child nodes of the source element are iterated, transformed, and inserted. ApplyTransforms also contains a string property, Mode. As mentioned previously, Mode serves the same purpose as in XSL.

This is analogous to the specification of transforms in XSL. The query that selects a set of nodes is analogous to the XPath expression for a template. The code to create the new node in TransformAnnotation.Replace is analogous to the sequence constructor in XSL, and as mentioned, the ApplyTransforms node is analogous in function to the <xsl:apply-templates> element in XSL.
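The pieces described above can be sketched compactly. What follows is a minimal reconstruction written for this explanation, not the code attached to this page: it reuses the names from the post (TransformAnnotation, ApplyTransforms, TransformReplace, TransformRemove, Transform), but the signatures, the object return type of Transform, and the helper Expand are assumptions. It uses newer C# features (optional parameters, pattern matching) and does not handle XDocument nodes.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Records how an annotated source node should appear in the new tree.
public class TransformAnnotation
{
    public XNode Replace;   // replacement node; null means "remove this node"
    public string Mode;     // transform mode; null is the default mode
}

// Pseudo node: a placeholder marking where transformed nodes are inserted.
public class ApplyTransforms : XText
{
    public IEnumerable<XNode> TransformSelect;  // null means "children of the source element"
    public string Mode;

    public ApplyTransforms() : base("") { }

    public ApplyTransforms(IEnumerable<XNode> select, string mode = null) : base("")
    {
        TransformSelect = select;
        Mode = mode;
    }
}

public static class TransformSketch
{
    public static void TransformReplace(this XNode node, XNode replacement, string mode = null)
    {
        node.AddAnnotation(new TransformAnnotation { Replace = replacement, Mode = mode });
    }

    public static void TransformRemove(this XNode node, string mode = null)
    {
        node.AddAnnotation(new TransformAnnotation { Replace = null, Mode = mode });
    }

    // Builds the transformed counterpart of a source node (null if the node is removed).
    public static object Transform(this XNode node, string mode = null)
    {
        // Annotations come back in the order they were added, so the first rule wins.
        TransformAnnotation rule = node.Annotations<TransformAnnotation>()
            .FirstOrDefault(a => a.Mode == mode);
        if (rule != null)
            return rule.Replace == null ? null : Expand(rule.Replace, node);

        // No rule for this mode: copy an element and transform its children.
        if (node is XElement e)
            return new XElement(e.Name, e.Attributes(), e.Nodes().Select(n => n.Transform(mode)));

        // Other nodes are passed through; LINQ to XML clones them when they are added.
        return node;
    }

    // Copies a replacement subtree, swapping each ApplyTransforms placeholder
    // for the transformed nodes it selects.
    private static object Expand(XNode replacement, XNode source)
    {
        if (replacement is ApplyTransforms apply)
        {
            IEnumerable<XNode> selected = apply.TransformSelect
                ?? (source as XElement)?.Nodes()
                ?? Enumerable.Empty<XNode>();
            return selected.Select(n => n.Transform(apply.Mode)).ToList();
        }
        if (replacement is XElement e)
            return new XElement(e.Name, e.Attributes(), e.Nodes().Select(n => Expand(n, source)));
        return replacement;
    }
}

public static class SketchDemo
{
    // Annotates a tiny document and returns the transformed tree.
    public static XElement Run()
    {
        XElement doc = XElement.Parse(
            "<document><body><heading>One</heading><t>Text</t><heading>Two</heading></body></document>");
        doc.Element("body").TransformReplace(new XElement("Body", new ApplyTransforms()));
        doc.Descendants("heading").First()
            .TransformReplace(new XElement("SpecialHeading", new ApplyTransforms()));
        foreach (XElement heading in doc.Descendants("heading"))
            heading.TransformReplace(new XElement("H1", new ApplyTransforms()));
        doc.Descendants("t").First().TransformRemove();
        return (XElement)doc.Transform();
    }
}
```

In this sketch, SketchDemo.Run() yields a document element containing Body, which in turn holds SpecialHeading and H1; the t element is dropped. The replacement tree is rebuilt node by node rather than cloned, because cloning an XText-derived placeholder would turn it into a plain XText and lose the TransformSelect and Mode information.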

One primary advantage of this approach is that, as you formulate queries, you are always writing queries on the unmodified source tree. You don't need to concern yourself with how modifications to the tree affect the queries that you are writing.

Another primary advantage is that you can specify that any node found throughout the source tree be transformed according to the specified rule without concerning yourself with the specific child nodes of the node. Those child nodes can have their own rules to specify their transformation.

As mentioned at the top of this post, in XSL, it's possible to define multiple rules that apply to any specific node. The semantics of XSL specify that the most specific match found is the transform that is applied. This allows you to define very specific transforms for certain nodes. You can then define a more general transform that applies in all other cases. The approach presented here has analogous semantics – the first annotation added is the one that is used for the transform. You can add other annotations to the node, but the subsequent annotations are simply ignored by the transformation. The first annotation added is the effective one.
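The "first annotation wins" behavior rests on how LINQ to XML stores annotations: XObject.Annotation<T>() returns the first annotation of the requested type. The following sketch shows just that mechanism. RuleMarker is a hypothetical stand-in for TransformAnnotation, reduced to a name, and FirstAnnotationDemo is likewise invented for this illustration.

```csharp
using System.Xml.Linq;

// Hypothetical stand-in for TransformAnnotation, reduced to a rule name.
public class RuleMarker
{
    public string Name;
}

public static class FirstAnnotationDemo
{
    // Adds two rules to the same node and reports which one is effective.
    public static string EffectiveRule()
    {
        var heading = new XElement("heading", "Overview of the Technique");

        // The specific rule is added first, the general rule afterwards.
        heading.AddAnnotation(new RuleMarker { Name = "SpecialHeading" });
        heading.AddAnnotation(new RuleMarker { Name = "H1" });

        // Annotation<T>() returns the first annotation of type T,
        // so the rule that was added first is the one that is seen.
        return heading.Annotation<RuleMarker>().Name;
    }
}
```

EffectiveRule() returns "SpecialHeading". In practice this means the exception cases have to be annotated before the general cases, which mirrors the way specific XSL templates take precedence over general ones.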

The following is a simple example that shows how to transform a tree. It uses a special rule to transform the first heading to the element <SpecialHeading>. Other heading elements are transformed to <H1> elements. This demonstrates that the transform that we specified for the first heading takes precedence over transforms that were subsequently specified.

Example 1:

XElement sourceDocument = XElement.Parse(
    @"<document>
        <body>
          <heading>Overview of the Technique</heading>
          <t>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</t>
          <heading>The Technique in Detail</heading>
          <t>Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.</t>
          <heading>Summary</heading>
          <t>Pellentesque habitant morbi tristique.</t>
        </body>
      </document>");

// transform body to Body
sourceDocument
    .Element("body")
    .TransformReplace(new XElement("Body", new ApplyTransforms()));

// transform the first heading in a special way
sourceDocument
    .Descendants("heading")
    .First()
    .TransformReplace(new XElement("SpecialHeading", new ApplyTransforms()));

// transform every heading to H1 (ignored for the first heading, which already has a rule)
foreach (var item in sourceDocument.Descendants("heading"))
    item.TransformReplace(new XElement("H1", new ApplyTransforms()));

Console.WriteLine(sourceDocument.Transform());

This example produces the following output:

<document>
<Body>
<SpecialHeading>Overview of the Technique</SpecialHeading>
<t>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</t>
<H1>The Technique in Detail</H1>
<t>Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.</t>
<H1>Summary</H1>
<t>Pellentesque habitant morbi tristique.</t>
</Body>
</document>

The following example demonstrates the use of modes. It uses the same source document as the above example. It defines two transforms, one where the mode = "TOC", which transforms the document into a table of contents. The second transform passes no argument to the Transform method, which means that it matches when mode = null. This transforms the document into a different form for the body of the new document.

Example 2:

// define the root transform for the table of contents
sourceDocument.TransformReplace(
    new XElement("TableOfContents",
        new ApplyTransforms(sourceDocument.Element("body").Elements("heading"), "TOC")),
    "TOC");

// define the transform of each heading element for the table of contents
foreach (var item in sourceDocument.Descendants("heading"))
{
    item.TransformReplace(new XElement("TocItem", (string)item), "TOC");
}

// define the transform of the document body
sourceDocument.Element("body").TransformReplace(
    new XElement("Body",
        new ApplyTransforms(sourceDocument.Element("body").Elements())
    )
);

// define the transforms of heading elements for the document body
foreach (var item in sourceDocument.Descendants("heading"))
{
    item.TransformReplace(new XElement("H1", new ApplyTransforms()));
}

// define the transforms of t elements for the document body
foreach (var item in sourceDocument.Descendants("t"))
{
    item.TransformReplace(new XElement("Text", new ApplyTransforms()));
}

// assemble the new document with both TOC and body
XElement newDoc = new XElement("Root",
    sourceDocument.Transform("TOC"),
    sourceDocument.Element("body").Transform()
);

Console.WriteLine(newDoc);

This example produces:

<Root>
<TableOfContents>
<TocItem>Overview of the Technique</TocItem>
<TocItem>The Technique in Detail</TocItem>
<TocItem>Summary</TocItem>
</TableOfContents>
<Body>
<H1>Overview of the Technique</H1>
<Text>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</Text>
<H1>The Technique in Detail</H1>
<Text>Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.</Text>
<H1>Summary</H1>
<Text>Pellentesque habitant morbi tristique.</Text>
</Body>
</Root>

The final example presented here transforms an Open XML document into XHTML. It defines a number of transforms by annotating a variety of nodes. At the end, it adds annotations to every node in the tree indicating that the node should be deleted from the transformed tree. But this rule that deletes nodes is ignored for all nodes that have already been annotated.

Note: this code uses the Open XML SDK, which is available here.

DocxToHtml:

using (WordprocessingDocument wordDoc = WordprocessingDocument.Open("Test.docx", true))
{
XDocument doc = wordDoc.MainDocumentPart.GetXDocument();

XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
XNamespace r = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
XNamespace h = "http://www.w3.org/1999/xhtml";

// transform the document root element to the XHTML root element
doc.Root.TransformReplace(
new XElement(h + "html",
new XElement(h + "head",
new XElement(h + "title", "Test.docx")
),
new ApplyTransforms(doc.Root.Elements(w + "body"))
)
);

// transform the w:body element to the XHTML h:body element
doc.Element(w + "document").Element(w + "body").TransformReplace(
new XElement(h + "body", new ApplyTransforms()));

// transform every hyperlink in the document to the XHTML h:A element
foreach (var item in doc.Descendants(w + "hyperlink"))
{
item.TransformReplace(
new XElement(h + "A",
new XAttribute("href",
wordDoc.MainDocumentPart
.ExternalRelationships
.Where(x => x.Id == (string)item.Attribute(r + "id"))
.First()
.Uri
),
new XText(item.Elements(w + "r")
.Elements(w + "t")
.Select(s => (string)s).StringConcatenate())
)
);
}

// transform every Heading1 styled paragraph to the XHTML h:h1 element
foreach (var item in doc.Descendants(w + "p")
.Where(z => (string)z.Elements(w + "pPr")
.Elements(w + "pStyle")
.Attributes(w + "val")
.FirstOrDefault() == "Heading1"))
{
item.TransformReplace(new XElement(h + "h1", new ApplyTransforms()));
}

// transform every Heading2 styled paragraph to the XHTML h:h2 element
foreach (var item in doc.Descendants(w + "p")
.Where(z => (string)z.Elements(w + "pPr")
.Elements(w + "pStyle")
.Attributes(w + "val")
.FirstOrDefault() == "Heading2"))
{
item.TransformReplace(new XElement(h + "h2", new ApplyTransforms()));
}

// transform every text run that is styled as bold to the XHTML h:b element
foreach (var item in doc.Descendants(w + "r")
.Where(z => z.Elements(w + "rPr").Elements(w + "b").Any()))
{
item.TransformReplace(
new XElement(h + "b",
item.Elements(w + "t")
.Select(e => (string)e).StringConcatenate()));
}

// transform every text run that is not styled as bold to a text node that contains the
// text of the run.
foreach (var item in doc.Descendants(w + "r")
.Where(z => !z.Elements(w + "rPr").Elements(w + "b").Any()))
{
item.TransformReplace(
new XText(item.Elements(w + "t").Select(e => (string)e).StringConcatenate()));
}

// transform w:p to h:p
foreach (var item in doc.Descendants(w + "p"))
{
item.TransformReplace(new XElement(h + "p", new ApplyTransforms()));
}

// transform w:tbl to h:table
foreach (var item in doc.Descendants(w + "tbl"))
{
item.TransformReplace(
new XElement(h + "table",
new XAttribute("border", 1),
new ApplyTransforms()
)
);
}

// transform w:tr to h:tr
foreach (var item in doc.Descendants(w + "tr"))
{
item.TransformReplace(new XElement(h + "tr", new ApplyTransforms()));
}

// transform w:tc to h:td
foreach (var item in doc.Descendants(w + "tc"))
{
item.TransformReplace(new XElement(h + "td", new ApplyTransforms()));
}

// the following removes any nodes that haven't been replaced.
foreach (var item in doc.DescendantNodes())
{
item.TransformRemove();
}

XElement newDoc = (XElement)doc.Root.Transform();
newDoc.Save("test.html");
}

When run using the document attached to this post, it produces the following:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test.docx</title>
</head>
<body>
<h1>LINQ to XML Transformations in the Style of XSLT</h1>
<h2>Styled Text</h2>
<p>Some <b>bold</b> text.</p>
<p>Some normal text.</p>
<h2>Hyperlinks</h2>
<p>See my <A href="https://blogs.msdn.com/ericwhite">blog</A>.</p>
<h2>Tables</h2>
<p>This text introduces the following tables:</p>
<table border="1">
<tr>
<td>
<p>
<b>Order Number</b>
</p>
</td>
<td>
<p>
<b>Order Date</b>
</p>
</td>
<td>
<p>
<b>Amount</b>
</p>
</td>
</tr>
<tr>
<td>
<p>124245</p>
</td>
<td>
<p>10/24/2008</p>
</td>
<td>
<p>42.55</p>
</td>
</tr>
<tr>
<td>
<p>147867</p>
</td>
<td>
<p>10/31/2008</p>
</td>
<td>
<p>88.99</p>
</td>
</tr>
</table>
<p />
<p>Item Detail for Order 124245</p>
<table border="1">
<tr>
<td>
<p>
<b>Line Number</b>
</p>
</td>
<td>
<p>
<b>Item</b>
</p>
</td>
<td>
<p>
<b>Quantity</b>
</p>
</td>
</tr>
<tr>
<td>
<p>1</p>
</td>
<td>
<p>HH242</p>
</td>
<td>
<p>3</p>
</td>
</tr>
<tr>
<td>
<p>2</p>
</td>
<td>
<p>TY149</p>
</td>
<td>
<p>8</p>
</td>
</tr>
<tr>
<td>
<p>3</p>
</td>
<td>
<p>ZZTXT</p>
</td>
<td>
<p>4</p>
</td>
</tr>
</table>
<p />
</body>
</html>

Thanks to Dirk Myers who suggested that this approach could support modes.

DocxToHtml.zip

Comments

Anonymous
June 23, 2008
Very cool! Do you have performance metrics comparing this technique to XSLT? It would be so cool if you could talk the BizTalk team into making the BizTalk mapper output this instead of XSLT.
Anonymous
June 23, 2008
I don't yet have performance metrics. There is an open source XSL that does a fairly comprehensive transform of an Open XML document to html. I'd really like my DocxToHtml.cs to be as comprehensive as the XSL version. At that point, it would be fair to compare the performance of the two. Right now, the only transforms that I have are the trivial ones presented in this post, as I just wrote this code over the weekend. It wouldn't really be meaningful to define the same transformations in XSL and compare perf - an XSL transform must fire up the XSL transformation engine, which is a fixed perf cost for every transform. As I improve the DocxToHtml transform, I'll certainly post the improved code. And when it is comprehensive enough, we'll get numbers. I'm going to make a bet, however, that this code will perform very well when compared with XSLT.
Anonymous
June 24, 2008
XSLT has a lot of very important ideas that are locked up in an unfortunate syntax. Any effort to extract these ideas into a cleaner environment is a good one. However, I see nothing in your demonstration that convinces me you have accomplished this and certainly nothing to suggest: "LINQ to XML to transform XML trees with the same level of power and expressability as with XSLT, and in many cases more than with XSLT." Can you elaborate how this has the more power and expressability than XSLT?
Anonymous
June 24, 2008
Hi Sal, Yes, XSLT is very powerful, and has many important ideas in an unfortunate syntax. To me, the most important ideas are:

pattern matching / application of templates
ability to specify a transformation for a node, while applying further transformations deeper in the tree
when used properly, a pure, stateless functional approach

Using this approach, LINQ can also use pattern matching to select nodes to transform, and can specify transformations for any node while applying further transformations deeper in the tree. And of course, with lambda expressions, extension methods, deep support for generics, etc., LINQ/C# 3.0 certainly supports a natural syntax for a pure functional approach.

I personally find the syntax of LINQ, and of LINQ to XML, easier to read and write than XSLT, hence the expressability. I also consider the much richer type system in C# to be something that adds to the power and expressability.

With regards to power, the last example presented in this post includes referencing data that is contained in some C# objects. If attempting this transform with XSLT, you must first serialize this data into XML and then use the XML in a transformation, or you must escape into an external language to retrieve the data. Using LINQ to XML, you can write one single, pure, functional transformation that can use data from anywhere in the .NET stack. You could write a pure, functional transformation that uses data from a database, the file system, XML of course, JSON, web services, etc.

There are downsides to using LINQ: the language doesn't enforce a pure functional approach. You can easily break purity by mutating variables, etc. In contrast, with XSLT, you have to try hard to write impure code.

What has been missing in the C# 3.0 / LINQ conversation is a technique for expressing the pattern matching / templates approach. This post describes my approach for doing this. -Eric

Anonymous
June 24, 2008
I think the features of XSLT you reference are important but what makes an XSLT solution more powerful in the domain of XML -> XML transforms are "Literal Result Elements" which are cleaner syntactically than "new XElement(...)". XSLT 2.0 also has a much richer facility for specializing templates than you imply in your post (multiple modes, xsl:next-match, xsl:import, named templates, priorities, etc.) Also, using a subset of XPath as a pattern language is very powerful. Your LINQ based approach is far too procedural for my tastes. However, outside of XML -> XML, XSLT starts to trip over itself.
Anonymous
June 27, 2008
Sal, This conversation has really provoked thought about how to specify these types of transformations in a more declarative way. This conversation has been very valuable to me; it has made me see this from another perspective. One thing that wasn't clear - I see the foreach iterations as just another form of a function call, maybe more accessible to some developers. So the specification becomes a collection of calls to functions, where each call does annotation of the source tree. So in my mind, I hadn't deviated from a pure, functional transform. However, the code LOOKS much more procedural. Rewriting the transforms using the ForEach extension method helps make it look less procedural. But there are ways to make the transformation specification even more declarative. I'm currently working on another iteration of the approach. Should be posted sometime in the next little while. My only motivation here is to think about new ways to do transformation, not to promote one technology over another. I just have to say, there are scenarios where XSLT is a fantastic technology, and probably the only feasible technology. In certain scenarios, I consider XSLT as the only option. I personally know of examples where they are using primarily a natively compiled XSLT processor; nothing else performs well enough. Anyway, please don't take my ramblings as an expression of disrespect for XSLT. It is the wonderful ideas of XSLT that I have missed when writing LINQ to XML transformations. And also, any ideas that I express are completely my own, and not Microsoft's. :-) -Eric
Anonymous
June 30, 2008
I'm pretty familiar with (and hence?) very comfortable with LINQ syntax. Not so with XSLT. I wonder if you could extend the above example to be a "LINQ to XML to XSLT" compiler? It would be so cool if, say, I could use LINQPad to quickly write a transform and then have that magically compiled to XSLT (which is so much easier to release because you don't have to worry about .NET 3.5 being installed etc., and will work on so many older systems).
Anonymous
June 30, 2008
A few interesting links of the week for your Open XML projects: Technical: Custom XML: repeat
Anonymous
June 30, 2008
Vishal, In theory, it could be possible. The most work would be generation of XPath expressions from an expression tree. One issue - there are LINQ expressions that have no correlating XPath expression. But it's a worthy activity to learn XSLT. I really appreciated the book by Michael Kay. -Eric
Anonymous
December 10, 2009
What about the following:

superscript and subscript
table heading alignment (top, bottom, centered)

Also, it sometimes bolds text which is clearly not bold.

Anonymous
December 10, 2009
Hey Alex, I'm currently working on a much better conversion of word-proc docs to Html. I should be blogging on this in the next couple of weeks. But caveat, the first version will not contain formatting, just an accurate representation of content. I'm planning on completing the version with formatting sometime early next year. -Eric
