EXSLT Meets XPath
Dare Obasanjo
Microsoft Corporation
November 12, 2003
Summary: Dare Obasanjo combines his work from a previous article on EXSLT with Oleg Tkachenko's article on creating multiple outputs from a single XSLT transformation to create a custom EXSLT library for the .NET Framework that also supports using EXSLT functions in regular XPath queries. (12 printed pages)
Download the xml11172003_sample.msi file.
Introduction
In the past few months, both guest author Oleg Tkachenko and I have written articles about extending the functionality of the XSLT implementation in the .NET Framework by adding extensions from the EXSLT community initiative. In my article, EXSLT: Enhancing the Power of XSLT, I implemented a number of EXSLT extension functions that provide features such as regular expression processing, set-based operations, and date and time manipulation to users of XSLT in the .NET Framework. Oleg Tkachenko followed this with his article, Producing Multiple Outputs from an XSL Transformation, in which he builds a mechanism for creating multiple output files from a single XSLT transformation by implementing the exsl:document extension element.
Based on the feedback I got in response to both articles, I decided to create a GotDotNet workspace for our combined EXSLT efforts. This article describes the various improvements that have been made to the project since it was launched.
Using EXSLT Functions with the XPathNavigator
Once I started using the EXSLT functions in my XSLT stylesheets, I began to pine for them whenever I evaluated regular XPath expressions on XML documents. Many people who use XPath to query XML documents would love to have functions like regexp:replace or set:distinct when trying to locate nodes containing certain values. The EXSLT implementation now provides a mechanism for using EXSLT extension functions when running XPath queries on any class that provides an XPathNavigator, such as XPathDocument, XmlDocument, or XmlDataDocument. Even better, this should work with any implementation of XPathNavigator, such as the one provided in the previous column, XPath Querying Over Objects with ObjectXPathNavigator.
The ability to perform XPath queries using EXSLT functions is provided by the ExsltContext class, which is derived from the XsltContext class. An explanation of how to implement custom XPath functions using subclasses of XsltContext is provided in the previous column, Adding Custom Functions to XPath. ResolveFunction is the key method overridden by the ExsltContext class to provide resolution of custom functions. This method is called during evaluation of the XPath query by the XPathNavigator to resolve references to EXSLT functions. In addition, the ExsltContextFunction class, which implements the IXsltContextFunction interface, is used during the evaluation of XPath queries involving EXSLT functions. The Invoke() method on the ExsltContextFunction is called at run time by the XPathNavigator with the provided arguments, which then results in the appropriate EXSLT function being called.
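As a rough illustration of the mechanism described above, the sketch below shows a minimal XsltContext subclass that resolves a single custom function. It is not the ExsltContext source: the DemoContext, DemoMaxFunction, and XPathDemo names, the demo prefix, and the urn:example:demo-functions namespace are all hypothetical, and the function only approximates what math:max() does.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

// Hypothetical function object; the XPath engine calls Invoke() at run time.
public class DemoMaxFunction : IXsltContextFunction {
    public int Minargs { get { return 1; } }
    public int Maxargs { get { return 1; } }
    public XPathResultType ReturnType { get { return XPathResultType.Number; } }
    public XPathResultType[] ArgTypes {
        get { return new XPathResultType[] { XPathResultType.NodeSet }; }
    }

    public object Invoke(XsltContext xsltContext, object[] args, XPathNavigator docContext) {
        // A node-set argument arrives as an XPathNodeIterator.
        XPathNodeIterator nodes = (XPathNodeIterator) args[0];
        double max = Double.NaN; // stays NaN when the node set is empty
        while (nodes.MoveNext()) {
            // Assumes every node holds a valid number; throws otherwise.
            double value = XmlConvert.ToDouble(nodes.Current.Value);
            if (Double.IsNaN(max) || value > max) max = value;
        }
        return max;
    }
}

// Hypothetical context that maps demo:max() to DemoMaxFunction.
public class DemoContext : XsltContext {
    public const string DemoNamespace = "urn:example:demo-functions";

    public DemoContext(NameTable table) : base(table) {
        AddNamespace("demo", DemoNamespace);
    }

    public override IXsltContextFunction ResolveFunction(
            string prefix, string name, XPathResultType[] argTypes) {
        if (LookupNamespace(prefix) == DemoNamespace && name == "max") {
            return new DemoMaxFunction();
        }
        return null; // not one of ours
    }

    // The remaining abstract members are not needed for this sketch.
    public override IXsltContextVariable ResolveVariable(string prefix, string name) {
        return null;
    }
    public override bool Whitespace { get { return true; } }
    public override bool PreserveWhitespace(XPathNavigator node) { return true; }
    public override int CompareDocument(string baseUri, string nextBaseUri) { return 0; }
}

public static class XPathDemo {
    // Compiles the query, attaches the context, and evaluates it.
    public static double MaxPrice(string xml) {
        XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();
        XPathExpression expr = nav.Compile("demo:max(/books/book/price)");
        expr.SetContext(new DemoContext(new NameTable()));
        return (double) nav.Evaluate(expr);
    }
}
```

Calling XPathDemo.MaxPrice() with a small books document follows the same compile, SetContext, Evaluate sequence that the ExsltContext example later in this article uses.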
Below is an XML document that describes some books I own along with information about them.
<books>
<book publisher="QUE">
<title>XML By Example</title>
<author>Benoit Marchal</author>
<publication-date>1999-12-31</publication-date>
<price>24.99</price>
</book>
<book publisher="Addison Wesley" on-loan="Dmitri">
<title>Essential C++</title>
<author>Stanley Lippman</author>
<publication-date>2000-10-31</publication-date>
<price>33.95</price>
</book>
<book publisher="WROX">
<title>XSLT Programmer's Reference</title>
<author>Michael Kay</author>
<publication-date>2001-04-30</publication-date>
<price>34.99</price>
</book>
<book publisher="Addison Wesley" on-loan="Sanjay">
<title>Mythical Man Month</title>
<author>Frederick Brooks</author>
<publication-date>1995-06-30</publication-date>
<price>29.95</price>
</book>
<book publisher="Apress">
<title>Programmer's Introduction to C#</title>
<author>Eric Gunnerson</author>
<publication-date>2001-06-30</publication-date>
<price>34.95</price>
</book>
</books>
The following classes map the above XML document to objects for use with the XmlSerializer:
using System.Xml.Serialization;
/// <remarks/>
[System.Xml.Serialization.XmlRootAttribute("books", Namespace="", IsNullable=false)]
public class books {
/// <remarks/>
[System.Xml.Serialization.XmlElementAttribute("book")]
public bookType[] book;
}
/// <remarks/>
public class bookType {
/// <remarks/>
public string title;
/// <remarks/>
public string author;
/// <remarks/>
[System.Xml.Serialization.XmlElementAttribute("publication-date", DataType="date")]
public System.DateTime publicationdate;
/// <remarks/>
public System.Decimal price;
/// <remarks/>
[System.Xml.Serialization.XmlAttributeAttribute()]
public string publisher;
/// <remarks/>
[System.Xml.Serialization.XmlAttributeAttribute("on-loan")]
public string onloan;
}
The example below shows the results of running XPath queries using the ExsltContext class on the above XML document and on an instance of the books class, using the XPathDocument and ObjectXPathNavigator respectively. The extension functions math:max(), math2:avg(), and set:has-same-node() are used in both cases.
using System; //for Console and Exception
using System.IO;
using System.Xml.Serialization; //for XmlSerializer
using System.Xml.XPath;
using GotDotNet.Exslt; //for EXSLT
using Microsoft.MSDN; //for ObjectXPathNavigator
public class ExsltXPathTest {
public static string highestprice1 = "math:max(/books/book/price)";
public static string averageprice1 = "math2:avg(/books/book/price)";
public static string overthirtybucksandonloan1 =
"set:has-same-node(//book[@on-loan], //book[price > 30])";
public static string highestprice2 = "math:max(/books/book/bookType/@price)";
public static string averageprice2 = "math2:avg(/books/book/bookType/@price)";
public static string overthirtybucksandonloan2 =
"set:has-same-node(//bookType[@onloan], //bookType[@price > 30])";
public static void Main(string[] args) {
try{
TestXPathDocument();
TestObjectXPathNavigator();
Console.ReadLine();
}catch(Exception e){
Console.WriteLine(e);
}
}
private static void TestXPathDocument(){
Console.WriteLine("BEGIN: Testing XPathDocument...");
XPathDocument doc = new XPathDocument(@"C:\My Download Files\books.xml");
XPathNavigator nav = doc.CreateNavigator();
XPathExpression expr = nav.Compile(highestprice1);
ExsltContext ctxt = new ExsltContext(nav.NameTable);
expr.SetContext(ctxt);
Console.WriteLine("The highest priced book I own costs ${0}", nav.Evaluate(expr));
nav = doc.CreateNavigator();
expr = nav.Compile(averageprice1);
ctxt = new ExsltContext(nav.NameTable);
expr.SetContext(ctxt);
Console.WriteLine("The average price of the books I own is ${0}", nav.Evaluate(expr));
nav = doc.CreateNavigator();
expr = nav.Compile(overthirtybucksandonloan1);
ctxt = new ExsltContext(nav.NameTable);
expr.SetContext(ctxt);
Console.WriteLine("Q: Do I have any books that cost over $30 loaned out?\nA: {0}",
(bool) nav.Evaluate(expr) ? "Yes" : "No");
Console.WriteLine("END: Testing XPathDocument...");
}
private static void TestObjectXPathNavigator(){
Console.WriteLine("BEGIN: Testing ObjectXPathNavigator...");
TextReader reader = new StreamReader(@"C:\My Download Files\books.xml");
XmlSerializer serializer = new XmlSerializer(typeof(books));
books myBooks = (books)serializer.Deserialize(reader);
reader.Close();
XPathNavigator nav = new ObjectXPathNavigator(myBooks );
XPathExpression expr = nav.Compile(highestprice2);
ExsltContext ctxt = new ExsltContext(nav.NameTable);
expr.SetContext(ctxt);
Console.WriteLine("The highest priced book I own costs ${0}", nav.Evaluate(expr));
nav = new ObjectXPathNavigator(myBooks );
expr = nav.Compile(averageprice2);
ctxt = new ExsltContext(nav.NameTable);
expr.SetContext(ctxt);
Console.WriteLine("The average price of the books I own is ${0}", nav.Evaluate(expr));
nav = new ObjectXPathNavigator(myBooks );
expr = nav.Compile(overthirtybucksandonloan2);
ctxt = new ExsltContext(nav.NameTable);
expr.SetContext(ctxt);
Console.WriteLine("Q: Do I have any books that cost over $30 loaned out?\nA: {0}",
(bool) nav.Evaluate(expr) ? "Yes" : "No");
Console.WriteLine("END: Testing ObjectXPathNavigator...");
}
}
The program above produces the following output:
BEGIN: Testing XPathDocument...
The highest priced book I own costs $34.99
The average price of the books I own is $31.766
Q: Do I have any books that cost over $30 loaned out?
A: Yes
END: Testing XPathDocument...
BEGIN: Testing ObjectXPathNavigator...
The highest priced book I own costs $34.99
The average price of the books I own is $31.766
Q: Do I have any books that cost over $30 loaned out?
A: Yes
END: Testing ObjectXPathNavigator...
Note: Due to a bug in version 1.0 of the .NET Framework, an exception is thrown when running queries that use namespaces and involve XPath extension functions. This bug is fixed in version 1.1 of the .NET Framework.
Interoperability with Other Implementations of EXSLT
The previous EXSLT implementation described in this column did not conform to the EXSLT specification in a number of ways. First of all, a number of function names differed from those in the specification because the specified names contain hyphens (set:has-same-node, for example), which are not valid in C# method names. To get around this issue, Oleg Tkachenko came up with a workaround: directly modifying the MSIL generated by compiling the class to alter the method names. This solution makes it possible to have a method that is written as follows:
public bool hasSameNode(XPathNodeIterator nodeset1, XPathNodeIterator nodeset2)
But accessed from an XSLT stylesheet as:
has-same-node(nodeset1, nodeset2)
Another problem was that the previous EXSLT implementation included some extension functions of my own, but placed them in the same namespaces as the functions that are actually in the EXSLT specification. This could lead to portability problems if a developer used one of those functions thinking that it was part of the EXSLT specification, only to find out it was specific to my implementation. This has been fixed by putting all the functions that are not in the EXSLT specification in separate namespaces.
Finally, certain XSLT processors such as Saxon and Xalan also provide alternative names for certain functions (such as set:hasSameNode and set:has-same-node). The EXSLT implementation has been updated to support the alternate names of the various EXSLT functions.
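When function names arrive as strings, as they do in an XPath function resolver, one way to accept both spellings would be to normalize the hyphenated form to camelCase before looking up the C# method. The sketch below illustrates that idea only; it is a different technique from the MSIL renaming described above, and the FunctionNames class and its methods are hypothetical rather than part of the EXSLT implementation.

```csharp
using System;
using System.Reflection;
using System.Text;

public static class FunctionNames {
    // Converts "has-same-node" to "hasSameNode"; names without hyphens
    // (such as "hasSameNode") pass through unchanged.
    public static string ToCamelCase(string xpathName) {
        StringBuilder sb = new StringBuilder(xpathName.Length);
        bool upperNext = false;
        foreach (char c in xpathName) {
            if (c == '-') {
                upperNext = true; // drop the hyphen, capitalize what follows
                continue;
            }
            sb.Append(upperNext ? Char.ToUpperInvariant(c) : c);
            upperNext = false;
        }
        return sb.ToString();
    }

    // Looks up a public method on the given type by either spelling.
    // Returns null when no such method exists; GetMethod throws
    // AmbiguousMatchException if the name is overloaded.
    public static MethodInfo Resolve(Type library, string xpathName) {
        return library.GetMethod(
            ToCamelCase(xpathName),
            BindingFlags.Public | BindingFlags.Instance | BindingFlags.Static);
    }
}
```

With this mapping, both has-same-node and hasSameNode would resolve to the same hasSameNode method, so a single C# implementation could serve both names.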
Improving Performance with Smarter Algorithms
After running some performance tests on the previous EXSLT implementation, Dimitre Novatchev noticed a number of places where the algorithms used by functions that operate on node sets, such as distinct() and has-same-node(), could be improved. The original implementations of these methods performed linear searches, which became inefficient as the node sets grew. Below is a comparison of the original and improved implementations of the distinct() function, which takes an input node set and returns an output node set from which nodes with duplicate values have been removed.
Original:
public XPathNodeIterator distinct(XPathNodeIterator nodeset){
ExsltNodeList nodelist = new ExsltNodeList();
while(nodeset.MoveNext()){
if(!nodelist.ContainsValue(nodeset.Current.Value)){
nodelist.Add(nodeset.Current.Clone());
}
}
return ExsltCommon.ExsltNodeListToXPathNodeIterator(nodelist);
}
Improved:
public XPathNodeIterator distinct(XPathNodeIterator nodeset){
if(nodeset.Count > 15)
return distinct2(nodeset);
//else
ExsltNodeList nodelist = new ExsltNodeList();
while(nodeset.MoveNext()){
if(!nodelist.ContainsValue(nodeset.Current.Value)){
nodelist.Add(nodeset.Current.Clone());
}
}
return ExsltCommon.ExsltNodeListToXPathNodeIterator(nodelist);
}
private XPathNodeIterator distinct2(XPathNodeIterator nodeset){
ExsltNodeList nodelist = new ExsltNodeList();
Hashtable ht = new Hashtable(nodeset.Count / 3);
while(nodeset.MoveNext()){
string strVal = nodeset.Current.Value;
if(! ht.ContainsKey(strVal) ){
ht.Add(strVal, "");
nodelist.Add(nodeset.Current.Clone());
}
}
return ExsltCommon.ExsltNodeListToXPathNodeIterator(nodelist);
}
The original implementation of the function looped through each node in the input node set (an XPathNodeIterator instance) and used the ContainsValue() method to check whether the node's value was already in the list of distinct nodes seen so far. If it was not, the node was added to the list. The inefficiency is that ContainsValue() searches the list linearly, so whenever the value isn't already in the list it visits every node in the list. The improved implementation keeps the original approach for small inputs (15 nodes or fewer) and, for larger ones, uses a hash table to look up whether a node value has been seen before, which is faster than a linear search of the node list. The tradeoff is that a hash table of distinct values is now needed. Similar improvements were made to the other EXSLT set functions in the ExsltSets class.
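To try the two lookup strategies without the EXSLT library, the same idea can be sketched on plain strings. This is an illustration under assumptions, not the library's code: the DistinctSketch class is hypothetical, strings stand in for node values, and the generic List and HashSet collections replace the ExsltNodeList and Hashtable used above.

```csharp
using System;
using System.Collections.Generic;

public static class DistinctSketch {
    // Linear version: List.Contains scans every value kept so far,
    // so the total work grows roughly with the square of the input size.
    public static List<string> DistinctLinear(IEnumerable<string> values) {
        List<string> result = new List<string>();
        foreach (string v in values) {
            if (!result.Contains(v)) {
                result.Add(v);
            }
        }
        return result;
    }

    // Hashed version: HashSet.Add returns false for a value already seen,
    // so each membership check takes roughly constant time on average.
    public static List<string> DistinctHashed(IEnumerable<string> values) {
        HashSet<string> seen = new HashSet<string>();
        List<string> result = new List<string>();
        foreach (string v in values) {
            if (seen.Add(v)) {
                result.Add(v);
            }
        }
        return result;
    }
}
```

Both methods return the first occurrence of each value in input order; only the cost of the membership check differs, at the price of the extra set held in memory.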
The tables below show the differences between the old and new implementations of the methods in the ExsltSets class.
Table 1. Differences in the intersection() method
No. of nodes | 10000 | 2000 | 1000 | 500 | 250 | 100 | 50 |
---|---|---|---|---|---|---|---|
Old implementation | 5662ms | 110ms | 39ms | 18ms | 13ms | 10ms | 9.9ms |
New implementation | 155ms | 46ms | 35ms | 18ms | 13ms | 10ms | 9.9ms |
Table 2. Differences in the difference() method
No. of nodes | 10000 | 2000 | 1000 | 500 | 250 | 100 | 50 |
---|---|---|---|---|---|---|---|
Old implementation | 5718ms | 120ms | 45ms | 29ms | 20ms | 18ms | 15ms |
New implementation | 120ms | 40ms | 30ms | 29ms | 20ms | 18ms | 15ms |
Table 3. Differences in the distinct() method
No. of nodes | 10000 | 2000 | 1000 | 500 | 250 | 100 | 50 |
---|---|---|---|---|---|---|---|
Old implementation | 7420ms | 280ms | 70ms | 30ms | 15ms | 15ms | 12ms |
New implementation | 45ms | 20ms | 15ms | 10ms | 10ms | 10ms | 10ms |
Table 4. Differences in the has-same-node() method
No. of nodes | 10000 | 2000 | 1000 | 500 | 250 | 100 | 50 |
---|---|---|---|---|---|---|---|
Old implementation | 6299ms | 145ms | 50ms | 26ms | 13ms | 11ms | 11ms |
New implementation | 160ms | 48ms | 32ms | 26ms | 13ms | 11ms | 11ms |
Table 5. Differences in the subset() method
No. of nodes | 10000 | 2000 | 1000 | 500 | 250 | 100 | 50 |
---|---|---|---|---|---|---|---|
Old implementation | 9133ms | 310ms | 88ms | 35ms | 19ms | 12ms | 11ms |
New implementation | 89ms | 30ms | 25ms | 21ms | 19ms | 12ms | 11ms |
The test results are in milliseconds. All tests were run on a PC with a 1.7 GHz CPU and 256 MB of RAM running Microsoft Windows® 2000.
Conclusion
The combined efforts of various members of our developer community, such as Oleg Tkachenko, Dimitre Novatchev, and Paul Reid, have turned the samples in the Extreme XML column into a worthwhile implementation of the EXSLT specification. I thank them for their efforts, and I hope their example encourages readers to visit the workspace and add their input as well.
Dare Obasanjo is a member of Microsoft's WebData team, which among other things develops the components within the System.Xml and System.Data namespaces of the .NET Framework, Microsoft XML Core Services (MSXML), and Microsoft Data Access Components (MDAC).
Feel free to post any questions or comments about this article on the Extreme XML message board on GotDotNet.