Equality Semantics of LINQ to XML Trees
In certain scenarios, it is important to be able to compare two XML trees for equivalence. For example, if you are writing a web service that serves results of queries, and you want to cache query results so that duplicate queries use previously cached results instead of always accessing the underlying database. However, the senders of those queries may potentially be using a variety of tools to generate the queries, and these tools may introduce trivial differences into the XML. The intent of the queries may be identical, but XNode.DeepEquals returns false if you compare the XML that contains semantically equivalent, but trivially different queries.
This blog is inactive.
New blog: EricWhite.com/blog
Blog TOC(July 1, 2009 - Updated - Normalize empty elements to use a self-closing tag.)
This post describes an approach and presents code for normalizing LINQ to XML trees. After normalizing XML trees, you can call XNode.DeepEquals with a greater chance that semantically equivalent XML trees will evaluate as equivalent. The approach presented in this post makes use of the assistance that XSD can provide when normalizing XML trees. If an XML document has been validated using XSD, and the XML tree has been annotated with the Post Schema Validation Infoset (PSVI), then we can use that PSVI to help normalize the tree. Much of this post is based on the OASIS specification Schema Centric XML Canonicalization Version 1.0 (C14N), which describes a variety of ways to normalize an XML tree, including through the use of PSVI.
Let’s look at a few common cases where XNode.DeepEquals reports that equivalent XML trees are unequal. In the following example, the two trees have exactly the same content. The only difference is that one of them uses a default namespace, and the other uses the same namespace using a namespace prefix. XNode.DeepEquals will report that the two trees are different.
XElement root1 = XElement.Parse(
@"<Root xmlns='https://www.northwind.com'>
<Child>1</Child>
</Root>");
XElement root2 = XElement.Parse(
@"<n:Root xmlns:n='https://www.northwind.com'>
<n:Child>1</n:Child>
</n:Root>");
if (XNode.DeepEquals(root1, root2))
Console.WriteLine("Equal");
else
Console.WriteLine("Not Equal");
Here’s another case where XNode.DeepEquals returns false:
XElement root1 = XElement.Parse(
@"<Root a='1' b='2'>
<Child>1</Child>
</Root>");
XElement root2 = XElement.Parse(
@"<Root b='2' a='1'>
<Child>1</Child>
</Root>");
if (XNode.DeepEquals(root1, root2))
Console.WriteLine("Equal");
else
Console.WriteLine("Not Equal");
These two are equivalent, but the attributes are not ordered. However, because you can’t have two attributes in an element that have the same qualified name, we can sort the attributes by namespace and name, and then compare.
The situation gets more interesting when we consider normalizing a document that has been validated using XSD. Consider the following simple schema:
<xsd:schemaxmlns:xsd='https://www.w3.org/2001/XMLSchema'>
<xsd:elementname='Root'
type='xsd:double'/>
</xsd:schema>
Let’s say that we are comparing two small XML documents:
Document 1:
<Root>25</Root>
Document 2:
<Root>+25</Root>
These two documents have essentially the same content, but the ‘+’ before the 25 in the second document will cause the two documents to compare as not equals. However, if we have first validated the documents and annotated the tree with a PSVI using the XDocument.Validate extension method, we know that the data type of the element is double, and can normalize the value of the element. Once normalized, the two documents will compare as equals.
Going further, a schema can declare default attributes and elements. If an XML document in one case has an attribute with the default value, and in another case, the XML document is missing the default attribute, the attribute can be added when validating, and the two trees will compare as equivalent.
For illustration, consider this schema and two documents:
<xsd:schemaxmlns:xsd='https://www.w3.org/2001/XMLSchema'>
<xsd:elementname='Root'>
<xsd:complexType>
<xsd:simpleContent>
<xsd:extensionbase='xsd:string'>
<xsd:attributename='ADefaultBooleanAttribute'
default='false'/>
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>
</xsd:element>
</xsd:schema>
Document 1:
<Root/>
Document 2:
<Root ADefaultBooleanAttribute='false'/>
The above two documents will compare as equivalent after validation and adding PSVI.
In the remainder of this post, I’ll present a list of areas where it would be possible to normalize a tree before comparison. Then, I’ll discuss a simple implementation of normalization using LINQ to XML.
Normalization Issues
The following is a list of issues that can be addressed when normalizing an XML tree:
- Insignificant white space should not exist in a normalized tree.
- Namespace prefixes and the use of default namespaces should not be significant. It is sufficient to compare qualified names while disregarding whether the namespaces are serialized by a prefix, or as the default namespace.
- Missing default elements and attributes should be added to the XML tree when normalizing.
- Values of elements and attributes of certain data types can be normalized. Types that can be normalized include xsd:boolean, xsd:dateTime, xsd:decimal, xsd:double, xsd:float, xsd:hexBinary,and xsd:language.
- The attributes xsi:schemaLocation and xsd:noNamespaceSchemaLocation exist only to give hints to a schema processor about the location of the schema. We can discard these attributes when normalizing an XML tree.
- We can order attributes alphabetically by namespace and name, eliminating insignificant ordering differences.
- Comments and processing instructions are not semantically significant when comparing trees. We can remove them when normalizing.
- Empty elements (<Tag></Tag>) can be normalized to use a self-closing tag (<Tag/>). These two forms are semantically equivalent per the XML specification. Further, if an element is declared to be EMPTY, only a self-closing tag will do, whereas any empty element can be converted to a self-closing tag, so it's always safe to convert to self-closing. This is the default behavior of LINQ to XML.
One important point about normalization is that you can define application-specific normalization rules. It would be easy to modify the normalization code presented in this post to, say, look for a particular named complex type in the PSVI, and normalize that element per specific rules, perhaps converting attribute or element values to upper or lower case, or some other bespoke normalization.
LINQ to XML Implementation of Normalization
The approach that I took when normalizing is to generate a new, normalized XML tree (instead of modifying the existing tree to normalize it.) This has the advantage that we can write this code in the pure functional style. The resulting code will be small and easy to maintain. In the example code that I present in this post, the pure functional transform to generate a new normalized XML tree is about 190 lines of code. For more information on this style of cloning, see the blog post “Manually Cloning LINQ to XML Trees”.
Instead of optimizing for processing speed, I optimized for compactness and readability of code. This is probably a good trade-off. This code will be pretty fast in any case, and other processes in our solution, such as database access, can be multiple orders of magnitude slower, so it probably doesn’t matter if there are ways to write the code so that it will execute slightly faster.
XML Trees Must Validate before Normalization
The first and most important point is that this code relies on the XML validating successfully before normalizing. The code allows you to specify no schema, but in this case, it only does the normalization that doesn’t require PSVI. But if you specify a schema, and the code doesn’t throw an exception, then the document(s) were valid per the schema.
White Space
The default behavior of LINQ to XML is to discard insignificant white space when populating an XML tree. No text nodes for insignificant white space are materialized. This makes it easy for us – the code presented here assumes that insignificant white space is not in the XML tree.
Namespace Prefix Normalization
Because LINQ to XML has a simple model for namespaces, normalization is easy. The full namespace is included in every XName object. There can be XAttribute objects that contain namespace declarations, however, these “attributes” only have an effect upon serialization. For the purposes of normalization, we can simply remove all XAttribute objects that are namespace declarations. We can determine whether an attribute is a namespace declaration using the XAttribute.IsNamespaceDeclaration property.
Adding Default Elements and Attributes
One of the overloads of the Validate extension methods for XDocument allows us to pass a Boolean parameter that indicates that the XML tree be populated with the PSVI. As part of this population of the PSVI, default elements and attributes will be added to the tree. They will, then, be cloned in the new tree.
Normalizing Values of Elements and Attributes of Certain Data Types
For five data types (xsd:boolean, xsd:decimal, xsd:double, xsd:dateTime, and xsd:float), we can take advantage of one particular aspect of LINQ to XML semantics: LINQ to XML is lax when validating values that are passed to constructors, and is strict when serializing values. For instance, if we know that an attribute is type xsd:double, we can create a new normalized attribute like this:
return new XAttribute(a.Name, (double)a);
And in a similar way, we can create a normalized xsd:double element like this:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
(double)element);
Elements and attributes of type xsd:hexBinary and xsd:language are case insensitive. We can normalize by converting them to lower case.
Removing Comments, Processing Instructions, and the Attributes xsi:schemaLocation and xsi:noNamespaceSchemaLocation
This is straightforward – the code trims these while cloning.
Ordering Attributes by Name
While cloning the tree, it is straightforward to order attributes by name.
Normalize Empty Elements to Self-Closing Elements
The default behavior of LINQ to XML is to serialize an empty element as a self-closing element.
Normalization Issues Not Handled
Schema Centric XML Canonicalization Version 1.0 (C14N) describes a number of other possibilities for normalization that I’ve not handled:
- When using an XML programming interface other than LINQ to XML, there are some things we could do to normalize entities from a DTD. However, because LINQ to XML has eliminated entity support, and all entities are expanded before the LINQ to XML tree is populated, we can disregard normalization issues with entities.
- Data stored in base64Binary can have white space interjected at very specific points. Normalization could include making sure that this white space conforms to the specification. I left this as an exercise for the reader. This can be accomplished in a few lines of code each for elements and attributes. Feel free to post your solution in a comment below. J
- This code does no character modeling normalization other than any normalization done by XmlReader when deserializing the XML.
- An XML document can contain an element or attribute that contains an XPath expression. This expression will probably rely on the use of particular namespace prefixes. This code does not attempt to normalize XPath expressions that use specific namespace prefixes. If the XML that you are normalizing contains XPath expressions, it will be necessary to add in the XAttribute namespace definitions so that the XML will serialize with the correct namespace prefix. The code presented here doesn’t deal with these issues.
- There are rules for normalization of white space within attributes for certain data types. This code doesn’t do any normalization of this white space.
- The C14N document states that if a complex element contains only child elements (mixed content is not allowed), and if the element validates against a model group whose compositor is xsd:all, then the order of child elements is not significant. It could be possible to sort them by namespace and name when normalizing. Initially, I wrote the code to do this normalization. However, I think it’s possible that code can implement differing behavior based on ordering, so I elected to not implement this. If you believe that this normalization is or isn’t valid, please feel free to advise me in a comment. J
About the Code
This post presents two methods: Normalize and DeepEqualsWithNormalization.
Normalize
This method creates and returns a new, cloned, normalized XDocument.
Because it relies on schema validation, it normalizes an XDocument, not an XElement. However, it is easy to create an XDocument from an XElement if you need to normalize an XML tree rooted in an XElement.
It is valid to pass null for the schema parameter, in which case the method will only do the normalizations that are possible without using PSVI.
Signature:
public static XDocument Normalize(XDocument source, XmlSchemaSet schema)
DeepEqualsWithNormalization
This method compares two XDocument objects after normalization.
It is valid to pass null for the schema parameter, in which case the method will only do the normalizations that are possible without using PSVI.
Signature:
public static bool DeepEqualsWithNormalization(XDocument doc1, XDocument doc2,
XmlSchemaSet schemaSet)
Test Cases
In the attached code, along with the public and private methods to do the normalization, etc, there are about a dozen test cases that cause full code coverage of the normalization code. The sample code runs each test case, and reports whether the new normalized trees are equivalent.
The Code
The following listing shows the code used for XML tree normalization and comparison, including an extensions class and the Normalize and DeepEqualsWithNormalization methods:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;
public static class MyExtensions
{
public static string ToStringAlignAttributes(this XDocument document)
{
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.OmitXmlDeclaration = true;
settings.NewLineOnAttributes = true;
StringBuilder stringBuilder = new StringBuilder();
using (XmlWriter xmlWriter = XmlWriter.Create(stringBuilder, settings))
document.WriteTo(xmlWriter);
return stringBuilder.ToString();
}
}
class Program
{
private static class Xsi
{
public static XNamespace xsi = "https://www.w3.org/2001/XMLSchema-instance";
public static XName schemaLocation = xsi + "schemaLocation";
public static XName noNamespaceSchemaLocation = xsi + "noNamespaceSchemaLocation";
}
public static XDocument Normalize(XDocument source, XmlSchemaSet schema)
{
bool havePSVI = false;
// validate, throw errors, add PSVI information
if (schema != null)
{
source.Validate(schema, null, true);
havePSVI = true;
}
return new XDocument(
source.Declaration,
source.Nodes().Select(n =>
{
// Remove comments, processing instructions, and text nodes that are
// children of XDocument. Only white space text nodes are allowed as
// children of a document, so we can remove all text nodes.
if (n is XComment || n is XProcessingInstruction || n is XText)
return null;
XElement e = n as XElement;
if (e != null)
return NormalizeElement(e, havePSVI);
return n;
}
)
);
}
public static bool DeepEqualsWithNormalization(XDocument doc1, XDocument doc2,
XmlSchemaSet schemaSet)
{
XDocument d1 = Normalize(doc1, schemaSet);
XDocument d2 = Normalize(doc2, schemaSet);
return XNode.DeepEquals(d1, d2);
}
private static IEnumerable<XAttribute> NormalizeAttributes(XElement element,
bool havePSVI)
{
return element.Attributes()
.Where(a => !a.IsNamespaceDeclaration &&
a.Name != Xsi.schemaLocation &&
a.Name != Xsi.noNamespaceSchemaLocation)
.OrderBy(a => a.Name.NamespaceName)
.ThenBy(a => a.Name.LocalName)
.Select(
a =>
{
if (havePSVI)
{
var dt = a.GetSchemaInfo().SchemaType.TypeCode;
switch (dt)
{
case XmlTypeCode.Boolean:
return new XAttribute(a.Name, (bool)a);
case XmlTypeCode.DateTime:
return new XAttribute(a.Name, (DateTime)a);
case XmlTypeCode.Decimal:
return new XAttribute(a.Name, (decimal)a);
case XmlTypeCode.Double:
return new XAttribute(a.Name, (double)a);
case XmlTypeCode.Float:
return new XAttribute(a.Name, (float)a);
case XmlTypeCode.HexBinary:
case XmlTypeCode.Language:
return new XAttribute(a.Name,
((string)a).ToLower());
}
}
return a;
}
);
}
private static XNode NormalizeNode(XNode node, bool havePSVI)
{
// trim comments and processing instructions from normalized tree
if (node is XComment || node is XProcessingInstruction)
return null;
XElement e = node as XElement;
if (e != null)
return NormalizeElement(e, havePSVI);
// Only thing left is XCData and XText, so clone them
return node;
}
private static XElement NormalizeElement(XElement element, bool havePSVI)
{
if (havePSVI)
{
var dt = element.GetSchemaInfo();
switch (dt.SchemaType.TypeCode)
{
case XmlTypeCode.Boolean:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
(bool)element);
case XmlTypeCode.DateTime:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
(DateTime)element);
case XmlTypeCode.Decimal:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
(decimal)element);
case XmlTypeCode.Double:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
(double)element);
case XmlTypeCode.Float:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
(float)element);
case XmlTypeCode.HexBinary:
case XmlTypeCode.Language:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
((string)element).ToLower());
default:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
element.Nodes().Select(n => NormalizeNode(n, havePSVI))
);
}
}
else
{
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
element.Nodes().Select(n => NormalizeNode(n, havePSVI))
);
}
}
}
Comments
Anonymous
February 24, 2009
In addition to posting my own content, I will from time to time post links to the great new Open XMLAnonymous
March 23, 2009
I've just run into the child element ordering issue with XNodeEqualityComparer.Equals. This function would be much more useful if this behavior could be disabled. At present it seems to be doing little more than a string comparison. As you mention, it's entirely possible to automatically determine whether or not this should be allowed via a schema. An easier to use (but less precise) alternative might be to allow the user to specify generic behavior via properties to the XNodeEqualityComparer instance. An issue regarding this behavior was opened a few months ago on the Visual Studio feedback site. You can view it at the following url: https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=386901Anonymous
March 24, 2009
Hy Waylon, Are you speaking of the issue where we could possibly order child elements if mixed content is not allowed, and if the element validates against a model group whose compositor is xsd:all? I have the code to do this normalization, and can post it if you're interested. -EricAnonymous
March 24, 2009
Eric, I believe these issues are essentially the same. The specific object of annoyance for me is XNodeEqualityComparer but, from your description, it seems to share its rather rudimentary implementation of equality comparison with XNode. I haven't decided what my strategy is going to be but your code could be very useful. Thanks for presenting this issue with such clarity and in such detail. Thanks also for your work towards a standards based solution. -WaylonAnonymous
May 26, 2009
From the XML 1.1 Specification section 3.1 "Note that the order of attribute specifications in a start-tag or empty-element tag is not significant.", and "[Definition: An element with no content is said to be empty.] The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag." The behavior coded in DeepEquals is a bug, clear as day.Anonymous
March 23, 2010
Hi Eric, Thanks for this extremely useful post! I have 2 requests/questions.
- I noticed that there is no normalization of whitespace for attribute values, spaces between attributes (and the equals for the attribute), and element values (i.e. leading/trailing whitespace and replacing multiple spaces between values with a single space. I've noticed that the normalization is taken care of when using a schema but is not otherwise. Is there any particular reason that you left this out?
- You mentioned that you optimized for compactness and readability of code instead of optimizing for processing speed. I am in a situation where there are other areas of code that are not orders of magnitude slower and the performance here is very important. I've looked thru you example here and don't see any glaring possibilities for improvement (certainly I'm missing something). Could you please post an optimized version of the code (and possibly a version that includes whitespace normalization)? It would be extremely helpful. Thanks! Dave Black
Anonymous
March 24, 2010
The comment has been removedAnonymous
March 24, 2010
Hi Eric, Thanks for the quick response. After further testing, I did find that the whitespace normalization for Attributes was correctly handled when using a schema. I went ahead and added to/modified your code to handle whitespace normalization for both attribute and element values. I also added the ability for comparison of attribute and element values by case sensitivity and CultureInfo. I added more tests to your code to cover these scenarios. I can post the code or send it to you if you'd like. Dave BlackAnonymous
March 26, 2010
Hi Dave, I'd love to see the code, and post it as an addendum to this post. -EricAnonymous
March 29, 2010
The comment has been removedAnonymous
March 29, 2010
Hi Dave, This normalization code disregards namespace prefixes entirely, and simply compares the fully qualified names. In both of those examples, there is a namespace prefix declaration, however, those prefixes are never used. I took advantage of one characteristic of LINQ to XML, which is that LINQ to XML looks to 'namespace attributes' to determine namespace prefixes. By removing those namespace prefix declarations from the LINQ to XML tree, then we disregard namespace prefixes, but we still have the fully qualified name as part of the XName for elements/attributes. Those namespace prefix declarations have no effect on the semantic content of the XML document if they are not used. If they are used, then of course, the namespace is used for the fully qualified name for those elements/attributes where they are used. Because the namespace prefix is never used, then those tests should evaluate as equivalent. If you add an element or attribute that also uses that 'c' namespace prefix, then after normalization, if you serialize them, you will see the automatically generated namespace prefixes that LINQ to XML generates. It boils down to this - unless the namespace prefix is used, it has no effect on the semantic content of the XML document. If it is used, then it isn't that we care what the namespace prefixes are, we only care that the fully qualified names of the various elements/attributes are equivalent. Does this make sense? -EricAnonymous
March 29, 2010
The comment has been removedAnonymous
March 30, 2010
The comment has been removedAnonymous
March 30, 2010
Hi Dave, Actually, the namespace prefixes that you are seeing are not part of the normalized XML tree. Those are namespace prefixes that LINQ to XML automatically generates if there are no 'namespace attributes' that define the namespace prefixes, and if more than one namespace is used. If there are no 'namespace attributes' in the XML tree, and elements are in a single namespace, then LINQ to XML will use the default namespace for them. For normalization, in most cases, we simply don't care what the namespace prefixes are. We only care what the fully qualified name is. The algorithm that LINQ to XML uses to determine those namespace prefixes is not defined. Before serialization of a normalized tree in the examples in this post, LINQ to XML doesn't know anything about namespace prefixes; it only knows about qualified names. Then when serializing, it needs to assign arbitrarily defined namespace prefixes so that it can serialize valid XML. -EricAnonymous
August 12, 2010
Hi EricWhite/Dave Black, I am tasked with comparing two data represented us XML and seems your approach makes a good place to start with. I am wondering that if Dave Black had posted his modification to Eric's code..If so where can i find it? thanks, BizNetAnonymous
August 12, 2010
Hi Biznet, I haven't posted the code yet. I did send it to Eric and am waiting for him to review it and provide feedback. DaveAnonymous
June 13, 2014
Would be nice is this could be a nuget package. Or pulled into a package of similar equality checks.