From the December 2001 issue of MSDN Magazine.

What's New in MSXML 4.0
Aaron Skonnard
Download the code for this article: XML0112.exe (50KB)
Browse the code for this article at Code Center: XML Schema Validator

The Microsoft® XML Parser (MSXML) 4.0 is full of new features and functionality related to XML Schema and SAX. XML Schema Definition (XSD) validation and reflection are the most significant of these, but there are many others that deserve attention as well. So even if you're still living in the reality of the "unmanaged" world, MSXML is full of improvements that will help you enhance your XML-based applications today.

MSXML 4.0 Technology Preview

      It was just about a year ago, in the September 2000 issue of MSDN® Magazine, that I wrote about what was new in MSXML 3.0, covering the numerous deltas between versions 2.0 and 3.0. This month's column picks up where that one left off by explaining the various facets of MSXML's continued evolution to version 4.0, which should be released by the time this issue hits the stands.
      This time around there are just as many significant improvements to discuss, if not more. So even though using XML through the Microsoft .NET Framework seems to be attracting the majority of the industry's attention right now, it's important to point out that the MSXML 4.0 team has been making equal strides in terms of standards support, ease of use, performance improvements, and other innovative enhancements. If you're still writing unmanaged code that deals with XML documents, MSXML 4.0 will be a welcome upgrade to your application.
      Instead of diving right into the new functionality, I'll start with a look back at what was thrown out of MSXML 4.0 to streamline and simplify the library. After that I'll tackle the new XML Schema API, by far the most significant addition to the library, and discuss how it can be used to perform validation using either SAX or DOM. Then I'll look at some examples of how to reflect against schema type information at runtime using SAX, DOM, or even XPath expressions. And finally, I'll cover other miscellaneous enhancements to the core SAX components.

Out with the Old, in with the New

      Microsoft has never been one to blithely remove antiquated functionality from its libraries. The July 2001 release of the MSXML 4.0 Technology Preview marks a slight change in that pattern: outdated functionality has been completely removed in certain key areas. This change forces developers upgrading to MSXML 4.0 to migrate any existing code that depends on the archaic bits, but it also simplifies, clarifies, and streamlines the overall library. To help explain what led to this decision, I'll briefly take a look at the history of MSXML.
      When MSXML 2.0 first shipped, it introduced support for one of the first W3C XSL working drafts, which included a path-based language known at the time as XSL Patterns. It also included a push-model parsing API exposed through the IXMLParser and IXMLNodeFactory interfaces. Although many considered these reference implementations of evolving specifications, eager developers quickly latched on and didn't want to let go.
      As time passed, the underlying specifications solidified. The original XSL working draft was factored into several independent specifications, including XPath 1.0 and XSLT 1.0. And during the same time frame, SAX evolved into the de facto streaming (push-model) API for XML documents. The XML standards landscape had changed dramatically and MSXML had to follow suit.
      After several interim Web releases, the MSXML team shipped version 3.0, which included support for these now-stable specifications. But instead of immediately deprecating the older implementations, they decided to leave them in and make them the default in order to foster backward compatibility.
      Although the MSXML 3.0 components are installed in side-by-side mode by default, you can optionally run xmlinst.exe to place them in replace mode. Replace mode modifies the Microsoft.XMLDOM and MSXML.DOMDocument PROGIDs to use the new MSXML 3.0 components. This change forces older applications to load the newer components.
      This decision ultimately backfired because it confused developers and the replace mode made applications harder to maintain and somewhat fragile. The confusion was evident on the MSXML-related mailing lists and newsgroups where a large percentage of the most common, recurring questions were directly related to the differing implementations.
      For example, developers anxious to try out the new XPath 1.0 support in MSXML 3.0 would immediately write code like this

  var doc = new ActiveXObject("MSXML2.DOMDocument");
  var sel = doc.selectNodes("/descendant::LineItem"); // fails

 

only to find out that the call to selectNodes doesn't understand the descendant axis because by default it interprets the string as an XSL Patterns expression. To use XPath 1.0 expressions, developers had to toggle the document's SelectionLanguage property ahead of time to switch the underlying implementation:

  var doc = new ActiveXObject("MSXML2.DOMDocument");
  doc.setProperty("SelectionLanguage", "XPath");
  var sel = doc.selectNodes("/descendant::LineItem"); // works

 

      A similar issue existed with the XSL/XSLT implementations, only it was subtler since the actual implementation was determined by the namespace in the transformation document.
      You might argue that since all of this was clearly documented, developers should have known better. But in reality, no matter how good the documentation, having too many options and not-so-obvious defaults always leads to confusion.
      In addition to the confusion, applications relying on MSXML's version-independent PROGIDs and replace mode would often experience problems after applying service packs, upgrades, or new applications to the same machine. For example, let's assume a developer writes an application that uses MSXML 3.0 components. The developer decides to use version-independent PROGIDs so his code will automatically load newer versions of the MSXML components once they're installed (to take advantage of bug fixes, and so on). The developer deploys the application to a production server and everything seems to run smoothly.
      Then, a few months later, it's time to upgrade the production machine to SQL Server™ 2000. The XML developer can't think of any reason why the upgrade would affect his application, so he gives the go-ahead. What he doesn't realize, however, is that SQL Server 2000 installs MSXML 2.6, which resets the version-independent PROGID values to the MSXML 2.6 components, which were not originally tested by the developer (and that undoubtedly contain bugs that were later fixed in version 3.0).
      There are situations in which version-independent PROGIDs can be useful, but they can also cause maintenance problems because it becomes harder to predict how changes to the machine's environment will adversely affect things.
      Problems like this one coupled with the resulting confusion forced the MSXML team to simplify the library by completely removing the legacy code along with replace mode starting with the July 2001 Technology Preview release. The library liposuction got rid of version-independent PROGIDs, the original XSL/XSL Pattern implementations, and the IXMLParser/IXMLNodeFactory functionality. The result is a more streamlined library that is easier for developers to use.
      MSXML 4.0 developers must now use explicit version-dependent PROGIDs/coclass names, as shown here:

  var doc = new ActiveXObject("MSXML2.DOMDocument.4.0");
  

 

This ensures that installing MSXML 4.0 will have no adverse effect on other existing applications. And removing the antiquated implementations clarifies the landscape for new developers.
      But removing replace mode also means that you can no longer control which version of the XSLT engine Microsoft Internet Explorer uses to process the <?xml-stylesheet?> processing instruction that is often used to render XML documents. Internet Explorer 5.0 remains restricted to MSXML 2.0 (or MSXML 3.0 in replace mode) until Microsoft ships a new version of the browser written against the latest MSXML release. Of course, Web developers can still use the latest and greatest XSLT engine, or any other new MSXML components for that matter, by simply referencing the appropriate version-dependent PROGIDs in script.
      Overall, these changes are positive for the future of MSXML. Now let's dive into the new stuff. Note: before installing the July 2001 Technology Preview release, see the "MSXML 4.0 Installation Notes" sidebar for additional information.

A New XML Schema API: SOM

      Probably the most significant addition to MSXML is the new Schema Object Model (SOM) based on the W3C XML Schema definition language (XSD) Recommendations. The SOM offers an in-memory object-graph representation of an XSD schema definition. In other words, the DOM is to an XML document what the SOM is to an XSD document.

Figure 1 SOM as Metadata

      An instance of a SOM is loaded into an XMLSchemaCache object and associated with a namespace URI. The cache can then be used in conjunction with either the SAX or DOM APIs for validation or reflection (see Figure 1). The schema metadata makes it possible to walk up to a node at runtime and determine its type, which consists of both structural and value-space information, and whether the node in question is a valid instance. Type-driven code can be quite powerful, as illustrated by today's mainstream Java and .NET reflection techniques.
      Figure 2 illustrates the SOM interface hierarchy. Each of the nodes in the SOM tree will implement one or more of these interfaces to provide access to a specific portion of the schema definition.

Figure 2 SOM Interface Hierarchy

      ISchemaItem is at the root of the hierarchy. Every node in the SOM implements this interface (equivalent to the DOM Node interface), which provides basic information like the node's name, namespace, and type. There are also derived interfaces for modeling the overall schema definition (ISchema), individual simple and complex type definitions (ISchemaType/ISchemaComplexType), particles such as elements, model groups, or wildcards (ISchemaParticle, ISchemaElement, and so on), and attributes/attribute groups (ISchemaAttribute/ISchemaAttributeGroup).
      All of these interfaces come together to provide a programmatic interface to a logical XSD schema definition. If you're used to working with the DOM, you should feel right at home with the SOM.

Schema Validation Techniques

      Before inspecting the mechanics of schema validation, I'll briefly discuss why you would want to use XSD validation over Document Type Definition (DTD) validation. Simply put, XSD validation is superior because it addresses value spaces in addition to structural constraints. With DTDs, there were only a few simple, text-based types such as #PCDATA, ID, IDREF, NMTOKEN, and so on, but nothing that resembled common programming language types such as string, long, or double.
      XSD filled this void by providing a new set of primitive and built-in types that intuitively map to most common programming language and database type systems. XSD also makes it possible for developers to define their own simple types that restrict the value space of another type. Then developers can rely on an XSD validator to determine whether a given value is within the predefined value-space.
      The XSD schema definition in Figure 3 shows an example of defining a few new simple types or value spaces based on the built-in type integer. Figure 4 illustrates three different value spaces in use: the predefined value space for integer, a restriction of integer called age that has a value space of 0-125 (inclusive), and a restriction of age called toddlerAge that has a value space of 0-3 (inclusive). With this kind of type information available, it's possible for an XSD validator to determine whether the lexical value of the monica element is a valid instance of one of these types. XSD is capable of much more, but this is the fundamental difference between XSD and DTD validation. For more details, see the MSXML 4.0 documentation or the XSD primer at https://www.w3.org/TR/xmlschema-0.
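      As a rough sketch (not the exact Figure 3 listing), a schema along those lines might look like this, using the same https://example.org/mytypes namespace and monica element as the rest of the samples:

  <xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'
              xmlns:m='https://example.org/mytypes'
              targetNamespace='https://example.org/mytypes'>
    <!-- age: a restriction of integer with a value space of 0-125 (inclusive) -->
    <xsd:simpleType name='age'>
      <xsd:restriction base='xsd:integer'>
        <xsd:minInclusive value='0'/>
        <xsd:maxInclusive value='125'/>
      </xsd:restriction>
    </xsd:simpleType>
    <!-- toddlerAge: a further restriction of age with a value space of 0-3 (inclusive) -->
    <xsd:simpleType name='toddlerAge'>
      <xsd:restriction base='m:age'>
        <xsd:maxInclusive value='3'/>
      </xsd:restriction>
    </xsd:simpleType>
    <!-- a global declaration for the monica element; switch the type to
         m:toddlerAge to reproduce the validation error shown later -->
    <xsd:element name='monica' type='m:age'/>
  </xsd:schema>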

Figure 4 XSD and Value Spaces

      Now let's turn to the actual API for performing XSD validation. With the DOM, you can either load the schema into an XMLSchemaCache object ahead of time or you can rely on the xsi:schemaLocation attribute in the instance document. To use the schema cache approach, first you need to load the schemas into an XMLSchemaCache (via the add method) and then associate it with a DOMDocument through the schemas attribute. Then as long as the DOMDocument's validateOnParse property is set to true (the default), validation will occur against the cached schemas during load (or during an explicit call to validate). See Figure 5 for an example.
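      In essence, the schema cache approach boils down to just a few lines. The following Visual Basic sketch (not the full Figure 5 listing) assumes the schema.xsd and monica.xml files used throughout this column:

  ' populate a schema cache and associate the schema with its namespace URI
  Dim sc As New XMLSchemaCache40
  sc.add "https://example.org/mytypes", "schema.xsd"

  ' attach the cache to the document; validateOnParse is True by default
  Dim doc As New DOMDocument40
  Set doc.schemas = sc
  doc.async = False

  If doc.load("monica.xml") Then
      Debug.Print "valid instance"
  Else
      ' parseError carries the validation details (reason, line, source text)
      Debug.Print "### error: " & doc.parseError.reason
  End If
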
      If I had declared the monica element to be of type long or age in my schema, the script in Figure 5 would have reported that the instance was valid (see Figure 4). But if I had declared monica to be of type toddlerAge, it would have flagged the instance as invalid because monica's content (or value) falls outside the allowable range (0-3). The script produces an error that looks something like this:

  ### error: (7, 6) The element:'{https://example.org/mytypes}monica' has an invalid value according to its data type.
  ### source: >29</m:monica>

 

      It's even easier if the instance documents use the xsi:schemaLocation attribute, as shown here:

  <m:monica
    xmlns:m='https://example.org/mytypes'
    xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
    xsi:schemaLocation='https://example.org/mytypes schema.xsd'
  >29</m:monica>

 

In this case, validation will automatically occur against the referenced schema definition without explicitly using a schema cache.
      The JScript®-based program in Figure 6 is the canonical command-line validation utility that is capable of either approach, as well as DTD validation. It expects you to provide the instance file name along with optional schema file locations (you can omit this if xsi:schemaLocation is used in the instance). You don't have to provide the schema namespace since the utility automatically digs into the provided schema definition and reads it from the targetNamespace attribute. You can also provide multiple schemas on the command line, as shown here:

  C:\> validate monica.xml -s schema1.xsd -s schema2.xsd
  

 

      I've also provided a sample Visual Basic®-based application for those who prefer to work with GUIs (you'll find the code at the link at the top of this article). The app allows you to load an XSD schema into the left pane, and an instance document into the right pane, both of which you can modify before validating (see Figure 7).

Figure 7 Visual Basic Schema Validator

      The SAX API also supports schema validation, both through a schema cache and via xsi:schemaLocation and xsi:noNamespaceSchemaLocation. First, schema validation has to be enabled through a call to putFeature, specifying schema-validation as the feature name. Next, you populate an XMLSchemaCache object and associate it with the SAXXMLReader through a call to putProperty. Then the SAX parser will perform schema validation during the call to parse/parseURL.
      Validation errors are reported through the SAX ErrorHandler interface before terminating the parse. So to find out about validation errors, you need to implement ISAXErrorHandler/IVBSAXErrorHandler and associate it with the SAXXMLReader before calling parse. The following code fragment in Visual Basic illustrates how to configure a SAX reader for schema validation:

  Dim reader As New SAXXMLReader40
  Dim sc As New XMLSchemaCache40
  sc.add "https://example.org/mytypes", "schema.xsd"
  reader.putFeature "schema-validation", True
  reader.putProperty "schemas", sc
  Dim mh As New MyHandler ' implements IVBSAXErrorHandler
  Set reader.errorHandler = mh
  reader.parseURL "monica.xml"

 

For a more complete example, I've provided another version of the Visual Basic-based Schema Validator sample that uses SAX underneath (see the download).
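      The MyHandler class in this snippet is nothing more than a Visual Basic class module that implements IVBSAXErrorHandler. A minimal sketch might look something like the following (treat the member signatures as approximate; the Implements statement in your project dictates the exact ones):

  ' class module: MyHandler
  Implements IVBSAXErrorHandler

  Private Sub IVBSAXErrorHandler_error(ByVal oLocator As MSXML2.IVBSAXLocator, _
          strErrorMessage As String, ByVal nErrorCode As Long)
      Debug.Print "### error: (" & oLocator.lineNumber & ", " & _
          oLocator.columnNumber & ") " & strErrorMessage
  End Sub

  Private Sub IVBSAXErrorHandler_fatalError(ByVal oLocator As MSXML2.IVBSAXLocator, _
          strErrorMessage As String, ByVal nErrorCode As Long)
      Debug.Print "### fatal error: " & strErrorMessage
  End Sub

  Private Sub IVBSAXErrorHandler_ignorableWarning(ByVal oLocator As MSXML2.IVBSAXLocator, _
          strErrorMessage As String, ByVal nErrorCode As Long)
      ' ignore warnings in this sketch
  End Sub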

DOM + SOM = Reflection

      In addition to validation, schemas can also be used at runtime for reflection purposes. Schemas in the XML world offer all the same benefits that IDL/type libraries provided in the COM+ world including, but not limited to, the following:

  • Better tool support (such as IntelliSense® and WYSIWYG)
  • Code generation (such as header files and source code files)
  • Dynamic proxies/stubs that hide the XML details of Web Services
  • The potential to write programs that write programs

      In MSXML 4.0 you can access schema-based type information while processing an XML document using either the DOM or SAX APIs. Doing so with the DOM requires the cooperation of an XMLSchemaCache object. XMLSchemaCache exposes a getSchema method for retrieving the root schema object.
      For example, the following code retrieves the root schema object for the https://example.org/mytypes schema:

  ' load schema cache
  Dim sc As New XMLSchemaCache40
  sc.add "https://example.org/mytypes", "schema.xsd"

  ' retrieve schema from cache
  Dim schema As ISchema
  Set schema = sc.getSchema("https://example.org/mytypes")
•••

 

Once you have a reference to an ISchema implementation, you have access to everything within the schema definition, such as the target namespace, global element/attribute declarations, simple/complex type definitions, model groups, and notations. These are all exposed through ISchema properties, some of which are of type ISchemaItemCollection (see Figure 8).
      You can drill down into a particular schema item using the appropriate interfaces. The following example in Visual Basic iterates through the collection of global element declarations and prints the element's name along with its type name:

  Dim si As ISchema
  Set si = sc.getSchema("https://example.org/mytypes")
  Dim el As ISchemaElement
  For Each el In si.elements
      Debug.Print el.Name & " (type=" & el.Type.Name & ")"
  Next

 

      This, of course, assumes that you're starting with the schema and you know what you're looking for. The API also supports starting with an IXMLDOMNode and asking for its corresponding schema declaration through the getDeclaration method. For example, the following snippet prints some of the type information for any IXMLDOMNode supplied to the method:

  Public Sub PrintTypeInformation(n As IXMLDOMNode)
      Dim si As ISchemaItem
      Set si = sc.getDeclaration(n)
      Debug.Print si.Name & ", " & si.namespaceURI
      If (TypeOf si Is ISchemaElement) Then
          ' treat as an element declaration
      ElseIf (TypeOf si Is ISchemaAttribute) Then
          ' treat as an attribute declaration
      End If
      ' handle any other declarations you care about
  End Sub

 

Notice that the getDeclaration method returns a generic ISchemaItem reference, which provides basic information like name and namespace URI. For more details on the item, you need to query for the appropriate item-specific interface and use it instead.
      Since the SAX implementation also makes use of the SOM (as illustrated in the validation section), you might still use the schema-drilldown approach even if you want to avoid the DOM throughout the rest of your code. What's different about SAX, however, is how you find out about the type declarations for the elements and attributes that you receive through calls to ContentHandler.
      Instead of modifying the ContentHandler members to deal with type information, a new handler interface called IMXSchemaDeclHandler was defined. This interface only contains a single member named schemaElementDecl that receives an ISchemaElement reference as input:
  Sub schemaElementDecl(ByVal sel As ISchemaElement)
      Code that wants to find out about type information while processing a ContentHandler stream needs to implement IMXSchemaDeclHandler and register the implementation with the SAXXMLReader. Schema validation must also be enabled on the reader, just as in the validation examples. The following example shows how to initialize the reader before calling parse:

  Dim mh As New MyHandler
  Dim reader As New SAXXMLReader40
  reader.putFeature "schema-validation", True
  reader.putProperty "schemas", sc
  reader.putProperty "schema-declaration-handler", mh
  reader.parseURL "monica.xml"

 

      Now the SAXXMLReader will deliver an ISchemaElement reference right before the corresponding call to startElement for each element in the instance document. You can access attribute type information by walking the attribute declarations collection on the element's ISchemaComplexType interface (accessed via the scope property). Interleaving the type information with the ContentHandler stream fits nicely into the streaming programming model.
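      The declaration handler itself can be as simple as a class module that implements the one-member interface and echoes what it receives. Here's a minimal sketch (the name and type properties are the same ones used in the earlier reflection examples):

  ' class module: MyHandler
  Implements IMXSchemaDeclHandler

  Private Sub IMXSchemaDeclHandler_schemaElementDecl( _
          ByVal oSchemaElement As MSXML2.ISchemaElement)
      ' fires just before the corresponding startElement notification
      Debug.Print oSchemaElement.name & " (type=" & oSchemaElement.Type.name & ")"
  End Sub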

XPath and Type Information

      Now that you've seen how to use schemas for both validation and reflection in the DOM and SAX APIs, I'll address some of XML's layered technologies, like XPath and XSLT, that would also benefit from increased type awareness.
      The XPath 1.0 specification doesn't support type-based expressions. In fact, the XPath data model only understands four simple types: node-sets, strings, numbers, and Booleans. It's possible with XPath 1.0 to ask for all element nodes or all comment nodes, but not for all attributes of type double or elements of type Person.
      Since type awareness would be a valuable extension to the core language, the XPath 2.0 working group is currently discussing different approaches to such extensions. And because MSXML 4.0 already supports XML Schema in the underlying APIs, the MSXML team decided to let that functionality bubble up through a set of Microsoft-specific XPath extension functions.
      The way they defined the extension functions is completely in line with the XPath 1.0 specification, but they will obviously only work with MSXML 4.0 today. All of the extension functions are defined in the context of the urn:schemas-microsoft-com:xslt namespace, which must be used to qualify each of the function names. Figure 9 describes each of the XSD-related XPath functions along with its syntax.
      First, the schema-info-available function determines whether XSD information is available for the current node. If it is, the other three functions can be used to find out more about the type. type-local-name and type-namespace-uri simply return the type's local name and namespace URI for either the current node or the supplied node, while type-is can be used to compare the current node's type to the supplied name.
      To use these extension functions, you first need to associate urn:schemas-microsoft-com:xslt with a prefix to use when referencing the function names (for example, ms:type-is). In the DOM, this is accomplished through a call to setProperty as shown in Figure 10.
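      The call looks something like the following sketch (not the exact Figure 10 listing). The key pieces are the SelectionNamespaces property and, as with validation, a schema cache attached to the document so that type information is available to the query:

  Dim sc As New XMLSchemaCache40
  sc.add "https://example.org/mytypes", "schema.xsd"

  Dim doc As New DOMDocument40
  Set doc.schemas = sc
  doc.async = False
  doc.load "monica.xml"

  ' bind the ms: prefix to the extension namespace for selectNodes/selectSingleNode
  doc.setProperty "SelectionNamespaces", _
      "xmlns:ms='urn:schemas-microsoft-com:xslt'"

  Dim toddlers As IXMLDOMNodeList
  Set toddlers = doc.selectNodes( _
      "//*[ms:type-is('https://example.org/mytypes', 'toddlerAge')]")
  Debug.Print toddlers.length & " toddlerAge element(s) found"
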
      If you want to use the extension functions in XSLT, you simply use a standard namespace declaration:

  <xsl:transform version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:ms='urn:schemas-microsoft-com:xslt'
  >
    <xsl:template match="/">
      <xsl:apply-templates select="//*[ms:type-is(
        'https://example.org/mytypes', 'toddlerAge')]"/>
    </xsl:template>
    •••
  </xsl:transform>

 

You still need to associate an XMLSchemaCache object with the input DOMDocument before executing the transformation.
      In the September 2000 XML Files column I provided an interactive XPath expression builder to assist readers in learning the details of the language. I've since updated the utility to support the XSD extension functions by making it possible to associate a schema with the document being queried. After loading both the schema and instance documents and binding a prefix to the extension namespace, you can experiment with extension functions as shown in Figure 11.

Figure 11 Interactive XPath Expression Builder

      In addition to the XSD-related extension functions, MSXML 4.0 has introduced a slew of additional helper functions to facilitate date/time formatting, date/time comparisons, as well as QName resolution, as described in Figure 12. See the MSXML 4.0 documentation for more details on these.

Other SAX Enhancements

      In addition to the schema-related SAX enhancements, several other changes/additions were made to give the core SAX implementation a nice face-lift. These include a faster parser, better SAX/DOM integration support, standard SAX writer implementations for both XML 1.0 and HTML output streams, and a built-in namespace manager to assist with QName processing.
      According to the MSXML team, the new internal XML 1.0 parser is about twice as fast as the existing parser. The MSXML 4.0 SAX implementation uses this new parser by default, but the DOM implementation doesn't. However, you can coerce a DOMDocument into using the new parser by calling setProperty, specifying NewParser before calling load:

  Dim doc As New DOMDocument40
  doc.setProperty "NewParser", True
  doc.load "monica.xml" ' uses newer, faster parser

 

      The improved SAX/DOM integration is very similar to the examples I provided in my November 2000 column that illustrated how to move in both directions between the two APIs. It's nice to have these helper objects as part of the core library since so many SAX developers need this kind of functionality—and now they don't have to start from scratch.
      The MXXMLWriter40 supports building a DOM tree from a SAX stream. MXXMLWriter40 implements the main SAX interfaces and builds the corresponding DOM nodes as it goes. Once it's done, you can access the resulting DOM tree as if you had loaded it through the load method. The following code fragment illustrates this:

  Dim doc As New DOMDocument40
  Dim mx As New MXXMLWriter40
  mx.output = doc

  Dim reader As New SAXXMLReader40
  Set reader.ContentHandler = mx
  ' parse using SAX, but the ContentHandler impl generates the DOM
  reader.parseURL "monica.xml"

  ' print the resulting DOM tree
  Debug.Print doc.xml

 

      You can also go the other way if you want to convert a DOMDocument into a sequence of SAX method calls. The MSXML team made this possible by overloading the SAXXMLReader's parse method to also accept a DOMDocument reference, as shown here:

  Dim doc As New DOMDocument40
  doc.load "monica.xml" ' start with a DOM tree

  Dim reader As New SAXXMLReader40
  Set reader.ContentHandler = New MyHandler
  reader.parse doc ' produces a stream of SAX method calls

 

      The MXXMLWriter40 class also supports outputting XML 1.0, and it allows you to control certain characteristics of the resulting byte stream, including the character encoding, the presence of a byte-order mark, output escaping, pretty-printing (indentations), and whether an XML declaration is produced. For example, the code in Figure 13 shows how to produce an XML 1.0 document.
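      The general shape of that code is to set the writer's output-related properties and then drive it through the SAX content handler interface it implements. Here's a rough sketch (not the exact Figure 13 listing; the element name is arbitrary):

  Dim wrt As New MXXMLWriter40
  wrt.omitXMLDeclaration = False   ' emit the <?xml ...?> declaration
  wrt.indent = True                ' pretty-print the result
  ' encoding and byteOrderMark come into play when output is hooked to a stream

  ' drive the writer through the content handler interface it implements
  Dim ch As IVBSAXContentHandler
  Set ch = wrt
  Dim atts As New SAXAttributes40  ' empty attribute list

  ch.startDocument
  ch.startElement "", "monica", "monica", atts
  ch.characters "29"
  ch.endElement "", "monica", "monica"
  ch.endDocument

  Debug.Print wrt.output           ' the serialized XML 1.0 document
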
      The MSXML team also provided a specialized version of the writer for producing HTML documents called MXHTMLWriter. It works exactly the same way, but it provides special treatment for certain well-known HTML elements. For example, the br, hr, and input elements don't require end tags, so the writer doesn't produce them. In addition, the script and style elements don't require escaping of special characters, so the writer leaves them alone. This writer implementation is especially useful for generating HTML from a Web server environment where performance and overhead are critical.
      In addition to all this new support for writers, there is also a new namespace manager class, MXNamespaceManager, which you can use to manage a stack of namespace bindings and their scope. This is handy if you're building SAX applications that have to process QNames used in element content or attribute values (for example, type='xsd:double').
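      Here's a small sketch of the kind of bookkeeping the namespace manager handles. I'm assuming the declarePrefix, getURI, pushContext, and popContext members here; see the MSXML 4.0 documentation for the exact signatures:

  Dim nsmgr As New MXNamespaceManager40

  ' push a context for the current element and record its namespace declarations
  nsmgr.pushContext
  nsmgr.declarePrefix "xsd", "http://www.w3.org/2001/XMLSchema"

  ' later, a QName in an attribute value such as type='xsd:double' can be resolved
  Dim qname As String, prefix As String
  qname = "xsd:double"
  prefix = Left$(qname, InStr(qname, ":") - 1)
  Debug.Print prefix & " -> " & nsmgr.getURI(prefix)

  ' leaving the element takes its declarations back out of scope
  nsmgr.popContext
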
      In general, most of the boilerplate code that many SAX developers found themselves writing from scratch has been provided as part of the core SAX library through these helper classes. Oh, and if you're a C++ developer, you may want to check out the new C++ SAX application wizard, which is available for download on MSDN (https://support.microsoft.com/default.aspx?scid=kb;en-us;q276505).

Conclusion

      The MSXML 4.0 library is full of new features and functionality related mostly to XML Schema and improved SAX support. The new-and-improved library also boasts performance increases, and even the documentation and sample libraries have been nicely beefed up. The bottom line is that even if you're living in an "unmanaged" reality for the next year or two, MSXML is still full of improvements that will help you enhance your XML-based applications today.

Send questions and comments to Aaron at xmlfiles@microsoft.com.

Aaron Skonnard is an instructor/researcher at DevelopMentor, where he develops the XML and Web service-related curriculum. Aaron coauthored Essential XML Quick Reference (Addison-Wesley, 2001) and Essential XML (Addison-Wesley, 2000). Get in touch with Aaron at https://staff.develop.com/aarons.