This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.
SAX, the Simple API for XML
Aaron Skonnard
Download the code for this article: XMLFiles1100.exe (108KB)
Browse the code for this article at Code Center: SAX ContentHandler Implementations
Today there are two widely accepted APIs for working with XML: the Simple API for XML (SAX) and the Document Object Model (DOM). Both APIs define a set of abstract programmatic interfaces that model the XML Information Set (Infoset). The DOM models the Infoset through a hierarchy of generic nodes that support well-defined interfaces. Due to the DOM's tree-based model, most implementations demand that the entire XML document be contained in memory while processing. SAX, on the other hand, models the Infoset through a linear sequence of well-known method calls. Because SAX doesn't demand resources for an in-memory representation of the document, it's a lightweight alternative to the DOM.
SAX Interfaces
SAX defines a set of interfaces that model the Infoset (see Figure 1). SAX was originally defined for the Java programming language using Java interface definitions. Because the Java-language interfaces are not language-neutral, it's up to tool vendors to decide exactly how the SAX interfaces should map to their specific language. At some point in the future, however, standard language bindings will undoubtedly emerge.
Microsoft added support for SAX in the last two preview releases of MSXML 3.0. In May, they added C++-based support for SAX. A few months later, in July, they added SAX support in Visual Basic®. Each of these language bindings requires a different set of interfaces that reflect the individual language and type restrictions. The names of the MSXML 3.0 SAX interfaces are shown in Figure 2. The names of the MSXML interfaces are the same as they are in the Java language, only prefixed with ISAX for C++ and IVBSAX for Visual Basic (ISAXContentHandler and IVBSAXContentHandler, for example). Throughout the rest of this column, I'll refer to the SAX interfaces by their native names, and you can perform the translation of your choice.
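To make the idea of a "linear sequence of well-known method calls" concrete, here is a minimal sketch that uses Python's standard xml.sax module rather than MSXML. The EventLogger class and the log_events helper are hypothetical names introduced only for illustration, and Python's binding differs slightly from the Java and MSXML ones (without namespace processing, startElement receives just the element name and its attributes).

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each ContentHandler call in the order the parser makes it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append(f"startElement({name})")

    def endElement(self, name):
        self.events.append(f"endElement({name})")

    def characters(self, content):
        self.events.append(f"characters({content!r})")


def log_events(xml_text):
    """Parse xml_text and return the list of calls the handler received."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


events = log_events("<a><b>hi</b></a>")
print("\n".join(events))
```

Nothing resembling a tree is ever built here; the handler only sees one call at a time, which is what keeps the memory cost low.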
ContentHandler
ContentHandler is the primary SAX interface that models the Infoset's core information items (the DTDHandler, LexicalHandler, and DeclHandler interfaces model the rest of the Infoset, plus more). A component that wants to receive an XML document implements ContentHandler, while a component that wants to send an XML document consumes ContentHandler. Figure 3 describes each of the ContentHandler members.
This example generates an XML document that could be serialized as follows:
Notice that startElement and endElement take three name parameters: the element's namespace URI, local name, and qualified name (QName). It's up to the caller to specify the correct namespace information for each element and attribute in the startElement and endElement method calls.
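As a rough sketch of a component that consumes ContentHandler, the following Python code describes a small document purely through method calls. It assumes Python's xml.sax binding, where the namespace URI and local name travel together as a tuple passed to startElementNS and the QName is a separate argument. The send_document helper and the bar child element are hypothetical, and the standard XMLGenerator class stands in for the receiving implementation.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl


def send_document(handler):
    """Describe a small document to any ContentHandler as a series of calls."""
    no_attrs = AttributesNSImpl({}, {})
    handler.startDocument()
    # (namespace URI, local name) plus the QName; None means "no namespace".
    handler.startElementNS((None, "foo"), "foo", no_attrs)
    handler.startElementNS((None, "bar"), "bar", no_attrs)
    handler.characters("hello")
    handler.endElementNS((None, "bar"), "bar")
    handler.endElementNS((None, "foo"), "foo")
    handler.endDocument()


buffer = io.StringIO()
send_document(XMLGenerator(buffer, encoding="utf-8"))
serialized = buffer.getvalue()
print(serialized)
```

Because send_document only depends on the ContentHandler methods, the same function could drive any other implementation, such as one that builds objects instead of text.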
In order to properly process the xsi:type attribute, the ContentHandler implementation needs to know the namespace URI that is associated with the geo prefix. The startPrefixMapping and endPrefixMapping methods supply this information to ContentHandler implementations. The startPrefixMapping method is called just before startElement for the element on which the namespace mapping should begin, and endPrefixMapping is called just after the endElement call that closes the corresponding startElement call. The following consumer code illustrates when they should be called for the foo document fragment cited previously:
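A minimal sketch of this call ordering, written against Python's xml.sax module rather than the MSXML bindings, might look as follows. The urn:example:geo namespace URI, the geo:Point type name, and the TypeResolver and send_foo names are all assumptions made for illustration.

```python
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesNSImpl

XSI = "http://www.w3.org/2001/XMLSchema-instance"
GEO = "urn:example:geo"  # hypothetical URI bound to the geo prefix


class TypeResolver(ContentHandler):
    """Tracks prefix mappings so that xsi:type values can be resolved."""

    def __init__(self):
        super().__init__()
        self.in_scope = {}   # prefix -> stack of namespace URIs
        self.resolved = []   # (namespace URI, local name) per xsi:type seen

    def startPrefixMapping(self, prefix, uri):
        self.in_scope.setdefault(prefix, []).append(uri)

    def endPrefixMapping(self, prefix):
        self.in_scope[prefix].pop()

    def startElementNS(self, name, qname, attrs):
        value = attrs.get((XSI, "type"))
        if value and ":" in value:
            prefix, local = value.split(":", 1)
            self.resolved.append((self.in_scope[prefix][-1], local))


def send_foo(handler):
    """Consumer code: the order of calls for <foo xsi:type="geo:Point"/>."""
    attrs = AttributesNSImpl({(XSI, "type"): "geo:Point"},
                             {(XSI, "type"): "xsi:type"})
    handler.startDocument()
    handler.startPrefixMapping("xsi", XSI)   # just before startElement
    handler.startPrefixMapping("geo", GEO)
    handler.startElementNS((None, "foo"), "foo", attrs)
    handler.endElementNS((None, "foo"), "foo")
    handler.endPrefixMapping("geo")          # just after endElement
    handler.endPrefixMapping("xsi")
    handler.endDocument()


resolver = TypeResolver()
send_foo(resolver)
print(resolver.resolved)
```

Keeping a stack per prefix lets an inner element rebind a prefix and have the outer binding restored when the inner mapping ends.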
The last parameter to startElement is a reference to an Attributes interface, which models the element's collection of attributes. This interface makes it possible to access attribute information by index or name. When accessing information by name, either the QName or the namespace name (namespace URI plus local name) may be used for retrieval. The accessible attribute information includes the attribute's namespace URI, local name, QName, type, and value.
This sample generates a person element that could be serialized like this:
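A sketch of building and reading an attribute collection, assuming Python's xml.sax rather than MSXML, is shown here. Python's AttributesNSImpl offers lookup by namespace name or by QName but, unlike the Java and MSXML interfaces, not by index. The name and age attributes and their values are hypothetical.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl

# Keys are (namespace URI, local name); None means "no namespace".
values = {(None, "name"): "Bob", (None, "age"): "33"}
qnames = {(None, "name"): "name", (None, "age"): "age"}
attrs = AttributesNSImpl(values, qnames)

# Lookup by namespace name and lookup by QName reach the same collection.
by_ns_name = attrs.getValue((None, "name"))
by_qname = attrs.getValueByQName("age")
attr_type = attrs.getType((None, "name"))

# Pass the collection as the last argument when starting the element.
buffer = io.StringIO()
gen = XMLGenerator(buffer, encoding="utf-8", short_empty_elements=True)
gen.startDocument()
gen.startElementNS((None, "person"), "person", attrs)
gen.endElementNS((None, "person"), "person")
gen.endDocument()
serialized = buffer.getvalue()
print(serialized)
```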
The ContentHandler interface uses the characters method to model a sequence of character information items that occurs within element content. The ignorableWhitespace method, which has an identical signature, models any ignorable whitespace that occurs within element content. Whitespace that occurs within element-only content is considered ignorable because it's only present for readability. The only way a processor can determine that a content model is element-only is by looking at the associated DTD or Schema. If no DTD or Schema is present, whitespace is always considered significant. The current MSXML SAX processor is nonvalidating, so all whitespace characters are always considered significant and are therefore passed through as characters instead of ignorableWhitespace.
The resulting serialized document would look like this:
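The distinction can be sketched with Python's xml.sax module, whose default expat-based reader is also nonvalidating. The list and item element names and the WhitespaceCounter class are hypothetical. The first half sends indentation as ignorable whitespace; the second half parses the result and counts how the whitespace comes back.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLGenerator


class WhitespaceCounter(ContentHandler):
    """Counts how whitespace-only text is reported by the parser."""

    def __init__(self):
        super().__init__()
        self.as_characters = 0
        self.as_ignorable = 0

    def characters(self, content):
        if content.isspace():
            self.as_characters += 1

    def ignorableWhitespace(self, whitespace):
        self.as_ignorable += 1


# Sending side: indentation goes out as ignorable whitespace, text as characters.
buffer = io.StringIO()
gen = XMLGenerator(buffer, encoding="utf-8")
gen.startDocument()
gen.startElement("list", {})
gen.ignorableWhitespace("\n  ")
gen.startElement("item", {})
gen.characters("one & two")
gen.endElement("item")
gen.ignorableWhitespace("\n")
gen.endElement("list")
gen.endDocument()
serialized = buffer.getvalue()

# Receiving side: with no DTD, the same whitespace arrives through characters.
counter = WhitespaceCounter()
xml.sax.parseString(serialized.encode("utf-8"), counter)
print(serialized)
print(counter.as_characters, counter.as_ignorable)
```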
The last couple of ContentHandler content-related methods are meant for modeling processing instructions and skipped entities. The processingInstruction method takes target and data arguments, which correspond to the form <?target data data data?>. The data portion of a processing instruction is everything that comes after the whitespace that separates it from the target. The skippedEntity method signals that the caller skipped a specific entity identified by name. The following code illustrates these methods:
The document that this code represents might look something like this, assuming that the ouch entity is skipped by the caller:
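A minimal sketch of both calls, along with the text they could be serialized to, follows. It uses Python's xml.sax names rather than MSXML's; the hardcopy processing instruction, its data, and the TinySerializer class are hypothetical, and the serializer deliberately ignores attributes to stay short.

```python
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import escape


class TinySerializer(ContentHandler):
    """Turns the calls it receives back into markup (attributes ignored)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def startElement(self, name, attrs):
        self.parts.append(f"<{name}>")

    def endElement(self, name):
        self.parts.append(f"</{name}>")

    def characters(self, content):
        self.parts.append(escape(content))

    def processingInstruction(self, target, data):
        self.parts.append(f"<?{target} {data}?>")

    def skippedEntity(self, name):
        # The caller did not expand the entity, so write the reference back.
        self.parts.append(f"&{name};")


handler = TinySerializer()
handler.startDocument()
handler.processingInstruction("hardcopy", "duplex=yes")
handler.startElement("foo", {})
handler.characters("That hurt: ")
handler.skippedEntity("ouch")
handler.endElement("foo")
handler.endDocument()
document = "".join(handler.parts)
print(document)
```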
The last member of ContentHandler, setDocumentLocator, is used for passing a Locator interface reference to the ContentHandler implementation. The implementation can use the Locator interface to query for contextual information about the caller, such as line and column number and public/system identifiers of the current entity. This information comes in handy, especially when the caller is an XML parser.
Now that I've looked at ContentHandler fundamentals and various ContentHandler consumer examples, let's look at some sample ContentHandler implementations.
ContentHandler Implementations
The canonical example of a ContentHandler implementation simply serializes the received method calls back out as an XML 1.0 namespace-aware document. The code for such an implementation needs to follow the syntactical productions defined by XML 1.0 and Namespaces for serializing each information item to the output stream. I've provided a fairly generic implementation of this in a Visual Basic class named CSerializer. The partial code for CSerializer is shown in Figure 5.
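A far smaller implementation than CSerializer is enough to illustrate the Locator mechanism described above. This sketch assumes Python's xml.sax module and its default expat-based reader; the LineReporter class and the order and item element names are hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LineReporter(ContentHandler):
    """Remembers the line on which each start tag was reported."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.positions = []

    def setDocumentLocator(self, locator):
        # Called once by the reader, before any other content event.
        self.locator = locator

    def startElement(self, name, attrs):
        self.positions.append((name, self.locator.getLineNumber()))


doc = b"<order>\n  <item/>\n  <item/>\n</order>"
reporter = LineReporter()
xml.sax.parseString(doc, reporter)
print(reporter.positions)
```

The locator is only meaningful while a callback is running, so the handler reads it inside startElement rather than storing line numbers up front.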
These features have an initial state of true and false, respectively.
CSerializer allows the client to register an output stream before an XMLReader begins making the content-related method calls (startElement, processingInstruction, and so on). The output stream can be one of three types: a reference to a text file (Scripting.TextStream), a reference to the ASP Response object (ASPTypeLibrary.Response), or a standard Visual Basic-based textbox control for display and testing purposes. It would be fairly easy to extend the class to support additional output types. The following example illustrates how to use CSerializer to generate an XML file on disk:
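The same pattern, registering an output stream and then letting a reader make the content calls, can be sketched in Python. This is not CSerializer: the standard XMLGenerator class stands in for it, copy_to_file is a hypothetical helper, and the greeting document and temporary file path are made up for the example.

```python
import os
import tempfile
import xml.sax
from xml.sax.saxutils import XMLGenerator


def copy_to_file(xml_bytes, path):
    """Parse xml_bytes and serialize the resulting SAX calls to a file."""
    with open(path, "w", encoding="utf-8") as stream:
        # Register the output stream first...
        serializer = XMLGenerator(stream, encoding="utf-8")
        # ...then let the reader make the content-related calls.
        xml.sax.parseString(xml_bytes, serializer)


path = os.path.join(tempfile.mkdtemp(), "greeting.xml")
copy_to_file(b'<greeting kind="warm">hello</greeting>', path)
with open(path, encoding="utf-8") as stream:
    written = stream.read()
print(written)
```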
Generating XML from within an ASP page is just as trivial, as shown by the following example:
A more practical implementation of ContentHandler could deserialize an XML document into an application-specific type. For example, consider the following XML document that represents a CInvoice class:
Another implementation of ContentHandler could simply deserialize this document back into a native CInvoice instance. To accomplish this, the ContentHandler implementation must maintain several pieces of contextual information during processing. One way to do this is to build a simple state machine that represents the current element context. Using a stack, startElement can push the current element state so future method calls can figure out which element is currently being processed, as shown in Figure 6. The endElement method needs to pop the current element state off the stack to keep things balanced. It's then up to the characters method to populate the instance with the appropriate data at the appropriate time. Figure 7 shows the complete class file.
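The stack-based approach can be sketched as follows in Python. This is not the CInvoice code from Figures 6 and 7: the Invoice and LineItem classes, the element names, and the sample document are all hypothetical. The sketch also buffers text and assigns it in endElement rather than directly in characters, because a parser may deliver one run of text in several characters calls.

```python
import xml.sax
from dataclasses import dataclass, field
from xml.sax.handler import ContentHandler


@dataclass
class LineItem:
    sku: str = ""
    price: float = 0.0


@dataclass
class Invoice:
    invoice_id: str = ""
    customer: str = ""
    items: list = field(default_factory=list)


class InvoiceHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self.invoice = Invoice()
        self.stack = []   # names of the elements that are currently open
        self.text = []    # text chunks seen since the last start or end tag

    def startElement(self, name, attrs):
        self.stack.append(name)       # push the new element context
        self.text = []
        if name == "LineItem":
            self.invoice.items.append(LineItem())

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        parent = self.stack[-2] if len(self.stack) > 1 else None
        if name == "InvoiceID":
            self.invoice.invoice_id = value
        elif name == "CustomerName":
            self.invoice.customer = value
        elif parent == "LineItem" and name == "Sku":
            self.invoice.items[-1].sku = value
        elif parent == "LineItem" and name == "Price":
            self.invoice.items[-1].price = float(value)
        self.stack.pop()              # pop to keep the stack balanced
        self.text = []


DOC = (b"<Invoice><InvoiceID>1234</InvoiceID><CustomerName>Acme</CustomerName>"
       b"<LineItems><LineItem><Sku>A-1</Sku><Price>9.5</Price></LineItem>"
       b"<LineItem><Sku>B-2</Sku><Price>20</Price></LineItem></LineItems></Invoice>")

handler = InvoiceHandler()
xml.sax.parseString(DOC, handler)
invoice = handler.invoice
print(invoice)
```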
I've provided several other ContentHandler implementations with this column that illustrate how SAX and DOM can be used together. The CDOM2SAX class demonstrates how to walk a DOM tree and emit ContentHandler method calls. The CSAX2DOM class demonstrates how to build a DOM tree from a stream of ContentHandler method calls. And finally, instead of building a full DOM tree for a document, CSAXFilter2DOM demonstrates how to filter for certain elements and only build the identified element subtrees. These classes are available as part of the sample download for this issue, located at the link at the top of this article.
Other Interfaces
ContentHandler implementations can also implement the ErrorHandler, DTDHandler, and EntityResolver interfaces. The ErrorHandler interface is used to provide custom handling of caller-generated errors. An implementation of the IVBSAXErrorHandler interface would look like the code shown in Figure 9.
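For comparison with the Visual Basic version, here is a rough sketch of an ErrorHandler implementation using Python's xml.sax module. The CollectingErrorHandler class, the check helper, and the malformed sample document are hypothetical; the three callback names (warning, error, fatalError) are the ones SAX defines.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class CollectingErrorHandler(ErrorHandler):
    """Records every problem the reader reports, with its line number."""

    def __init__(self):
        self.reports = []

    def _record(self, kind, exception):
        self.reports.append(
            (kind, exception.getLineNumber(), exception.getMessage()))

    def warning(self, exception):
        self._record("warning", exception)

    def error(self, exception):
        self._record("error", exception)

    def fatalError(self, exception):
        self._record("fatalError", exception)
        raise exception  # a fatal error ends the parse


def check(xml_bytes):
    """Parse xml_bytes and return whatever the error handler collected."""
    errors = CollectingErrorHandler()
    try:
        xml.sax.parseString(xml_bytes, ContentHandler(), errors)
    except xml.sax.SAXParseException:
        pass  # already recorded by fatalError
    return errors.reports


reports = check(b"<a>\n<b>\n</a>")  # </a> does not match the open <b>
print(reports)
```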
The DTDHandler interface makes it possible to process unparsed entities and notations that appear in the document's DTD. Unparsed entities are used to attach non-XML data streams (identified by a system/public ID pair) to an XML document for application-level processing. Every unparsed entity is also associated with a notation that specifies what type of data the entity represents. The DTDHandler interface contains two methods for processing these items: unparsedEntityDecl and notationDecl.
SAX also defines the EntityResolver interface for custom resolution of external entities. The interface contains a single method, resolveEntity, which allows the implementation to supply application-specific resolution rules for external entity identifiers. MSXML supports the DTDHandler interface, but does not currently support EntityResolver. In addition, the current MSXML release doesn't implement anything equivalent to the SAX InputSource class, which was defined to encapsulate all SAX I/O.
Extended Interfaces
Although not part of core SAX, there are two additional interfaces, LexicalHandler and DeclHandler, that make it possible to process lexical and DTD-related document information. In the Java-language space, these interfaces are distributed in a separate package called SAX-ext to emphasize that processors are not required to support them. Because these interfaces are considered extensions, they are not registered with an XMLReader in the same way as the other interfaces, as you'll see in the next section.
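Returning to the core DTDHandler interface for a moment, its two callbacks can be sketched with Python's xml.sax module and its default expat-based reader. The DeclCollector class and the sample DTD (a gif notation and a logo entity) are hypothetical.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, DTDHandler


class DeclCollector(DTDHandler):
    """Collects notation and unparsed entity declarations from the DTD."""

    def __init__(self):
        self.notations = {}   # notation name -> system identifier
        self.unparsed = {}    # entity name -> (system identifier, notation)

    def notationDecl(self, name, publicId, systemId):
        self.notations[name] = systemId

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        self.unparsed[name] = (systemId, ndata)


DOC = b"""<!DOCTYPE doc [
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
]>
<doc/>"""

collector = DeclCollector()
reader = xml.sax.make_parser()
reader.setContentHandler(ContentHandler())
reader.setDTDHandler(collector)
reader.parse(io.BytesIO(DOC))
print(collector.notations, collector.unparsed)
```

The reader never fetches logo.gif; it only reports the declaration, and it is up to the application to decide what to do with the identified data.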
The DeclHandler interface makes it possible to signal DTD declarations including:
XMLReader
XMLReader is the main interface that SAX producers implement. XMLReader serves several purposes. First, XML consumers use this interface to register their implementations of the other SAX interfaces (such as ContentHandler, ErrorHandler, and so on). Second, XMLReader makes it possible to configure the desired behavior of the SAX processor through two generic methods: setFeature and setProperty. And finally, XMLReader encapsulates parsing functionality.
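Those three roles (registration, configuration, and parsing) can be sketched with Python's xml.sax module, whose make_parser function returns an XMLReader. The NameCollector class and the urn:example:a namespace are hypothetical; the method and feature names are the ones Python's binding exposes, not MSXML's.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler, feature_namespaces


class NameCollector(ContentHandler):
    """Collects the (namespace URI, local name) of every element."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElementNS(self, name, qname, attrs):
        self.names.append(name)


collector = NameCollector()
reader = xml.sax.make_parser()

# 1. Register the consumer's interface implementations.
reader.setContentHandler(collector)
reader.setErrorHandler(ErrorHandler())

# 2. Configure the processor; here, turn on namespace processing.
reader.setFeature(feature_namespaces, True)

# 3. Parse, which pushes the document through the registered handlers.
reader.parse(io.BytesIO(b'<a xmlns="urn:example:a"><b/></a>'))
print(collector.names)
```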
XML consumers can also configure the behavior of the producer with respect to the different aspects of processing an XML document. The setFeature and setProperty methods (named putFeature and putProperty in the MSXML bindings) encapsulate this process. SAX defines several well-known features and properties that SAX processors are encouraged to recognize, although not all of the standard features or properties make sense for all implementations. Implementors are also free to add their own proprietary extensions as long as they use unique feature and property IDs (see the MSXML documentation for the features and properties it supports).
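How a processor responds to features it does or does not recognize can be sketched as follows. The example assumes Python's xml.sax module and its default expat-based reader, which is nonvalidating; try_feature is a hypothetical helper and urn:example:no-such-feature is a made-up feature ID.

```python
import xml.sax
from xml.sax import SAXNotRecognizedException, SAXNotSupportedException
from xml.sax.handler import feature_namespaces, feature_validation


def try_feature(reader, feature, state):
    """Attempt to set a feature and describe how the reader responded."""
    try:
        reader.setFeature(feature, state)
        return "ok"
    except SAXNotRecognizedException:
        return "not recognized"   # the reader has never heard of this ID
    except SAXNotSupportedException:
        return "not supported"    # known ID, but this value cannot be honored


reader = xml.sax.make_parser()
results = {
    "namespaces": try_feature(reader, feature_namespaces, True),
    "validation": try_feature(reader, feature_validation, True),
    "made-up": try_feature(reader, "urn:example:no-such-feature", True),
}
print(results)
```

Distinguishing the two exceptions lets a consumer tell a misspelled or proprietary ID apart from a standard feature that this particular processor cannot provide.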
Also, for an application to register a LexicalHandler/DeclHandler interface with an XMLReader it must do so through a putProperty call, as shown here:
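In Python's xml.sax module the same idea applies: the lexical handler is supplied through setProperty rather than through a dedicated registration method. The sketch below assumes the default expat-based reader; the LexicalRecorder class and the sample document are hypothetical.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, property_lexical_handler


class LexicalRecorder:
    """Implements the LexicalHandler callbacks and records what it sees."""

    def __init__(self):
        self.events = []

    def comment(self, content):
        self.events.append(("comment", content))

    def startCDATA(self):
        self.events.append(("startCDATA",))

    def endCDATA(self):
        self.events.append(("endCDATA",))

    def startDTD(self, name, public_id, system_id):
        self.events.append(("startDTD", name))

    def endDTD(self):
        self.events.append(("endDTD",))


recorder = LexicalRecorder()
reader = xml.sax.make_parser()
reader.setContentHandler(ContentHandler())
# Extension handlers are registered as a property, not with a set*Handler call.
reader.setProperty(property_lexical_handler, recorder)
reader.parse(io.BytesIO(b"<doc><!-- note --><![CDATA[a < b]]></doc>"))
print(recorder.events)
```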
Once the XML consumer has registered its interface implementations and configured the behavior, the producer can start pushing the document through the appropriate methods. One of the most common types of SAX producers is an XML parser. As the XML parser decomposes the XML 1.0 serialized stream, it can pass the information to the registered SAX consumer. The SAXXMLReader and VBSAXXMLReader coclasses do exactly this.
The SAX model of streaming Infosets between producer and consumer can be extended to include additional transparent interceptors. As long as a component understands the SAX interfaces, it can be placed between a producer and consumer to filter the method calls going in either direction. This makes it possible to build SAX pipelines that consist of several components, each of which has a specific processing responsibility.
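A small pipeline of this kind can be sketched with Python's xml.sax module, which ships an XMLFilterBase class for writing interceptors. The DropElementFilter class and the secret and public element names are hypothetical; the filter sits between the parser (producer) and an XMLGenerator (consumer) and swallows every call that belongs to a chosen element.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class DropElementFilter(XMLFilterBase):
    """Swallows every event inside elements with the given name."""

    def __init__(self, parent, name):
        super().__init__(parent)
        self._name = name
        self._depth = 0   # greater than zero while inside a dropped element

    def startElement(self, name, attrs):
        if self._depth or name == self._name:
            self._depth += 1
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self._depth:
            self._depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        if not self._depth:
            super().characters(content)


out = io.StringIO()
pipeline = DropElementFilter(xml.sax.make_parser(), "secret")
pipeline.setContentHandler(XMLGenerator(out, encoding="utf-8"))
pipeline.parse(io.BytesIO(b"<doc><public>a</public><secret>b<x/></secret></doc>"))
filtered = out.getvalue()
print(filtered)
```

Because a filter is itself an XMLReader, another filter can wrap it in turn, which is how longer pipelines are assembled.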
This type of stream-based processing model is not possible with a traditional DOM implementation. Finally, I pulled together several SAX methods into a unified sample, shown in Figure 12. The code for this sample is available from the link at the top of this article.
Conclusion
SAX offers a lightweight alternative to DOM. It facilitates searching through a huge XML document to extract small pieces of information, and it allows premature aborting when the desired piece of information is found. SAX was designed for any task where the overhead of the DOM is too expensive.
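One way to abort early, sketched here with Python's xml.sax module, is to raise an exception from inside the handler once the desired data has been seen. This is a common pattern rather than a dedicated SAX feature; the Found exception, the FirstTextOf handler, and the find_first helper are hypothetical names.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class Found(Exception):
    """Raised by the handler to stop the parse and carry the result out."""

    def __init__(self, value):
        super().__init__(value)
        self.value = value


class FirstTextOf(ContentHandler):
    """Stops parsing as soon as the first matching element has closed."""

    def __init__(self, name):
        super().__init__()
        self._name = name
        self._inside = False
        self._text = []

    def startElement(self, name, attrs):
        if name == self._name:
            self._inside = True

    def characters(self, content):
        if self._inside:
            self._text.append(content)

    def endElement(self, name):
        if self._inside and name == self._name:
            raise Found("".join(self._text))


def find_first(xml_bytes, name):
    """Return the text of the first element called name, or None."""
    try:
        xml.sax.parseString(xml_bytes, FirstTextOf(name))
    except Found as hit:
        return hit.value
    return None


print(find_first(b"<r><t>first</t><t>second</t></r>", "t"))
```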
Aaron Skonnard is an instructor and researcher at DevelopMentor, where he develops the XML curriculum. Aaron coauthored Essential XML (Addison-Wesley Longman, 2000) and wrote Essential WinInet (Addison-Wesley Longman, 1998). Get in touch with Aaron at https://staff.develop.com/aarons.
From the November 2000 issue of MSDN Magazine.