Binary Encoding, Part 3
Today I’ll talk about the XML features that are and aren’t supported by the binary encoding format we use in WCF.
The binary format was designed for a specific purpose: round-tripping essentially the XML infoset being manipulated in memory, as opposed to round-tripping the rendered XML document. Several features that are only relevant at the level of a rendered document are therefore omitted. Similarly, features that differ from other features only in how they are rendered are omitted to canonicalize the representation.
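As a rough illustration of the difference between a rendered document and the infoset, the sketch below parses several differently rendered documents and compares what ends up in memory. It uses Python's standard `xml.etree.ElementTree` parser purely as a stand-in; it is not the WCF reader, and the `infoset_view` helper is a hypothetical name introduced for this example.

```python
import xml.etree.ElementTree as ET


def infoset_view(document):
    """Return a simplified in-memory view of a one-element document.

    Hypothetical helper: only the tag, attributes, and text are kept,
    which is a simplification of the full XML infoset.
    """
    root = ET.fromstring(document)
    return (root.tag, dict(root.attrib), root.text or "")


# Three renderings of the same character content.
text_variants = [
    "<a>x &lt; y</a>",           # predefined entity reference
    "<a>x &#60; y</a>",          # numeric character reference
    "<a><![CDATA[x < y]]></a>",  # CDATA section
]
text_views = [infoset_view(doc) for doc in text_variants]

# Two renderings of an element without content.
empty_views = [infoset_view("<a/>"), infoset_view("<a></a>")]

# Each group collapses to a single in-memory representation.
print(len(set(map(repr, text_views))), len(set(map(repr, empty_views))))
```

Once parsed, nothing records which rendering was used, so a format that only has to round-trip the in-memory data does not need a way to express the difference.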
Here’s the general list of XML features that are not supported:
- Documents that look a lot like XML but aren’t syntactically correct (many web pages don’t strictly follow the rules for XML because web browsers are generally very forgiving)
- Processing instructions (including the XML processing instruction that contains the character set)
- DTDs
- Character references and expansions
- The compact, self-closing form for elements without content, as opposed to a separate closing tag (since we only support legal XML and human readability is not a goal, we can always encode the end tag as a single-byte token)
- CDATA sections
- Preservation of significant whitespace
That leaves almost every other XML feature you might think of as supported by one record type or another. The list includes structural features, such as elements, attributes, namespace declarations, and comments. The list also includes content features, such as booleans, integers, floating-point numbers, fixed-point numbers, strings, dates, time spans, byte arrays, guids, unique identifiers, and qualified names.
The binary format gets its compactness primarily from three things: the choice of supported record types, variable-sized integers that reflect the fact that most of the needed values are small, and numerical references to interned strings in place of repeating the contents of the string each time it is used. Going over some examples of records next time should illustrate these common features.
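To make two of these ideas concrete, here is a minimal sketch of a variable-sized integer and an interned string table. It is not the actual wire format: the 7-bits-per-byte layout with a continuation bit is assumed for illustration, and `encode_varint`, `decode_varint`, and `StringTable` are hypothetical names.

```python
def encode_varint(value):
    """Encode a non-negative integer using 7 data bits per byte.

    Assumed layout for illustration: low-order group first, high bit set
    on every byte except the last.
    """
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        low = value & 0x7F
        value >>= 7
        if value:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)         # final byte
            return bytes(out)


def decode_varint(data, offset=0):
    """Decode one integer; return (value, offset of the next unread byte)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


class StringTable:
    """Maps each distinct string to a small integer id (hypothetical helper)."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def intern(self, text):
        # Assign the next free id the first time a string is seen.
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, index):
        return self._strings[index]


table = StringTable()
names = ["Envelope", "Body", "Envelope", "Envelope"]
# Each name becomes a one-byte reference instead of repeated text.
encoded = b"".join(encode_varint(table.intern(name)) for name in names)
```

In this sketch, values below 128 cost a single byte, and a repeated element name costs one small integer per use once the string has been added to the table.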
Comments
Anonymous
September 19, 2009
Hi Nicholas, why wasn't the Fast Infoset standard an option? But basically XML compression is a very good thing, as memory consumption can really explode in XML. A dataset of 20 megs is easily 100 megs in XML. Then we translate the XML into another document, which is another 100 megs. This means that for processing the data in XML we need 10 times the memory of the original data size. Cheers, Tobias
Anonymous
September 20, 2009
Hi Tobias, The Fast Infoset standard is more complex than the binary format we ended up using, and slower given the message processing pipeline we have. The advantage of Fast Infoset is a smaller message on the wire, although the two are essentially the same after compression is applied. I think Fast Infoset would have been interesting as the interoperable alternative instead of MTOM, although getting the greatest value from Fast Infoset requires using an external string table (which isn't standardized).