Surrogate Pair Characters in an XML Document

A surrogate pair is a pair of 16-bit Unicode encoding values that together represent a single character. The key point to remember is that a surrogate pair occupies 32 bits but is still one character, so you can no longer assume that one 16-bit Unicode encoding value maps to exactly one character.

The first value of the surrogate pair is the high surrogate, a 16-bit code value in the range U+D800 to U+DBFF. The second value of the pair, the low surrogate, is in the range U+DC00 to U+DFFF. By using surrogate pairs, a 16-bit Unicode encoding can address more than one million additional characters (2^20 = 1,048,576) beyond the 16-bit range, as defined by the Unicode standard.
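As a rough illustration of the ranges above, the following sketch splits a supplementary code point into its high and low surrogates. The class name SurrogateBasics and the method ToSurrogates are hypothetical names chosen for this example; char.ConvertFromUtf32 is the framework method it is compared against.

```csharp
using System;

static class SurrogateBasics
{
    // Splits a code point in the range U+10000..U+10FFFF into a UTF-16 surrogate pair.
    // (Hypothetical helper written for this illustration.)
    public static void ToSurrogates(int codePoint, out char high, out char low)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");

        int offset = codePoint - 0x10000;          // 20-bit value
        high = (char)(0xD800 + (offset >> 10));    // upper 10 bits -> U+D800..U+DBFF
        low = (char)(0xDC00 + (offset & 0x3FF));   // lower 10 bits -> U+DC00..U+DFFF
    }

    static void Main()
    {
        char high, low;
        ToSurrogates(0x1F600, out high, out low);
        Console.WriteLine("{0:X4} {1:X4}", (int)high, (int)low);   // D83D DE00

        // One character, but two 16-bit values in the string.
        string s = char.ConvertFromUtf32(0x1F600);
        Console.WriteLine(s.Length);                               // 2
    }
}
```

The string length of 2 for a single character is the practical consequence of the point made above: code that counts or indexes 16-bit values is not counting characters.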

You can use surrogate characters in any string passed to an XmlTextWriter method. However, the surrogate character should be valid in the XML being written. For example, the W3C recommendation does not allow surrogate characters inside element or attribute names. If the string contains an invalid surrogate pair, an exception is thrown.
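One way to avoid the exception is to check a string for unpaired surrogates before passing it to the writer. The sketch below is only an illustration; SurrogateCheck and IndexOfUnpairedSurrogate are hypothetical names, and the check covers pairing only, not the other XML character rules.

```csharp
using System;

static class SurrogateCheck
{
    // Returns the index of the first unpaired surrogate in text,
    // or -1 if every high surrogate is followed by a low surrogate.
    // (Hypothetical helper written for this illustration.)
    public static int IndexOfUnpairedSurrogate(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c))
            {
                if (i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
                    i++;            // skip the low half of a valid pair
                else
                    return i;       // high surrogate with no low surrogate after it
            }
            else if (char.IsLowSurrogate(c))
            {
                return i;           // low surrogate with no high surrogate before it
            }
        }
        return -1;
    }

    static void Main()
    {
        Console.WriteLine(IndexOfUnpairedSurrogate("ab" + char.ConvertFromUtf32(0x10400) + "c")); // -1
        Console.WriteLine(IndexOfUnpairedSurrogate("ab\uD801c"));                                 // 2
    }
}
```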

Additionally, you can use WriteSurrogateCharEntity to write the character entity corresponding to a surrogate pair. The character entity is written in hexadecimal format and is generated using the formula:

(highChar - 0xD800) * 0x400 + (lowChar - 0xDC00) + 0x10000
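The formula can be checked with a few worked values. The following sketch implements it directly and compares one result with the framework's char.ConvertToUtf32; the names SurrogateFormula and ToCodePoint are hypothetical and exist only for this illustration.

```csharp
using System;

static class SurrogateFormula
{
    // Direct transcription of the formula above.
    // (Hypothetical helper written for this illustration.)
    public static int ToCodePoint(char highChar, char lowChar)
    {
        return (highChar - 0xD800) * 0x400 + (lowChar - 0xDC00) + 0x10000;
    }

    static void Main()
    {
        // Lowest pair: 0 * 0x400 + 0 + 0x10000
        Console.WriteLine("{0:X}", ToCodePoint('\uD800', '\uDC00'));   // 10000

        // Highest pair: 0x3FF * 0x400 + 0x3FF + 0x10000
        Console.WriteLine("{0:X}", ToCodePoint('\uDBFF', '\uDFFF'));   // 10FFFF

        // The result agrees with the framework conversion.
        Console.WriteLine(ToCodePoint('\uD801', '\uDC01')
                          == char.ConvertToUtf32('\uD801', '\uDC01')); // True
    }
}
```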

If the two characters do not form a valid surrogate pair, an exception is thrown. The following example shows the WriteSurrogateCharEntity method with a surrogate pair as input.

// The following line writes &#x10000;
// Note the argument order: low surrogate first, then high surrogate.
WriteSurrogateCharEntity('\uDC00', '\uD800');
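For a version of this call that can be compiled and run on its own, the following sketch writes the entity into an in-memory StringWriter rather than a file. The class name EntityDemo, the method WriteEntity, and the element name root are assumptions made for this illustration.

```csharp
using System;
using System.IO;
using System.Xml;

static class EntityDemo
{
    // Writes a single surrogate character entity inside a <root> element
    // and returns the resulting markup. (Hypothetical helper for illustration.)
    public static string WriteEntity(char lowChar, char highChar)
    {
        StringWriter output = new StringWriter();
        XmlTextWriter tw = new XmlTextWriter(output);
        tw.WriteStartElement("root");
        tw.WriteSurrogateCharEntity(lowChar, highChar);   // low surrogate first
        tw.WriteEndElement();
        tw.Close();
        return output.ToString();
    }

    static void Main()
    {
        // The element content should be the hexadecimal entity for U+10000.
        Console.WriteLine(WriteEntity('\uDC00', '\uD800'));
    }
}
```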

The following example produces a surrogate pair file, loads it into the XmlReader, and saves the file out with a new name. The original and new file are then loaded back into the application in the DOM structure for comparison.

char lowChar, highChar;
char [] charArray = new char[10];
FileStream targetFile = new FileStream("SurrogatePair.xml",
      FileMode.Create, FileAccess.ReadWrite, FileShare.ReadWrite);

lowChar = Convert.ToChar(0xDC00);
highChar = Convert.ToChar(0xD800);
XmlTextWriter tw = new XmlTextWriter(targetFile, null);
tw.Formatting = Formatting.Indented;
tw.WriteStartElement("root");
tw.WriteStartAttribute("test", null);
tw.WriteSurrogateCharEntity(lowChar, highChar);
lowChar = Convert.ToChar(0xDC01);
highChar = Convert.ToChar(0xD801);
tw.WriteSurrogateCharEntity(lowChar, highChar);
lowChar = Convert.ToChar(0xDFFF);
highChar = Convert.ToChar(0xDBFF);
tw.WriteSurrogateCharEntity(lowChar, highChar);

// Add five random surrogate pairs (10 chars) to the buffer.
// In little-endian UTF-16 the low-order byte is stored
// first in memory, for example, word 6A21 as 21 6A.
// "High" and "low" here are meant in the logical sense.
Random random = new Random();
for (int i = 0; i < charArray.Length; i += 2) {
      lowChar = Convert.ToChar(random.Next(0xDC00, 0xE000));
      highChar = Convert.ToChar(random.Next(0xD800, 0xDC00));
      charArray[i] = highChar;
      charArray[i + 1] = lowChar;
}
tw.WriteChars(charArray, 0, charArray.Length);

for (int i = 0; i < 10; ++i) {
      lowChar = Convert.ToChar(random.Next(0xDC00, 0xE000));
      highChar = Convert.ToChar(random.Next(0xD800, 0xDC00));
      tw.WriteSurrogateCharEntity(lowChar, highChar);
}

tw.WriteEndAttribute();
tw.WriteEndElement();
tw.Flush();
tw.Close();

XmlTextReader r = new XmlTextReader("SurrogatePair.xml");

r.Read();
r.MoveToFirstAttribute();
targetFile = new FileStream("SurrogatePairFromReader.xml",
       FileMode.Create, FileAccess.ReadWrite, FileShare.ReadWrite);

tw = new XmlTextWriter(targetFile, null);
tw.Formatting = Formatting.Indented;
tw.WriteStartElement("root");
tw.WriteStartAttribute("test", null);
tw.WriteString(r.Value);
tw.WriteEndAttribute();
tw.WriteEndElement();
tw.Flush();
tw.Close();
r.Close();

// Load both result files into DOM and compare.
XmlDocument doc1 = new XmlDocument();
XmlDocument doc2 = new XmlDocument();
doc1.Load("SurrogatePair.xml");
doc2.Load("SurrogatePairFromReader.xml");
if (doc1.InnerXml != doc2.InnerXml) {
      Console.WriteLine("Surrogate Pair test case failed");
}

When you write with the WriteChars method, which writes one buffer of data at a time, a surrogate pair in the input can accidentally be split across two buffers. Because the surrogate ranges are well defined, WriteChars identifies any Unicode value from the high or low surrogate range as one half of a surrogate pair. If a write from the buffer would split a surrogate pair, an exception is thrown. You must catch this exception to continue writing the next surrogate pair character to the output buffer. The following example calls the WriteChars method with a buffer that ends in the high half of a randomly generated surrogate pair, so the pair is split when the buffer is written.

using System;
using System.Xml;

class SurrogateSplit {

    static void Main() {

        char [] charArray = new char[4];
        char lowChar, highChar;
        Random random = new Random();

        lowChar = Convert.ToChar(random.Next(0xDC00, 0xE000));
        highChar = Convert.ToChar(random.Next(0xD800, 0xDC00));

        XmlTextWriter tw = new XmlTextWriter("test.xml", null);
        tw.WriteStartElement("Root");
        charArray[0] = 'a';
        charArray[1] = 'b';
        charArray[2] = 'c';
        charArray[3] = highChar;

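        try {
            tw.WriteChars(charArray, 0, charArray.Length);
        } catch (Exception ex) {

            // Matching on the message text is fragile; it assumes
            // the English wording of the split-pair error.
            if (ex.Message.IndexOf("second character") >= 0) {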

                charArray[0] = highChar;
                charArray[1] = lowChar;
                charArray[2] = 'd';
                tw.WriteChars(charArray, 0, 3);
            }
        }
        tw.WriteEndElement();
        tw.Close();
    }
}

Catching the exception and continuing to write to the buffer ensures that the surrogate pair character is written correctly to the output stream.
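As a possible alternative to catching the exception, the caller can choose buffer boundaries so that no write ends on a high surrogate. The sketch below is not the approach taken in the preceding example; ChunkedWriter, WriteInChunks, and the chunkSize parameter are hypothetical names introduced for this illustration, and the output goes to an in-memory StringWriter.

```csharp
using System;
using System.IO;
using System.Xml;

static class ChunkedWriter
{
    // Writes data in pieces of at most chunkSize chars, shortening a piece by
    // one char when it would otherwise end on a high surrogate whose low
    // surrogate is still waiting in the input.
    // (Hypothetical helper written for this illustration.)
    public static void WriteInChunks(XmlWriter writer, char[] data, int chunkSize)
    {
        if (chunkSize < 2)
            throw new ArgumentOutOfRangeException("chunkSize"); // a pair needs two chars

        int index = 0;
        while (index < data.Length)
        {
            int count = Math.Min(chunkSize, data.Length - index);

            // Defer a trailing high surrogate to the next piece.
            if (index + count < data.Length && char.IsHighSurrogate(data[index + count - 1]))
                count--;

            writer.WriteChars(data, index, count);
            index += count;
        }
    }

    static void Main()
    {
        // 'a', 'b', 'c', high surrogate, low surrogate, 'd'
        char[] data = ("abc" + char.ConvertFromUtf32(0x10000) + "d").ToCharArray();

        StringWriter output = new StringWriter();
        XmlTextWriter tw = new XmlTextWriter(output);
        tw.WriteStartElement("Root");
        WriteInChunks(tw, data, 4);   // a naive 4-char piece would end on the high surrogate
        tw.WriteEndElement();
        tw.Close();

        Console.WriteLine(output.ToString());
    }
}
```

With a chunk size of 4, the first write is shortened to "abc" and the surrogate pair travels together in the second write. This sketch assumes the input itself contains only complete pairs; a lone high surrogate at the very end of the data would still cause the writer to throw.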

See Also

Well-Formed XML Creation with the XmlTextWriter | XML Output Formatting with XmlTextWriter | Namespace Features within the XmlTextWriter | Customized XML Writer Creation | XmlTextWriter Class | XmlTextWriter Members | XmlWriter Class | XmlWriter Members