Share via


2.7.4.2 Conceptual Overview of a Huffman Tree

To demonstrate how Huffman encoding is used, this section provides a conceptual overview. The field names uiDecodeBits and encodeArray come from the XM_TYPE_STRING hash data dictionary (section 2.3.2.1.2).

Assume that a data column containing two strings representing gender exists. The strings are "Female" and "Male" and appear to use the ASCII character set. Internally, they actually use Unicode because Unicode contains the ASCII character set. The strings are stored in a dictionary and are Huffman encoded. The dictionary page that contains the encoded strings has the character set (a value of zero, which indicates English) and indicates that the strings are using single character set Huffman mode. The page also indicates that the value of the lookup table decode bits (uiDecodeBits) is 3, which means that the characters in the strings were most likely encoded with 3-bit codewords and 2-bit codewords (see section 2.7.4.1.2). Single-bit codewords are not allowed.

The dictionary also has an encoded array (encodeArray) of the Huffman alphabet used. The indexes of this array correspond to the Huffman alphabet symbols (byte symbols), and the content of each indexed element indicates the codeword length that is used for that symbol (byte) in the encoded string. The symbols themselves are bytes, not characters, and vary in value from 0 through 255, which is also why the size of the Huffman alphabet array is fixed at 256 elements.

In this conceptual overview, almost all the array elements are equal to zero (that is, the codeword size equals zero), which means that those byte symbols are not used, except for the elements at the following indexes:

  • The element at index 70 contains a value of 3.

  • The element at index 77 contains a value of 3.

  • The element at index 97 contains a value of 3.

  • The element at index 101 contains a value of 2.

  • The element at index 108 contains a value of 2.

  • The element at index 109 contains a value of 3.

Note that the preceding index values (70, 77, 97, 101, 108, and 109) correspond to the ASCII numeric character values for the letters, ‘F’, ‘M’, ‘a’, ‘e’, ‘l’, and ‘m’.

Because the strings are using English (ASCII), this behavior is expected because a one-to-one relationship exists between the byte value and the character value in ASCII (or in Unicode using this character set). This would not be the case if the character set used needed more than one byte per character.

The dictionary also has the encoded strings as a bit stream. Note that this is a bit stream, not a byte stream. In this case, the encoded bit stream will be decoded as the string "FemaleMale", which will then be a continuous byte (not bit) stream of decoded characters. Remember that the character set upper byte will be added back to each of these bytes. The bit stream itself is 25 bits in length with the following sequence ‘1000011111001001011100100’.

The following diagram shows a classic Huffman tree that is created by using the frequency of occurrence for each character. Frequencies (from the original string stream "FemaleMale") are shown in the diagram.

However, this tree also shows that symbols with 2-bit codewords are at one level (the top level that is allowed), symbols with 3-bit codewords are at the next level down, and so on through all the codeword values. In this case, only 2-bit and 3-bit codeword levels exist. The symbol byte values (101 and 108 on the 2-bit level, and the next four symbol byte values on the 3-bit level) are placed according to a classic Huffman implementation, where the lowest value goes to the leftmost node, the next-highest value goes to the right of that node, and so on across the level.

Huffman tree example

Figure 1: Huffman tree example

In the preceding diagram, walking down the tree—either left (0) or right (1) to each leaf node—shows how the codewords for each character are generated. Therefore, after constructing a Huffman tree, the resulting codewords can be used to decode the bit stream that contains the encoded strings. The bit stream is purely a continuous stream of bits. Errors could result if this bit stream is treated as a series of integers, words, or other types because doing so could induce format errors.