2.7.4 Huffman Compression

2022-11-15

Huffman compression is the only type of compression that is used when strings are stored and compressed in a dictionary file. String data is not stored in the column data storage files but in a dictionary file—specifically, in an XM_TYPE_STRING dictionary file. (For more information about XM_TYPE_STRING dictionaries, see section 2.3.2.1.2. For more information about the per-page string store information in particular, see section 2.3.2.1.2.4.) The strings might be compressed. BLOBs are also stored in XM_TYPE_STRING dictionaries after they have been encoded by means of base64 encoding. Therefore, stored BLOBs are treated as single character set strings and can be successfully compressed by using Huffman compression, as well.

If any of the strings are compressed, the compression used is always classic Huffman encoding. Although Huffman encoding has many forms, the Huffman tree and algorithm—especially of classic Huffman encoding—is well known and will not be discussed in detail here.

In short, in Huffman encoding, the algorithm determines the probability that a particular symbol from the assigned Huffman alphabet will be encountered in the stream of symbols being encoded. Using this statistical knowledge of the frequencies of different symbols in the alphabet being used, a binary tree of 2 × N – 1 nodes can be constructed, where N represents the number of symbols used in the text being encoded. (Note that N can be the size of the Huffman alphabet or less than the size of the Huffman alphabet being used.) Symbols that have the greatest frequency of occurrence in the text to be encoded are assigned higher positions in the binary tree than symbols that have a lower frequency of occurrence. The binary tree that is constructed does not need to be a balanced binary tree.

Share via

2.7.4 Huffman Compression

Additional resources