2.4.1 Compression and Decompression

To preserve space, VBA uses data compression on a contiguous sequence of records on various streams. The data compression technique is run length encoding.

The compression algorithm repeatedly reads 4096 bytes from the decompressed buffer into an array. Each group of 4096 bytes is called a chunk. The compression algorithm writes each 4096 byte chunk in an encoded and compressed format. Each output chunk is preceded by a two byte header which denotes the number of bytes in the chunk and the format of the chunk.

The compression algorithm searches for series of bytes that are repeated within the chunk. When series with multiple occurrences are found, the bytes in the first occurrence are encoded as literal tokens and the remaining occurrences are encoded as copy tokens which reference the first occurrence. The encoding for a repeated series of bytes is two bytes in length, thus matches of three bytes or more are required for encoding to be beneficial. Tokens are organized into groups of eight called a Token Sequence, which includes a flag byte. The flag byte is written in advance of the eight tokens. Each bit in the flag byte is used to identify the type of one of the token.

If the compression algorithm fails in producing enough copy tokens to compensate for the space overhead of the copy tokens and the flag bytes, the 4096 byte input chunk is written to the output chunk without any encoding.

The decompression algorithm reads one compressed chunk at a time. Each compressed chunk is decoded into 4096 bytes of uncompressed data which is written to output. For each chunk, the size and format style are extracted from the chunk header. The chunk is then read and decoded according to the format specified in the header.

When the chunk header format specifies that the chunk contains no copy tokens, the 4096 remaining bytes are copied to output. When the chunk header format specifies that copy tokens exist in the chunk, the Token Sequences are decoded. Literal tokens are copied to output. Copy tokens are decoded to find the first occurrence of the byte sequence the copy token represents which is then copied to output.

The pseudocode and record specifications for Compression and Decompression use the following conventions.

  • LEFT SHIFT: Bits in the operand are moved from the least significant to the most significant positions. High order bits are truncated. Low order bits become zero.

  • RIGHT SHIFT: Bits in the operand are moved from the most significant position to the least significant positions. Low order bits are truncated. High order bits become zero.

  • A literal bit sequence is denoted with the initial characters 0b. For example, the literal constant 0xB721 would appear as the binary literal 0b1011011100100001.