2.1.1.1.1 LZ77 Compression Algorithm

The LZ77 Compression Algorithm is used to analyze input data and determine how to reduce the size of that input data by replacing redundant information with metadata. Sections of the data that are identical to sections of the data that have been encoded are replaced by a small amount of metadata that indicates how to expand those sections again. The encoding algorithm is used to take that combination of data and metadata and serialize it into a stream of bytes that can later be decoded and decompressed.

Compression Algorithm Terminology

The following terms are associated with the compression algorithm. Some of the terms also apply to the DIRECT2 Encoding Algorithm defined in section 2.1.1.1.2.

input stream: The sequence of bytes to be compressed.

byte: The basic data element in the input stream.

coding position: The position of the byte in the input stream that is currently being coded (the beginning of the lookahead buffer).

lookahead buffer: The byte sequence from the coding position to the end of the input stream.

window: A buffer of size W that indicates the number of bytes from the coding position backward. The window is empty at the beginning of the compression, then grows to size W as the input stream is processed. Once it reaches size W, it then "slides" along with the coding position.

pointer: Information about the starting offset of the match in the window (referred to as "B" in the example later in this section) and its length (referred to as "L" in the example later in this section). The starting offset is expressed as the count of bytes from the coding position backwards into the window. The length is the number of bytes to read forward from the starting offset.

The length expressed by the pointer can be longer than the starting offset. This indicates that the match repeats by returning back to the starting offset after reaching the coding position.

A null pointer indicates no match and is expressed as a starting offset of 0 and a length of 0.

match: The string that is used to find a match of the byte sequence between the lookahead buffer and the window.

Using the Compression Algorithm

To use the LZ77 Compression Algorithm:

  1. Set the coding position to the beginning of the input stream.

  2. Find the longest match in the window for the lookahead buffer.

  3. If a match is found, output the pointer P. Move the coding position (and the window) L bytes forward.

  4. If a match is not found, output a null pointer and the first byte in the lookahead buffer. Move the coding position (and the window) one byte forward.

  5. If the lookahead buffer is not empty, return to step 2.

Compression Process

The compression algorithm searches the window for the longest match with the beginning of the lookahead buffer and then outputs a pointer to that match. Because even a 1-byte match might not be found, the output cannot only contain pointers. The compression algorithm solves this problem by outputting after the pointer the first byte in the lookahead buffer after the match. If no match is found, the algorithm outputs a null-pointer and the byte at the coding position.

Compression Process Example

The following table shows the input stream that is used for this compression example. The bytes in the input, "AABCBBABC", occupy the first nine positions of the stream.

Input stream

 Position    1    2    3    4    5    6    7    8    9
 Byte        A    A    B    C    B    B    A    B    C

The following table shows the output from the compression process. The table includes the following columns:

Step: Indicates the number of the encoding step. A step in the table finishes every time that the encoding algorithm makes an output. With the compression algorithm, this process happens in each pass through step 3.

Position: Indicates the coding position. The first byte in the input stream has the coding position 1.

Match: Shows the longest match found in the window.

Byte: Shows the first byte in the lookahead buffer after the match.

Output: Presents the output in the format (B,L)C, where (B,L) is the pointer (P) to the match. This gives the following instructions to the decoder: Go back B bytes in the window and copy L bytes to the output. C is the explicit byte.

Note One or more pointers might be included before the explicit byte that is shown in the Byte column. That is, a metadata pointer does not always need to be followed by an explicit byte. An input stream of "ABCABCCCC", for example, can be represented as "(0,0)A(0,0)B(0,0)C(3,3)(1,3)" using the (B,L)C notation, with the last two elements being pointers without explicit bytes. The compressed output can be any combination of pointers and explicit bytes.

Compression process output

Step

Position

Match

Byte

Output

1.

1

--

A

(0,0)A

2.

2

A

--

(1,1)

3.

3

--

B

(0,0)B

4.

4

--

C

(0,0)C

5.

5

B

--

(2,1)

6.

6

B

--

(1,1)

7.

7

A B C

--

(5,3)

The result of compression, conceptually, is the output column—that is, a series of bytes and optional metadata that indicates whether that byte is preceded by some sequence of bytes that is already in the output.

Because representing the metadata itself requires bytes in the output stream, it is inefficient to represent a single byte that has previously been encoded by two bytes of metadata (offset and length). The overhead of the metadata bytes equals or exceeds the cost of outputting the bytes directly. Therefore, the protocol considers sequences of bytes to only be a match if the sequences have three or more bytes in common.

Decompression Process

The decompression algorithm processes the compressed stream from start to end. For each null pointer, it appends the associated byte directly to the end of the output stream. For each non-null pointer, it reads back to the specified offset from the current end of the output stream and appends the specified number of bytes to the end of the output stream.

Decompression Process Example

The input stream for this example is the output of the compression example above.

The following table shows the construction of the output stream as it is built from the sequence of pointers in the input stream. The table includes the following columns:

Step: Indicates the number of the decoding step. A step in the table finishes every time the decoding algorithm appends the set of bytes identified by the pointer to the output stream.

Input Pointer: The next pointer from the input stream.

Append Bytes: The bytes that the pointer identifies to be appended to the output stream.

Output Stream: The output stream as it looks at the end of each step.

Step

Input Pointer

Append Bytes

Output Stream

1.

(0,0)A

A

A

2.

(1,1)

A

A A

3.

(0,0)B

B

A A B

4.

(0,0)C

C

A A B C

5.

(2,1)

B

A A B C B

6.

(1,1)

B

A A B C B B

7.

(5,3)

ABC

A A B C B B A B C