2.4.1 Zip Files

.ZIP analysis is used when the first 4 bytes of the file match the local file header signature, as specified in [PKWARE-Zip] section V, subsection A, and the file is a valid .ZIP file as specified in [PKWARE-Zip]. If the file does not comply with the .ZIP format, RDC analysis, as specified in section 2.4.2, or the simple chunking method, as specified in section 2.4.3, is used instead.
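The signature test can be sketched as follows. This is a minimal illustration, not a full validity check: the function name is hypothetical, and the code only compares the leading 4 bytes against the well-known local file header signature 0x04034b50, which is stored little-endian as the bytes "PK\x03\x04".

```python
# Local file header signature 0x04034b50, as it appears on disk (little-endian).
LOCAL_FILE_HEADER_SIGNATURE = b"PK\x03\x04"


def starts_with_local_file_header(data: bytes) -> bool:
    """Return True if the first 4 bytes match the local file header signature.

    Hypothetical helper: passing this test is necessary but not sufficient,
    because the file must also be a valid ZIP file.
    """
    return data[:4] == LOCAL_FILE_HEADER_SIGNATURE
```

A file that fails this test would fall through to RDC analysis or simple chunking.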

.ZIP files are split into chunks based on the information in each local file header, as specified in [PKWARE-Zip]. Analyzing a local file header produces file chunk boundaries at three positions, each as specified in [PKWARE-Zip]: the start of the local file header, the start of the data file, and the end of the data file. Each .ZIP item therefore yields two file chunks: the local file header chunk and the data file chunk.
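As a rough sketch of how the three boundaries could be derived, the following code unpacks the fixed 30-byte portion of a local file header and adds the file name and extra field lengths. The function name is hypothetical. The sketch assumes the sizes are present in the header itself, so it ignores Zip64 extra fields and entries whose sizes are deferred to a data descriptor.

```python
import struct

# Fixed 30-byte part of a local file header, little-endian:
# signature, version, flags, method, time, date, crc-32,
# compressed size, uncompressed size, file name length, extra field length.
_LFH = struct.Struct("<4sHHHHHIIIHH")


def item_boundaries(data: bytes, offset: int) -> tuple[int, int, int]:
    """Return (header_start, data_start, data_end) for the item at offset."""
    fields = _LFH.unpack_from(data, offset)
    if fields[0] != b"PK\x03\x04":
        raise ValueError("no local file header at this offset")
    comp_size, name_len, extra_len = fields[7], fields[9], fields[10]
    data_start = offset + _LFH.size + name_len + extra_len
    return offset, data_start, data_start + comp_size
```

The local file header chunk then covers header_start up to data_start, and the data file chunk covers data_start up to data_end.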

The signature for the local file header chunk has the structure that is specified in the following diagram.<4>


[Bit-field diagram, 32 bits per row (bits 0 to 31): Local File Header Hash (20 bytes)]

Local File Header Hash: A 20-byte sequence that specifies the SHA-1 hash code of the file bytes represented by the local file header chunk.
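A minimal sketch of this signature, using Python's standard hashlib module, is shown below. The function and parameter names are hypothetical; the offsets are assumed to be the header start and data start boundaries of one .ZIP item.

```python
import hashlib


def local_file_header_signature(file_bytes: bytes, header_start: int, data_start: int) -> bytes:
    """Return the 20-byte SHA-1 digest of the local file header chunk bytes."""
    return hashlib.sha1(file_bytes[header_start:data_start]).digest()
```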

The signature for the data file chunk has the structure that is specified in the following diagram.<5>


[Bit-field diagram, 32 bits per row (bits 0 to 31): CRC (4 bytes), Compressed Size (8 bytes), Uncompressed Size (8 bytes)]

CRC (4 bytes): An unsigned 32-bit integer that specifies the value of the local file header crc-32 field, as specified in [PKWARE-Zip].<6>

Compressed Size (8 bytes): An unsigned 64-bit integer that specifies the size, in bytes, of the data file chunk. It MUST be the value of the local file header compressed size field, as specified in [PKWARE-Zip], unless the local file header extra field, as specified in [PKWARE-Zip], includes a Zip64 Extended Information Extra Field, as specified in [PKWARE-Zip], in which case it MUST be the value of the compressed size field in the Zip64 Extended Information Extra Field, as specified in [PKWARE-Zip].

Uncompressed Size (8 bytes): An unsigned 64-bit integer that specifies the size, in bytes, of the uncompressed data represented by the bytes of the data file chunk. It MUST be the value of the local file header uncompressed size field, as specified in [PKWARE-Zip], unless the local file header extra field, as specified in [PKWARE-Zip], includes a Zip64 Extended Information Extra Field, as specified in [PKWARE-Zip], in which case it MUST be the value of the uncompressed size field in the Zip64 Extended Information Extra Field, as specified in [PKWARE-Zip].
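The following sketch packs the three fields into a 20-byte signature and prefers the Zip64 values when the extra field carries them. Several details are assumptions rather than statements from this section: little-endian byte order for the packed fields, the helper names, and the Zip64 extra field layout (header ID 0x0001, then the 8-byte uncompressed size followed by the 8-byte compressed size), which follows the usual reading of [PKWARE-Zip].

```python
import struct

ZIP64_EXTRA_ID = 0x0001  # Zip64 Extended Information Extra Field header ID


def zip64_sizes(extra: bytes):
    """Return (uncompressed_size, compressed_size) from a Zip64 extra field, or None."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        body = extra[pos + 4 : pos + 4 + size]
        if header_id == ZIP64_EXTRA_ID and len(body) >= 16:
            return struct.unpack_from("<QQ", body, 0)
        pos += 4 + size
    return None


def data_file_signature(crc: int, comp_size: int, uncomp_size: int, extra: bytes = b"") -> bytes:
    """Pack CRC (4 bytes), Compressed Size (8), Uncompressed Size (8)."""
    z64 = zip64_sizes(extra)
    if z64 is not None:
        uncomp_size, comp_size = z64  # Zip64 values override the header fields
    # Little-endian packing is an assumption of this sketch.
    return struct.pack("<IQQ", crc, comp_size, uncomp_size)
```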

If the combined size, in bytes, of the local file header chunk and the data file chunk is less than or equal to 4,096, a single chunk is produced with a signature that is the local file header chunk signature followed by the data file chunk signature. For protocol clients and servers with VersionNumberType, as specified in [MS-FSSHTTP] section 2.2.5.13, greater than or equal to 2 and MinorVersionNumberType, as specified in [MS-FSSHTTP] section 2.2.5.10, greater than or equal to 2, the signature for the single chunk is a bitwise exclusive OR of the signature bytes of the local file header chunk and the data file chunk. If the signatures are not of equal length, the extra bytes of the longer signature are appended to the end of the exclusive ORed bytes. 
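The two ways of forming the single-chunk signature can be sketched as below. The function name and the use_xor flag are hypothetical; the caller is assumed to have already compared the protocol version numbers and to have checked that the combined chunk size does not exceed 4,096 bytes.

```python
SMALL_ITEM_LIMIT = 4096  # combined header + data size threshold, in bytes


def combine_signatures(header_sig: bytes, data_sig: bytes, use_xor: bool) -> bytes:
    """Build the signature of a merged local file header + data file chunk."""
    if not use_xor:
        # Older versions: header signature followed by data signature.
        return header_sig + data_sig
    # Newer versions: XOR byte by byte over the common length...
    xored = bytes(a ^ b for a, b in zip(header_sig, data_sig))
    # ...then append the leftover bytes of the longer signature, if any.
    longer = header_sig if len(header_sig) > len(data_sig) else data_sig
    return xored + longer[len(xored):]
```

When both inputs are 20 bytes long, the XOR form yields a 20-byte signature and nothing is appended.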

The analysis of chunks into local file header and data file chunks continues at the file location after the current data file until one of the following conditions occurs:

  • The extent of the data file, as specified in the local file header, would extend past the end of the file.

  • A sequence other than a local file header signature, as specified in [PKWARE-Zip], is found.

If the analysis of ZIP local headers terminates without creating any chunks, the .ZIP analysis MUST NOT be used.

After the analysis of local file headers terminates, the remaining bytes in the file are represented by a final chunk.
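The scan described above can be sketched as a loop that walks local file headers until one of the stop conditions is hit and then emits the final chunk. This is a simplified illustration under stated assumptions: the function name is hypothetical, sizes are read from the header only (no Zip64 or data descriptor handling), and the merging of items of 4,096 bytes or less into a single chunk is left out.

```python
import struct

_LFH = struct.Struct("<4sHHHHHIIIHH")  # fixed 30-byte local file header part


def zip_chunks(data: bytes):
    """Return a list of (start, end) chunks, or None if .ZIP analysis cannot be used."""
    chunks = []
    pos = 0
    while pos + _LFH.size <= len(data):
        fields = _LFH.unpack_from(data, pos)
        if fields[0] != b"PK\x03\x04":
            break  # something other than a local file header signature
        comp_size, name_len, extra_len = fields[7], fields[9], fields[10]
        data_start = pos + _LFH.size + name_len + extra_len
        data_end = data_start + comp_size
        if data_end > len(data):
            break  # data file would extend past the end of the file
        chunks.append((pos, data_start))       # local file header chunk
        chunks.append((data_start, data_end))  # data file chunk
        pos = data_end
    if not chunks:
        return None  # no chunks created: fall back to another method
    if pos < len(data):
        chunks.append((pos, len(data)))  # final chunk with the remaining bytes
    return chunks
```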

If the total size, in bytes, of the final chunk is less than or equal to 1 megabyte, the signature for the final chunk has the structure that is shown in the following diagram.


[Bit-field diagram, 32 bits per row (bits 0 to 31): Small Final Chunk Signature (20 bytes)]

Small Final Chunk Signature: A 20-byte sequence that specifies the SHA-1 hash code of the file bytes represented by the final chunk.

If the total size, in bytes, of the final chunk is greater than 1 megabyte, the signature for the final chunk has the structure that is shown in the following diagram.


[Bit-field diagram, 32 bits per row (bits 0 to 31): Large Final Chunk Signature (12 bytes)]

Large Final Chunk Signature: A 12-byte sequence that specifies the chunk signature.<7> This sequence of bytes MUST be unique.
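The size-based choice between the two final chunk signatures can be sketched as follows. Two points are assumptions of this sketch and not taken from this section: 1 megabyte is treated as 1,048,576 bytes, and the 12-byte signature is filled with random bytes as a stand-in, since how the unique sequence is generated is not described here.

```python
import hashlib
import os

ONE_MEGABYTE = 1024 * 1024  # assumption: binary megabyte


def final_chunk_signature(chunk: bytes) -> bytes:
    """Return a 20-byte SHA-1 for small final chunks, else a 12-byte signature."""
    if len(chunk) <= ONE_MEGABYTE:
        return hashlib.sha1(chunk).digest()
    # Stand-in only: random bytes approximate the uniqueness requirement.
    return os.urandom(12)
```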

For each chunk in the chunk list, a Leaf Node Object, as specified in section 2.2.3, is created. The Data Size of the Leaf Node Object MUST be the total number of bytes represented by the chunk. The Signature Data of the Leaf Node Object MUST be the chunk’s signature. The Leaf Node is referenced by its parent node.

If the number of .ZIP file bytes represented by a chunk is greater than 3 megabytes, a list of subchunks is generated. Each subchunk represents a sequential chunk of the .ZIP file data. The size of each subchunk is at most 3 megabytes. All but the last subchunk SHOULD be 3 megabytes in size.<8> The total size of all the subchunks MUST equal the Data Size of the parent Intermediate Node Object.
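The subchunk split can be illustrated with a short sketch. The function name is hypothetical, and 3 megabytes is assumed to mean 3 × 1,048,576 bytes; the limit is a parameter so that the behavior is easy to try with small numbers.

```python
THREE_MEGABYTES = 3 * 1024 * 1024  # assumption: binary megabytes


def subchunk_ranges(start: int, size: int, limit: int = THREE_MEGABYTES):
    """Split a chunk of `size` bytes at `start` into sequential (start, end) subchunks.

    Returns an empty list when the chunk is not larger than the limit.
    """
    if size <= limit:
        return []
    end = start + size
    # Every subchunk is `limit` bytes long except possibly the last one.
    return [(off, min(off + limit, end)) for off in range(start, end, limit)]
```

For example, a 10-byte chunk with a limit of 4 splits into ranges of 4, 4, and 2 bytes, whose sizes add up to the original 10.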

The signature for these subchunks has the structure that is shown in the following diagram.


[Bit-field diagram, 32 bits per row (bits 0 to 31): Sub Chunk Signature (8 bytes)]

Sub Chunk Signature: An 8-byte sequence that specifies the subchunk signature.<9> This sequence of bytes MUST be unique.

For each subchunk, a Leaf Node Object, as specified in section 2.2.3, is created. The parent Intermediate Node Object of the subchunk MUST have its Object Reference Array include one Object ID entry for each subchunk, and these Object ID entries MUST be ordered based on the sequential .ZIP file bytes represented by each subchunk.

For every Leaf Node Object that has a Data Size less than or equal to 1 megabyte, a Data Node Object MUST be created. The Object Data of the Data Node Object MUST be the byte sequence from the .ZIP file tracked by the chunk. The Object Reference Array and the Cell Reference Array of the Data Node Object MUST be empty. The Object Reference Array of the Leaf Node Object associated with this Data Node Object MUST have a single entry, which MUST be the Object ID of the Data Node Object.
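The relationship between a Leaf Node Object and its Data Node Object can be modeled roughly as below. The classes, field names, and integer object IDs are hypothetical simplifications for illustration and do not reflect the wire format; 1 megabyte is assumed to be 1,048,576 bytes.

```python
from dataclasses import dataclass, field

ONE_MEGABYTE = 1024 * 1024  # assumption: binary megabyte


@dataclass
class DataNode:
    object_id: int
    object_data: bytes
    object_references: list = field(default_factory=list)  # MUST stay empty
    cell_references: list = field(default_factory=list)    # MUST stay empty


@dataclass
class LeafNode:
    object_id: int
    data_size: int
    signature: bytes
    object_references: list = field(default_factory=list)


def attach_data_node(leaf: LeafNode, chunk_bytes: bytes, new_id: int):
    """Create the Data Node for a small leaf and reference it from the leaf."""
    if leaf.data_size > ONE_MEGABYTE:
        return None  # larger leaves get no Data Node in this sketch
    node = DataNode(object_id=new_id, object_data=chunk_bytes)
    leaf.object_references = [node.object_id]  # exactly one entry
    return node
```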