2.7.1 List Document Set

The list document set scheme is efficient when operations must iterate over document identifiers. In this scheme, the WID file contains a header and stores the document identifiers as a list. The following is a high-level representation of the file format.


Header (4096 bytes)

Array of DocIDs (variable)

Header (4096 bytes): The header is in the following format.


Type of scheme (4 bytes)
Bdate (4 bytes)
Flag (4 bytes)
Outdated DocIDs (4 bytes)
Reserved1 (4 bytes)
Number of Hint Pages (4 bytes)
Hint page size (4 bytes)
Number of DocIDs (4 bytes)
Minimum DocID Value (4 bytes)
Maximum DocID Value (4 bytes)
Number of DocIDs Delta (4 bytes)
Reserved2 (2004 bytes)
Hint Array (variable)
Reserved3 (variable)

Type of scheme (4 bytes): A 32-bit unsigned integer. Value MUST be 0x00000001.

Bdate (4 bytes): A 32-bit unsigned integer assigned during the creation of the file which is used to indicate order of file creation. The larger the number, the more recent the file.

Flag (4 bytes): A 32-bit unsigned integer. The most significant bit of this integer MUST be set to zero if all instances of the items in the file are outdated in all older files (that is, all files with a lower Bdate field). Otherwise, the most significant bit of the integer MUST be set to 1. Other bits MUST be ignored.

Outdated DocIDs (4 bytes): A 32-bit unsigned integer which represents the count of outdated document identifiers (1) in the file. This integer is used for estimation purposes, to choose an efficient representation format for document identifiers (1) during further merges. This value SHOULD be within 10% of the correct value. If it is not within this range, performance could be affected.

Reserved1 (4 bytes): The value of these 4 bytes is arbitrary, and MUST be ignored.

Number of Hint Pages (4 bytes): A 32-bit unsigned integer which determines how many entries are in the Hint Array. The value MUST NOT exceed 512. A value of zero indicates there are not enough document identifiers (1) to warrant this optimization.

Hint page size (4 bytes): A 32-bit unsigned integer which is the number of document identifiers (1) in each hint page. The size of the last hint page is variable. A value of zero indicates there are not enough document identifiers (1) to warrant this optimization.

Number of DocIDs (4 bytes): A 32-bit unsigned integer which is the total number of document identifiers (1) stored in the file.

Minimum DocID Value, Maximum DocID Value (4 bytes each): Two 32-bit unsigned integers. They are recorded at the time of file creation, are not updated afterward, and are used to check the density of the list of document identifiers.

Number of DocIDs Delta (4 bytes): A 32-bit unsigned integer which is the number of outdated DocIDs at the moment of file creation.

Reserved2 (2004 bytes): The value of these 2,004 bytes is arbitrary, and MUST be ignored.

Hint Array (variable): An array of 32-bit integers which contains the first document identifier of every hint page. The most significant bit of each document identifier in the array MUST be set to 1 if any of the document identifiers on the corresponding hint page are outdated.

Hint pages are a structural concept used to organize document identifiers (1) in the file. The array of document identifiers (1) stored in the file is split into hint pages. The first document identifier (1) of each page is used as a marker for the entire hint page.
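The hint page mechanism can be illustrated with a minimal Python sketch. It is not taken from the specification: the function names are hypothetical, little-endian byte order is an assumption, and so is the reading that the lower 31 bits of each Hint Array entry carry the identifier value. The offset 2048 is derived by summing the sizes of the fields that precede the Hint Array (eleven 4-byte fields plus the 2004-byte Reserved2).

```python
import struct
from bisect import bisect_right

HINT_ARRAY_OFFSET = 11 * 4 + 2004  # = 2048, derived from the preceding field sizes
OUTDATED_BIT = 0x80000000          # most significant bit of a 32-bit entry
ID_MASK = 0x7FFFFFFF               # assumption: remaining 31 bits hold the DocID


def decode_hint_array(header: bytes, hint_page_count: int):
    """Return (first_doc_id, page_has_outdated) for each hint page."""
    # "<" (little-endian) is an assumption, not stated in the text above.
    values = struct.unpack_from(f"<{hint_page_count}I", header, HINT_ARRAY_OFFSET)
    return [(v & ID_MASK, bool(v & OUTDATED_BIT)) for v in values]


def find_hint_page(hints, doc_id: int):
    """Index of the page whose marker is the largest one <= doc_id, or None."""
    firsts = [first for first, _ in hints]
    index = bisect_right(firsts, doc_id) - 1
    return index if index >= 0 else None


def page_bounds(index: int, hint_page_size: int, docid_count: int):
    """Half-open range of positions in the Array of DocIDs covered by a page.

    Assumes every page except the last holds exactly hint_page_size entries.
    """
    start = index * hint_page_size
    return start, min(start + hint_page_size, docid_count)
```

Because the markers are the first identifiers of consecutive pages in a sorted array, a binary search over the markers narrows a lookup to one page before the Array of DocIDs itself is touched.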

Reserved3 (variable): The value of this field is arbitrary, and MUST be ignored.
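As an illustration of the header layout described above, the following Python sketch reads the eleven fixed 32-bit fields and applies the two constraints stated in the text (scheme type 0x00000001, at most 512 hint pages). The class and function names are hypothetical, and little-endian byte order is an assumption that the text above does not confirm.

```python
import struct
from typing import NamedTuple

HEADER_SIZE = 4096
# Eleven consecutive 32-bit unsigned integers at the start of the header.
# "<" (little-endian) is an assumption.
FIXED_FIELDS = struct.Struct("<11I")


class ListDocSetHeader(NamedTuple):
    scheme_type: int
    bdate: int
    flag: int
    outdated_docids: int
    reserved1: int          # arbitrary value, ignored
    hint_page_count: int
    hint_page_size: int
    docid_count: int
    min_docid: int
    max_docid: int
    docid_count_delta: int


def parse_header(data: bytes) -> ListDocSetHeader:
    """Parse the fixed fields of a list document set header."""
    if len(data) < HEADER_SIZE:
        raise ValueError("header must be 4096 bytes")
    header = ListDocSetHeader(*FIXED_FIELDS.unpack_from(data, 0))
    if header.scheme_type != 0x00000001:
        raise ValueError("not a list document set file")
    if header.hint_page_count > 512:
        raise ValueError("Number of Hint Pages exceeds 512")
    return header


def flag_msb_set(header: ListDocSetHeader) -> bool:
    """Report the most significant bit of Flag; the other bits are ignored."""
    return bool(header.flag & 0x80000000)
```

The reserved fields are read but never validated, since their values are arbitrary.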

Array of DocIDs (variable): An array of 32-bit integers which holds the list of document identifiers (1), sorted by increasing value. Each document identifier (1) has a size of 4 bytes. The most significant bit is set to 1 if the item is outdated, and set to 0 if the item is fresh.
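A short Python sketch of reading this array follows. It assumes the array starts immediately after the 4096-byte header, that values are little-endian, and that the lower 31 bits of each entry carry the identifier (so the sort order applies to those bits); none of these points is stated explicitly above, and the function names are hypothetical.

```python
import struct

HEADER_SIZE = 4096
OUTDATED_BIT = 0x80000000  # most significant bit marks an outdated item
ID_MASK = 0x7FFFFFFF       # assumption: remaining 31 bits hold the DocID


def decode_docids(data: bytes, docid_count: int, offset: int = HEADER_SIZE):
    """Return (doc_id, is_outdated) pairs from the Array of DocIDs."""
    # "<" (little-endian) is an assumption.
    values = struct.unpack_from(f"<{docid_count}I", data, offset)
    return [(v & ID_MASK, bool(v & OUTDATED_BIT)) for v in values]


def fresh_docids(entries):
    """Keep only identifiers whose outdated bit is clear."""
    return [doc_id for doc_id, outdated in entries if not outdated]
```

The count passed as docid_count would come from the Number of DocIDs header field.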