2.3.1 ContentIndexRecord

A content index record encodes a content index key and a list of integers representing document identifiers. The document identifiers MUST be stored in increasing order as an incremental change from the previous document identifier. There MUST be no duplicates. For each document identifier, the position of all instances of the term associated with the content index key in the corresponding property of the item pointed to by the property identifier MUST be recorded in a list of occurrences. For content index records with a large number of document identifiers, an extra list of document identifiers is stored as necessary. This list MUST contain a subset of document identifiers for the current content index record that has the highest rank value for the current content index key / property identifier pair.

The content index key MUST be encoded as an incremental change from the previous content index key value in the content index file. Prefix Length MUST be equal to the number of leading bytes that the current content index key shares with the previous content index key. Suffix Length MUST be equal to the number of bytes that are different and that follow directly after the prefix bytes. For the first content index record in a content index file, Prefix Length MUST be zero. The total length of the current content index key MUST be equal to Prefix Length + Suffix Length.
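The prefix/suffix scheme above can be illustrated with a minimal Python sketch. The function names encode_key and decode_key are hypothetical and are not part of the format; the sketch also assumes that a writer always uses the longest shared prefix, which the text above does not state explicitly.

```python
from typing import Tuple


def decode_key(previous_key: bytes, prefix_length: int, suffix: bytes) -> bytes:
    """Rebuild a key from the previous key, Prefix Length, and the suffix bytes."""
    if prefix_length > len(previous_key):
        raise ValueError("Prefix Length exceeds the length of the previous key")
    # Total key length is Prefix Length + Suffix Length.
    return previous_key[:prefix_length] + suffix


def encode_key(previous_key: bytes, key: bytes) -> Tuple[int, bytes]:
    """Return (Prefix Length, suffix bytes) for key relative to previous_key.

    Assumption: the longest common prefix is used.
    """
    prefix_length = 0
    limit = min(len(previous_key), len(key))
    while prefix_length < limit and previous_key[prefix_length] == key[prefix_length]:
        prefix_length += 1
    return prefix_length, key[prefix_length:]
```

For the first record in a file, previous_key would be empty, so Prefix Length comes out as zero and the suffix carries the whole key.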

The content index record format is defined in the following table. Each field is present unless specified otherwise.

Name                   Size       Type
---------------------  ---------  ------------------------------
Link                   20 bits    BitStream field
Prefix/Suffix Length   Variable   PrefixSuffixCompress
SuffixValue            Variable   Suffixbyte[suffix length]
Pid                    Variable   PidCompress
DocIDCount             Variable   DocIDCountCompress
IsSBRIPresent          1 bit      BitStream field
SBRIOffset             32 bits    BitStream field
AverageDocIDbitcount   5 bits     BitStream field
logCDocIDs             5 bits     BitStream field
SkipsPage<3>           32 bits    BitStream field
SkipsOffset<4>         32 bits    BitStream field
IsCIXLinkPresent<5>    1 bit      BitStream field
CIXPage<6>             32 bits    BitStream field
CIXOffset<7>           32 bits    BitStream field
ContentDocIDsData      Variable   ContentDocIDData[DocIDCount]
Padding_dword_align    Variable   BitStream field
SBRIData               Variable   SBRIData[n]
DocIDSkipCount<8>      Variable   BitCompress(9)
DocIDSkipsData<9>      Variable   DocIDSkipData[DocIDSkipCount]
AllItems               Variable   AllItems

Content index record fields:

Link (20 bits): Stores the size of the content index record in bits. The field value MUST be zero if the size of the content index record is greater than 2^20 bits or if the current record is the max key.

Prefix/Suffix Length (variable): Contains Prefix Length and Suffix Length. The sum of these two values MUST NOT exceed 129. Prefix Length MUST NOT exceed the sum of Prefix Length and Suffix Length for the previous content index record. Prefix Length MUST be zero for the first content index record in the content index file.

SuffixValue (variable): MUST contain suffix length bytes. Each byte MUST be read as a BitStream field (size 8 bits) from BitStream; these are the modified bytes from the previous content index key.

Pid (variable): MUST contain the value of the property identifier associated with the content index key.

DocIDCount (variable): MUST contain the total count of document identifiers in the content index key. MUST NOT be present if the current index key is the max key.

IsSBRIPresent (1 bit, optional):

  • MUST NOT be set if log2(DocIDCount) * 1024 >= DocIDCount.

  • MUST NOT be present if the format version is 0x54.

  • MUST NOT be set if the current content index record contains the EOF key.

  • MUST be set only if SBRIData is present for the content index record.

  • MUST NOT be present if the current index key is the max key.

  • MUST NOT be set if the current content index record contains the BOF key.<10>
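The count-based rule in the first bullet can be sketched as follows. This is an illustration only: the function names are hypothetical, and the sketch assumes that log2 is rounded down to an integer, which this section does not specify.

```python
def sbri_entry_count(doc_id_count: int) -> int:
    """Return log2(DocIDCount) * 1024, assuming log2 is rounded down."""
    if doc_id_count < 1:
        raise ValueError("DocIDCount must be positive")
    return (doc_id_count.bit_length() - 1) * 1024


def sbri_allowed_by_count(doc_id_count: int) -> bool:
    """True if the count-based rule does not forbid setting IsSBRIPresent.

    The other rules (format version, EOF/BOF/max key) are not checked here.
    """
    return sbri_entry_count(doc_id_count) < doc_id_count
```

Under this assumption, a record with 1024 document identifiers cannot carry SBRIData, because 10 * 1024 is not less than 1024.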

SBRIOffset (32 bits, optional): Number of DWORDs (see [MS-DTYP]) to skip in BitStream from the beginning of this field to the position in the BitStream at the beginning of the SBRIData field. SBRIOffset MUST NOT be present if the IsSBRIPresent field bit is not set. BitStream MUST be aligned up to the nearest DWORD before reading the SBRIData field.

  • MUST NOT be present if the format version is 0x54.

  • MUST NOT be present if the current content index key is the max key.

AverageDocIDbitcount (5 bits): Defines the average number of bits to use for document identifier (1) storage. MUST NOT be present if the current index key is the max key.

logCDocIDs (5 bits, optional): Parameter that defines the frequency of the DocID skips and how many bits each DocID skip takes. No DocID skips are used for current content index record if the logCDocIDs field is zero.

  • MUST NOT be present if the current index key is the max key.

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

SkipsPage (32 bits, optional):<11>

  • 32-bit number of the page in the current content index file that contains the beginning of the DocID skips data for the current content index record.

  • MUST NOT be present if the logCDocIDs field equals zero.

  • MUST NOT be present if the format version is less than 0x54.

  • MUST NOT be present if the current index key is the max key.

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

SkipsOffset (32 bits, optional):<12>

  • 32-bit value of the offset on a page in the current content index file that contains the beginning of the DocID skips data for the current content index record.

  • MUST NOT be present if the logCDocIDs field equals zero.

  • MUST NOT be present if the format version is less than 0x54.

  • MUST NOT be present if the current index key is the max key.

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

IsCIXLinkPresent (1 bit, optional):<13> If this bit is set, this content index record MUST contain a link to the document identifier information in the corresponding .cix file. MUST NOT be present if the format version is 0x52.

  • MUST NOT be present if the current index key is the max key.

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

CIXPage (32 bits, optional):<14>

  • MUST NOT be present if the IsCIXLinkPresent field bit is not set.

  • MUST NOT be present if the format version is 0x52.

  • MUST contain the 32-bit value of a page in the CIX file that contains the beginning of the index extension data for the current content index record.

  • If the CIXPage field equals 0xffffffff, the CIX link is not valid and index extension information is not available for the current content index record.

  • MUST NOT be present if the current index key is the max key.

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

CIXOffset (32 bits, optional):<15>

  • MUST NOT be present if the IsCIXLinkPresent field bit is not set.

  • MUST NOT be present if the format version is 0x52.

  • MUST contain the 32-bit value of the offset on a page in the CIX file that contains the beginning of the index extension data for the current content index record.

  • MUST NOT be present if the current index key is the max key.

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

ContentDocIDsData (Variable, optional): Stores document identifiers for the given content index key. Contains DocIDCount ContentDocIDData records numbered from zero to DocIDCount - 1.

  • MUST NOT be present if the current index key is the max key.

  • MUST NOT be present if Pid equals 0x7ffeFFC9.

Padding_dword_align (variable):

  • A variable length field to align the next field on 32-bit boundary.

  • The value of this field is arbitrary, and MUST be ignored.

  • MUST NOT be present if the format version is 0x54.

  • MUST NOT be present if the IsSBRIPresent field is not set.

  • MUST NOT be present if the current index key is the max key.

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

SBRIData (variable):

  • This field MUST store the highest-ranked document identifiers sorted in ascending order by document identifier. The document identifier rank MUST be calculated as follows:

    • fRank = 0.05*cOcc/ (0.25 + ( 0.75 * maxoccur / AvdlThisPid ))

    • where cOcc is the total number of occurrences of the current search query term in an item for the current property identifier, maxoccur is a Max Occurrence, as defined in the MaxOccBuckets table, as specified in section 2.1.2, and AvdlThisPid is a cAvgOcc field, as defined in section 2.8.1, for the current property identifier.

  • The SBRIData field MUST contain (log2(DocIDCount) * 1024) document identifiers with the maximum fRank of all document identifiers for this content index record.

  • MUST NOT be present if the current index key is the max key.

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.
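The fRank formula and the selection of the highest-ranked document identifiers can be sketched as shown below. The function names and the dictionary input are hypothetical; tie-breaking between equal fRank values is not described in this section, so the sketch simply relies on the sort order of Python's sorted function.

```python
from typing import Dict, List


def f_rank(c_occ: int, max_occur: float, avdl_this_pid: float) -> float:
    """fRank = 0.05*cOcc / (0.25 + (0.75 * maxoccur / AvdlThisPid))."""
    return 0.05 * c_occ / (0.25 + (0.75 * max_occur / avdl_this_pid))


def select_sbri_doc_ids(ranks: Dict[int, float], limit: int) -> List[int]:
    """Pick the `limit` document identifiers with the highest fRank.

    The result is sorted in ascending order by document identifier,
    as required for SBRIData. `ranks` maps document identifier to fRank.
    """
    top = sorted(ranks, key=lambda doc_id: ranks[doc_id], reverse=True)[:limit]
    return sorted(top)
```

Here limit stands for log2(DocIDCount) * 1024.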

DocIDSkipCount (Variable, optional)<16> (BitCompress(9)):

  • Number of DocIDSkipData records for the current content index record.

  • MUST NOT be present if the logCDocIDs field equals zero.

  • MUST NOT be present if the format version is less than 0x54.

  • MUST NOT be present if the current index key is the max key.

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

DocIDSkipsData (Variable, optional):<17>

  • MUST NOT be present if the logCDocIDs field equals zero.

  • MUST NOT be present if the format version is less than 0x54.

  • Stores DocID skips for the current content index record. Contains DocIDSkipCount DocIDSkipData records numbered from zero to DocIDSkipCount - 1. Each DocIDSkipData record defines the relative position of the document identifier in the ContentDocIDsData[DocIDCount] structure.

  • MUST NOT be present if the current index key is the max key.

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.
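SkipsPage, SkipsOffset, DocIDSkipCount, and DocIDSkipsData share the same set of "MUST NOT be present" rules. The following sketch collects them into one predicate. The function and parameter names are hypothetical, and the sketch assumes that a field is present whenever none of its listed exclusions applies.

```python
# Property identifiers that exclude the skip fields, per the rules above.
SPECIAL_PIDS = {0x7FFEFFC8, 0x7FFEFFC9}


def skip_fields_present(format_version: int, log_c_doc_ids: int,
                        pid: int, is_max_key: bool) -> bool:
    """True if SkipsPage, SkipsOffset, DocIDSkipCount, and DocIDSkipsData
    are expected in the record."""
    if is_max_key or pid in SPECIAL_PIDS:
        return False
    if format_version < 0x54:
        return False
    # No DocID skips are used when logCDocIDs is zero.
    return log_c_doc_ids != 0
```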

AllItems (Variable, optional):

  • Contains a set of all document identifiers that are present in content index records with the same key but different property identifiers.

  • MUST NOT be present if Pid is not equal to 0x7ffeFFC9.

  • MUST NOT be present if the current content index record contains the EOF key.

  • MUST NOT be present if the current content index record contains the BOF key.

The AllItems structure is defined by the following table:

Name             Size                  Type
---------------  --------------------  ---------------
Version          4 bits                BitStream field
DocIDMask        256 bits              BitStream field
DocIdBitmapSize  32 bits               BitStream field
Padding          Variable              BitStream field
DocIdBitmap      DocIdBitmapSize bits  BitStream field

AllItems fields:

Version (4 bits): MUST be zero.

DocIDMask (256 bits): Each bit is numbered from zero to 255. The bit at position N MUST be set if there exists an item that is stored in the current content index record with a document identifier that has the low-order byte equal to N.

DocIdBitmapSize (32 bits):

  • Contains the total number of bits in the DocIdBitmap field.

  • MUST be equal to ((MaxBitMapId/256) * DocIdMaskDelta[256] + MaxBitMapIdDelta + 2). MaxBitMapId is the maximum value for the document identifier for the items stored in the current content index record. MaxBitMapIdDelta is the number of set bits in DocIDMask at positions less than the low-order byte of MaxBitMapId. DocIdMaskDelta[256] is the number of set bits in DocIDMask. The result of the division is rounded down before multiplication.

Padding (Variable, optional):

  • A variable length field to align the next field on a 32-bit boundary.

  • The value of this field MUST be ignored.

DocIdBitmap (Variable, optional):

  • MUST contain DocIdBitmapSize bits.

  • For each item stored in the current content index record, a bit at position ((DocId/256) * DocIdMaskDelta[256] + DocIdMaskDelta[N] + 1) MUST be set. DocId is the document identifier for the item. N equals the low-order byte of DocId. DocIdMaskDelta[N] is the number of set bits in DocIDMask at positions less than N. DocIdMaskDelta[256] is the number of set bits in DocIDMask. The result of the division is rounded down before multiplication.

  • All other bits MUST NOT be set.
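The DocIDMask, DocIdBitmapSize, and DocIdBitmap rules can be worked through with a short sketch. The function name and return shape are hypothetical, and the set bits are returned as a set of positions rather than as a packed bit field, because the bit order inside BitStream is outside the scope of this illustration.

```python
from typing import Iterable, List, Set, Tuple


def build_all_items_bitmap(doc_ids: Iterable[int]) -> Tuple[List[bool], int, Set[int]]:
    """Return (DocIDMask as 256 booleans, DocIdBitmapSize, set bit positions)."""
    ids = sorted(set(doc_ids))
    if not ids:
        raise ValueError("at least one document identifier is required")

    # DocIDMask: bit N is set if some document identifier has low-order byte N.
    mask = [False] * 256
    for doc_id in ids:
        mask[doc_id & 0xFF] = True

    # mask_delta[n] = number of set mask bits at positions less than n;
    # mask_delta[256] = total number of set mask bits.
    mask_delta = [0] * 257
    for n in range(256):
        mask_delta[n + 1] = mask_delta[n] + (1 if mask[n] else 0)

    max_id = ids[-1]
    size = (max_id // 256) * mask_delta[256] + mask_delta[max_id & 0xFF] + 2
    positions = {
        (doc_id // 256) * mask_delta[256] + mask_delta[doc_id & 0xFF] + 1
        for doc_id in ids
    }
    return mask, size, positions
```

For document identifiers 1, 3, and 259, the mask has bits 1 and 3 set, the bitmap size is 5, and the set positions are 1, 2, and 4.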

A ContentDocIDData[n] record is defined by the following table, where n is from zero to (DocIDCount - 1).

Name                 Size                          Type
-------------------  ----------------------------  ------------------------------------
DocIDSkipbits        logCDocIDs + 6 bits           BitStream field
DocIDSkip            log2(DocIDMax) bits           BitStream field
DocIDDelta           Variable                      BitCompress(AverageDocIDbitcount + 1)
MaxDocIDOccBucket    7 bits                        BitStream field
AllPropertyRank      12 bits                       BitStream field
OccCount             Variable                      BitCompress(3)
OccSkip              9 + log2(OccCount/16) bits    BitStream field
Padding_dword_align  Variable                      BitStream field
OccsDelta            Variable                      BitCompress(7)[OccCount]

ContentDocIDData[n] fields:

DocIDSkipbits (logCDocIDs + 6 bits, optional):

  • MUST NOT be present if the format version is 0x54.

  • The field MUST NOT be present if n is not a multiple of logCDocIDs * 4 or the logCDocIDs field is zero.

  • The field MUST be zero if DocIDCount <= n + logCDocIDs * 4.

  • The field contains the number of bits from the beginning of the current record ContentDocIDsData[n] to the record ContentDocIDsData[n + logCDocIDs * 4].

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

DocIDSkip (log2(DocIDMax) bits, optional):

  • MUST NOT be present if the format version is 0x54.

  • The field MUST NOT be present if n is not a multiple of logCDocIDs * 4 or logCDocIDs is zero.

  • The field MUST be zero if DocIDCount <= n + logCDocIDs * 4.

  • The field contains a document identifier that is stored in the ContentDocIDsData[n + logCDocIDs * 4] record. DocIDMax is a global parameter for the content index file.

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

DocIDDelta (Variable): MUST store the incremental value between the previous and current document identifiers. If the current document identifier is the first in ContentDocIDsData, the actual document identifier MUST be stored. The value returned by BitCompress(AverageDocIDbitcount + 1) MUST be incremented by 1 before it is used as DocIDDelta.
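As an illustration of the delta scheme, the following sketch turns the raw values returned by BitCompress into document identifiers. The function name is hypothetical. The sketch assumes that the increment by 1 also applies to the first value, so that the first DocIDDelta is the document identifier itself.

```python
from typing import List


def decode_doc_ids(raw_values: List[int]) -> List[int]:
    """Convert raw BitCompress outputs into absolute document identifiers."""
    doc_ids = []
    previous = 0
    for raw in raw_values:
        delta = raw + 1          # the stored value is DocIDDelta - 1
        previous += delta        # first entry: previous is 0, so delta is the identifier
        doc_ids.append(previous)
    return doc_ids
```

Because every delta is at least 1, the decoded identifiers are strictly increasing, which matches the requirement that there be no duplicates.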

MaxDocIDOccBucket (7 bits, optional):

  • MUST NOT be present if the current content index record contains the EOF key.

  • MaxDocIDOccBucket MUST be the MaxOccBucket for a document identifier and property identifier.

  • MUST NOT be present if the current content index record contains the BOF key.<18>

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

AllPropertyRank (12 bits, optional):

  • Contains a 12 bit unsigned integer that defines the relative rank of an item with the current document identifier for the term defined by the key in the current content index record.

  • MUST NOT be present if Pid is not equal to 0x7ffeFFC8.

  • MUST NOT be present if the current content index record contains the EOF key.

  • MUST NOT be present if the current content index record contains the BOF key.<19>

OccCount (Variable, optional):

  • Stores the number of occurrences for the current document identifier.

  • MUST NOT be present if the current content index record contains the EOF key.

  • OccCount is assumed to be equal to "1" in all other references in this section if the current content index record contains the EOF key.

  • MUST NOT be present if the current content index record contains the BOF key.<20>

  • OccCount is assumed to be equal to "1" in all other references in this section if the current content index record contains the BOF key.<21>

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

OccSkip (9 + log2 (OccCount /16) bits, optional):

  • Field MUST NOT be present if OccCount < 8.

  • MUST store the sum of the sizes, in bits, of Padding_dword_align and OccsDelta[OccCount].

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

Padding_dword_align (Variable, optional):

  • A variable-sized field to align OccsDelta[OccCount] on a 32-bit boundary.

  • The value of this field is arbitrary, and MUST be ignored.

  • MUST NOT be present if OccCount < 8.

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

OccsDelta (Variable, optional):

  • MUST store OccCount values encoded as a BitCompress(7), as specified in section 2.2.2.1.

  • If the current index key is not an EOF key, OccsDelta MUST contain occurrences in the current item. The first value is equal to the first occurrence minus 1. Each subsequent value is equal to the difference between the current and the previous occurrence minus 1.

  • If the current index key is not a BOF key, OccsDelta MUST contain occurrences in the current item. The first value is equal to the first occurrence minus 1. Each subsequent value is equal to the difference between the current and the previous occurrence minus 1.<22>

  • MUST NOT be present if Pid equals 0x7ffeFFC8 or 0x7ffeFFC9.

Example:

Value_1 = Occurrence_1 - 1

Value_2 = Occurrence_2 - Occurrence_1 - 1

Value_3 = Occurrence_3 - Occurrence_2 - 1

  • If the current index key is an EOF key and the property identifier is NOT 0x7FFEFFFF, OccsDelta MUST contain the maximum occurrence value for the current property identifier.

  • If the current index key is an EOF key and the property identifier is 0x7FFEFFFF, OccsDelta MUST contain the sum of the maximum occurrence values for all property identifiers.

  • If the current index key is a BOF key and the property identifier is NOT 0x7FFEFFFF, OccsDelta MUST contain the maximum occurrence value for the current property identifier.<23>

  • If the current index key is a BOF key and the property identifier is 0x7FFEFFFF, OccsDelta MUST contain the sum of the maximum occurrence values for all property identifiers.<24>
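For keys that are neither EOF nor BOF keys, the OccsDelta encoding shown in the example above can be sketched as a pair of helper functions. The names are hypothetical, and the sketch assumes that occurrence positions start at 1 and are strictly increasing, which is what the "minus 1" rule implies for non-negative stored values.

```python
from typing import List


def encode_occurrences(occurrences: List[int]) -> List[int]:
    """Value_1 = Occurrence_1 - 1; Value_i = Occurrence_i - Occurrence_(i-1) - 1."""
    values = []
    previous = 0
    for occurrence in occurrences:
        if occurrence <= previous:
            raise ValueError("occurrences must be positive and strictly increasing")
        values.append(occurrence - previous - 1)
        previous = occurrence
    return values


def decode_occurrences(values: List[int]) -> List[int]:
    """Inverse of encode_occurrences."""
    occurrences = []
    previous = 0
    for value in values:
        previous += value + 1
        occurrences.append(previous)
    return occurrences
```

For example, occurrences 1, 4, and 5 would be stored as the values 0, 2, and 0.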

An SBRIData[n] record is defined by the following table, where n is from zero to (log2(DocIDCount) * 1024 - 1).

Name        Size      Type
----------  --------  ------------------------------------------------------
DocIDDelta  Variable  BitCompress(log2(DocIDMax / (log2(DocIDCount) * 1024)))
Rank        12 bits   BitStream field

SBRIData[n] fields:

DocIDDelta (Variable): MUST store the incremental value between the current document identifier and the previous document identifier. If the current document identifier is the first in SBRIData, the actual value MUST be stored. The value returned by BitCompress(7) MUST be incremented by 1 before it is used as DocIDDelta.

Rank (12 bits):

  • Contains 12 bits of ranking information.

  • If fRank for the current document identifier is >=1, the value of Rank MUST be equal to:

    • Min(0x7ff, (log(1.0 + ( fRank - 1.0 ) * dResolutionAdjust) / dLnDivider)) + 0x0fff

  • If fRank for the current document identifier is < 1, the value of Rank MUST be equal to:

    • Min(0x7ff,(log(1.0 + ( 1.0/fRank - 1.0 ) * dResolutionAdjust) / dLnDivider))

    • where dResolutionAdjust = 26612.566117305021291272917047288 and dLnDivider = 0.0099503308531680828482153575442607.
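The Min(0x7ff, ...) term that both Rank cases share can be sketched as follows. The function name is hypothetical. The sketch assumes that log denotes the natural logarithm (suggested by the name dLnDivider) and that the quotient is truncated to an integer; neither is stated in this section. The constant that is added when fRank >= 1 is deliberately left out.

```python
import math

D_RESOLUTION_ADJUST = 26612.566117305021291272917047288
D_LN_DIVIDER = 0.0099503308531680828482153575442607


def rank_magnitude(f_rank: float) -> int:
    """Return Min(0x7ff, log(1.0 + (x - 1.0) * dResolutionAdjust) / dLnDivider),
    where x is fRank if fRank >= 1 and 1.0/fRank otherwise."""
    if f_rank <= 0:
        raise ValueError("fRank must be positive")
    ratio = f_rank if f_rank >= 1.0 else 1.0 / f_rank
    scaled = math.log(1.0 + (ratio - 1.0) * D_RESOLUTION_ADJUST) / D_LN_DIVIDER
    # Assumption: truncate to an integer before clamping to 11 bits.
    return min(0x7FF, int(scaled))
```

With these assumptions, fRank values of 2.0 and 0.5 yield the same magnitude, and the two cases are distinguished only by the constant applied for fRank >= 1.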

A DocIDSkipData[n] record is defined by the following table, where n is from zero to DocIDSkipCount - 1.

Name                  Size                        Type
--------------------  --------------------------  -------------------------------------------------------------
DocIDDelta            Variable                    BitCompress(log2(logCDocIDs * 4) + AverageDocIDbitcount + 2)
DocIDSkipOffsetDelta  Variable                    BitCompress(min(logCDocIDs + 6, 32))
IsDefaultDocIDSkip    1 bit                       BitStream field
DocIdSkip             log2(logCDocIDs * 4) bits   BitStream field

DocIDSkipData[n] fields:

DocIDDelta (Variable):

  • Contains incremental value between the document identifier for the previous DocIDSkipData and the current one.

  • The value returned by BitCompress, as specified in section 2.2.2.1, MUST be incremented by 1 before it is used as DocIDDelta.

  • MUST contain the actual document identifier if n equals zero.

  • Document identifier MUST be present in one of ContentDocIDData records in the current content index record.

DocIDSkipOffsetDelta (Variable):

  • If n is greater than zero, the field MUST contain the number of bits from the beginning of the ContentDocIDsData[m] record to the beginning of record ContentDocIDsData[k], where ContentDocIDsData[m] stores the document identifier equal to the document identifier stored in DocIDSkipData [n - 1] and ContentDocIDsData[k] stores document identifier equal to the document identifier stored in DocIDSkipData [n].

  • If n equals zero, the field MUST contain the number of bits from the beginning of the ContentDocIDsData[0] record to the beginning of record ContentDocIDsData[k], where ContentDocIDsData[k] stores the document identifier equal to the document identifier stored in DocIDSkipData [n].

IsDefaultDocIDSkip (1 bit):

  • If n is greater than zero, the field MUST be "1" if k - m equals logCDocIDs * 4, where ContentDocIDsData[m] stores the document identifier equal to the document identifier stored in DocIDSkipData [n - 1] and ContentDocIDsData[k] stores the document identifier equal to the document identifier stored in DocIDSkipData [n].

  • If n equals zero, the field MUST be "1" if k equals logCDocIDs * 4, where ContentDocIDsData[k] stores index data for the document identifier equal to the document identifier stored in DocIDSkipData [n].

  • The field MUST be zero in all other cases.

DocIdSkip (log2(logCDocIDs * 4) bits, optional):

  • The field MUST NOT be present if IsDefaultDocIDSkip is "1".

  • If n is greater than zero, the field MUST contain the value k - m, where ContentDocIDsData[m] stores index data for the document identifier equal to the document identifier stored in DocIDSkipData[n - 1] and ContentDocIDsData[k] stores the document identifier equal to the document identifier stored in DocIDSkipData[n].

  • If n equals zero, the field MUST contain k, where ContentDocIDsData[k] stores document identifier equal to the document identifier stored in DocIDSkipData [n].
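The relationship between IsDefaultDocIDSkip, DocIdSkip, and the ContentDocIDsData record index k can be sketched as shown below. The function name and the tuple representation of a DocIDSkipData record are hypothetical; only the two fields that determine the record index are modeled.

```python
from typing import List, Optional, Tuple


def resolve_skip_indices(log_c_doc_ids: int,
                         skips: List[Tuple[bool, Optional[int]]]) -> List[int]:
    """Return, for each DocIDSkipData[n], the index k of the ContentDocIDsData
    record it points to.

    Each entry of `skips` is (IsDefaultDocIDSkip, DocIdSkip); DocIdSkip is
    None when IsDefaultDocIDSkip is set, because the field is then absent.
    """
    default_step = log_c_doc_ids * 4
    indices = []
    k = 0  # for n = 0 the step is measured from ContentDocIDsData[0]
    for is_default, doc_id_skip in skips:
        if is_default:
            step = default_step
        else:
            if doc_id_skip is None:
                raise ValueError("DocIdSkip is required when IsDefaultDocIDSkip is 0")
            step = doc_id_skip
        k += step
        indices.append(k)
    return indices
```

With logCDocIDs equal to 2, the default step is 8 records, so a default skip, a DocIdSkip of 5, and another default skip resolve to record indexes 8, 13, and 21.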