Understanding Ranking
Full-Text Search in Microsoft SQL Server can generate a score (or rank value) that indicates the relevance of the data returned. This per-row rank value can be thought of as an ordering criteria, which can be used to sort the result set by relevance.
Note
The rank value holds relevance only for a given query and does not hold any significance across queries.
Statistics for Ranking
When an index is built, statistics are collected for use in ranking. The process of building a full-text catalog does not directly result in a single index structure. Instead, the Microsoft Full-Text Engine for SQL Server (MSFTESQL) service creates intermediate indexes as data is indexed. MSFTESQL then merges these indexes into a larger index as needed. This process can be repeated many times. MSFTESQL then conducts a "master merge" that combines all of the intermediate indexes into one large master index.
Statistics are collected at each intermediate index level. The statistics are merged when the indexes are merged. Some statistical values can only be generated during the master merging process.
While computing the ranking for a query result set, some of the statistics used could come from smaller intermediate indexes that satisfy the query and some from the master index. This depends on whether intermediate indexes have been merged or not. As a result, the ranking statistics can vary in accuracy if the intermediate indexes have not been merged. This explains why the same query can return different rank results over time as full-text indexed data is added, modified, and deleted, and as the smaller indexes are merged.
To minimize the size of the index and computational complexity, the statistics are often rounded.
The list below includes some commonly used terms and statistical values that are important in calculating rank:
- Property
A full-text indexed column of the row.
- Document
The entity that is returned in queries. In SQL Server this corresponds to a row. A document can have multiple properties, just as a row can have multiple full-text indexed columns.
- Index
A single inverted index of one or more documents. This may be entirely in memory or on disk. Many query statistics are relative to the individual index where the match occurred.
- Full-Text Catalog
A collection of intermediate indexes treated as one entity for queries. Catalogs are the unit of organization visible to the SQL Server administrator.
- Word, token or item
The unit of matching in the full-text engine. Streams of text from documents are tokenized into words, or tokens by language-specific word breakers.
- Occurrence
The word offset in a document property as determined by the word breaker. The first word is at occurrence 1, the next at 2, and so on. In order to avoid false positives in phrase and proximity queries, end-of-sentence and end-of-paragraph introduce larger occurrence gaps.
- Catalog Key
Combination of the word and the property that contains the word.
- HitCount
The number times the key value occurs in a row.
- IndexedRowCount
Total number of rows indexed. This is computed, based on counts maintained in the intermediate indexes. This number can vary in accuracy.
- KeyRowCount
Total number of rows in the full-text catalog that contain a given key.
- MaxOccurrence
The largest occurrence stored in a full-text catalog for a given property in a row.
- MaxQueryRank
The maximum rank, 1000, returned by the MSFTESQL.
Rank Computation Issues
The process of computing rank, depends on a number of factors. Different language word breakers tokenize text differently. For example, the string "dog-house" could be broken into "dog" "house" by one word breaker and into "dog-house" by another. This means that matching and ranking will vary based on the language specified, because not only are the words different, but so is the document length. The document length difference can affect ranking for all queries.
Statistics such as IndexRowCount can vary widely. For example, if a catalog has 2 billion rows in the master index, then one new document is indexed into an in-memory intermediate index, and ranks for that document based on the number of documents in the in-memory index could be skewed compared with ranks for documents from the master index. For this reason, it is recommended that after any population that results in large number of rows being indexed or re-indexed the indexes be merged into a master index using the ALTER FULLTEXT CATALOG ... REORGANIZE DDL statement. MSFTESQL will also automatically merge the indexes based on parameters such as the number and size of intermediate indexes.
MaxOccurrence values are normalized into 1 of 32 ranges. This means, for example, that a document 50 words long is treated the same as a document 100 words long. Below is the table used for normalization. Because the document lengths are in the range between adjacent table values 32 and 128, they are effectively treated as having the same length, 128 (32 < docLength <= 128).
{ 16, 32, 128, 256, 512, 725, 1024, 1450, 2048, 2896, 4096, 5792, 8192, 11585,
16384, 23170, 28000, 32768, 39554, 46340, 55938, 65536, 92681, 131072, 185363,
262144, 370727, 524288, 741455, 1048576, 2097152, 4194304 };
In cases where top_n_by_rank is used with the new precomputed rank option, some additional rounding may be done while computing rank, so there may be some difference in those values based on whether or not the precomputed rank option is enabled.
Ranking of CONTAINSTABLE
StatisticalWeight = Log2( ( 2 + IndexedRowCount ) / KeyRowCount )
Rank = min( MaxQueryRank, HitCount * 16 * StatisticalWeight / MaxOccurrence )
Phrase matches are ranked just like individual keys except that KeyRowCount (the number of rows containing the phrase) is estimated and can be inaccurate and higher than the actual number.
Ranking of ISABOUT
ISABOUT is a vector-space query in traditional information retrieval terminology. The default ranking algorithm used is Jaccard, a widely known formula. The ranking is computed for each term in the query and then combined as described below.
ContainsRank = same formula used for CONTAINSTABLE ranking of a single term (above).
Weight = the weight specified in the query for each term. Default weight is 1.
WeightedSum = Σ[key=1 to n] ContainsRankKey * WeightKey
Rank = ( MaxQueryRank * WeightedSum ) / ( ( Σ[key=1 to n] ContainsRankKey2 )
+ ( Σ[key=1 to n] WeightKey2 ) - ( WeightedSum ) )
Ranking of FREETEXT
FREETEXT ranking is based on the OKAPI BM25 ranking formula. FREETEXT queries will add words to the query via inflectional generation (inflected forms of the original query words); these words are treated as separate words with no special relationship to the words from which they were generated. Synonyms generated from the Thesaurus feature are treated as separate, equally weighted terms. Each word in the query contributes to the rank.
Rank = Σ[Terms in Query] w ( ( ( k1 + 1 ) tf ) / ( K + tf ) ) * ( ( k3 + 1 ) qtf / ( k3 + qtf ) ) )
Where:
w is the Robertson-Sparck Jones weight.
In simplified form, w is defined as:
w = log10 ( ( ( r + 0.5 ) * ( N – R + r + 0.5 ) ) / ( ( R – r + 0.5 ) * ( n – r + 0.5 ) )
N is the number of indexed rows for the property being queried.
n is the number of rows containing the word.
K is ( k1 * ( ( 1 – b ) + ( b * dl / avdl ) ) ).
dl is the property length, in word occurrences.
avdl is the average length of the property being queried, in word occurrences.
k1, b, and k3 are the constants 1.2, 0.75, and 8.0, respectively.
tf is the frequency of the word in the queried property in a specific row.
qtf is the frequency of the term in the query.
See Also
Other Resources
CONTAINSTABLE (Transact-SQL)
FREETEXTTABLE (Transact-SQL)