Linguistic and Unicode Considerations

2018-05-31

Note

Indexing Service is no longer supported as of Windows XP and is unavailable for use as of Windows 8. Instead, use Windows Search for client-side search and Microsoft Search Server Express for server-side search.

This section lists linguistic and Unicode considerations that might affect word breaker and stemmer implementation. The list is not exhaustive.

This section includes the following topics:

Surface Form Normalization describes surface forms that word breakers and stemmers may normalize to.
Phrase Identification describes how word breakers identify phrases in text.
Agglutinative Languages describes stemming considerations for agglutinative languages.
Numbers and Dates describes how word breakers and stemmers handle numbers and dates.
Compound Words describes how word breakers and stemmers handle compound words.
Compound Phrases describes how word breakers and stemmers handle compound phrases.
Special Characters and Words describes how word breakers and stemmers handle special words and characters.
Acronyms and Abbreviations describes how word breakers and stemmers handle acronyms and abbreviations.
Capitalization describes how word breakers and stemmers handle capitalization.
Nonbreaking Spaces describes how word breakers handle nonbreaking spaces.
Surrogate Pairs describes Unicode surrogate pairs and how they are used to extend the Unicode character set to accommodate different character sets.
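
As a brief illustration of the surrogate pair topic listed above, the following sketch shows the standard UTF-16 arithmetic that maps a supplementary code point (U+10000 through U+10FFFF) to a high and low surrogate and back. It is not taken from the Indexing Service documentation; the function names are hypothetical and chosen only for this example. A word breaker that scans UTF-16 text would need to treat such a pair as a single character rather than split between the two code units.

```python
from typing import Tuple

# Hypothetical helper names; the arithmetic follows the UTF-16 encoding form.

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary range")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(f"U+1F600 -> {high:04X} {low:04X}")
    print(f"round trip -> U+{from_surrogate_pair(high, low):X}")
```

The range checks reject values that are not part of a valid pair, which is the same test a word breaker could apply before deciding where a character boundary falls.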
