CJK (Chinese, Japanese and Korean) specific tokenization tasks by using Windows PowerShell (FAST Search Server 2010 for SharePoint)

03/09/2011

Updated: February 10, 2011

In FAST Search Server 2010 for SharePoint you can influence the default tokenization using two methods: linguistic tokenization and substring tokenization.

Linguistic tokenization
Linguistic tokenization means that a string of text is split into individual tokens based on language-specific rules. For East Asian languages, you can influence tokenization by creating custom dictionaries. If words are missing from the system dictionary provided by FAST Search Server 2010 for SharePoint, for example technical terms, person names or company names, or if the default tokenization is incorrect, you can add words to the custom dictionary to ensure that they are tokenized as required.

Substring tokenization
Substring tokenization is especially useful for applications where recall is very important. Substring tokenization removes all spaces from the text and then splits it into bigrams (overlapping two-character tokens). For example, "アメリカ" (America) is split into: ア, アメ, メリ, リカ (a, ame, meri, rika).
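
As a rough illustration of the bigram splitting described above, the following Python sketch removes whitespace and emits overlapping two-character tokens. It is not the FAST Search Server implementation: the function name `bigrams` is hypothetical, and the sketch emits only two-character tokens, so it does not reproduce the leading single-character token listed in the example above.

```python
from typing import List


def bigrams(text: str) -> List[str]:
    """Return overlapping two-character tokens after removing all whitespace."""
    # Substring tokenization ignores spaces, so drop every whitespace character first.
    compact = "".join(text.split())
    # A string shorter than two characters cannot form a bigram; keep it as one token.
    if len(compact) < 2:
        return [compact] if compact else []
    # Slide a two-character window over the text, one character at a time.
    return [compact[i:i + 2] for i in range(len(compact) - 1)]


print(bigrams("アメリカ"))  # ['アメ', 'メリ', 'リカ']
```

A text of n characters yields n - 1 overlapping tokens, which is one reason a substring-tokenized index grows much larger than a word-based one.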

Substring tokenization increases recall, but reduces precision and greatly increases the size of the index. You should not use substring tokenization if it is more important to keep the index size down than to increase recall. To minimize the drop in precision, you can use a combination of substring tokenization and linguistic tokenization.

Tip:
If you use only substring tokenization, without any linguistic tokenization, precision will be very low. Typically, hits in the linguistically tokenized index are ranked higher than hits in the substring-tokenized index, so a combination of the two tokenization types gives the best precision and recall.
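
The trade-off can be made concrete with a toy example. The Python sketch below is an illustration only, not how FAST Search Server ranks results: the two documents, the hand-written word segmentations, and the scoring weights (2 for a linguistic hit, 1 for a substring hit) are all assumptions chosen for the example. The query 京都 (Kyoto) matches the document 東京都 (Tokyo Metropolis) in the bigram index, a false positive that lowers precision, while the higher weight on linguistic hits keeps the genuine match first.

```python
from typing import Dict, List, Tuple

# Hypothetical documents with hand-written linguistic segmentations (assumed,
# not produced by the FAST Search Server word breakers).
LINGUISTIC_TOKENS: Dict[str, List[str]] = {
    "京都": ["京都"],           # Kyoto
    "東京都": ["東京", "都"],    # Tokyo Metropolis
}

# Assumed weights: a hit in the linguistic index counts more than a substring hit.
LINGUISTIC_WEIGHT = 2
SUBSTRING_WEIGHT = 1


def bigrams(text: str) -> List[str]:
    """Overlapping two-character tokens; texts shorter than two characters yield none."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def search(query: str) -> List[Tuple[str, int]]:
    """Score each document against both toy indexes and sort by score."""
    results = []
    for doc, words in LINGUISTIC_TOKENS.items():
        score = 0
        if query in words:           # hit in the linguistically tokenized index
            score += LINGUISTIC_WEIGHT
        if query in bigrams(doc):    # hit in the substring-tokenized index
            score += SUBSTRING_WEIGHT
        if score:
            results.append((doc, score))
    # Highest score first; sorted() is stable, so ties keep insertion order.
    return sorted(results, key=lambda item: item[1], reverse=True)


print(search("京都"))  # [('京都', 3), ('東京都', 1)]
```

In this toy setup, linguistic tokenization alone would return only the document 京都, and substring tokenization alone would return both documents with the same score.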

In this article:

Create a custom dictionary for East Asian word breakers (linguistic tokenization)
Configure substring tokenization
Configure a combination of linguistic tokenization and substring tokenization

Create a custom dictionary for East Asian word breakers (linguistic tokenization)

Verify that you meet the following minimum requirements: You are a member of the FASTSearchAdministrators local group on the computer where FAST Search Server 2010 for SharePoint is installed.
In a text editor, open a new file.
Enter your custom dictionary words according to the following rules:
- The first line must be #CUSTOMER_WB
- Do not use the # character as the first character of an entry.
- Do not use spaces in entries.
- Use a forward slash (/) to mark the token boundaries in a compound word. The Korean custom dictionary does not support marking entries as compounds.
  
  Examples:
  - English: book/shelf/cover
  - Japanese/Chinese: 朝鲜/民主/主義/人民/共和國
  - Thai: กระทรวง/มหาด/ไทย
- Use the following file names:
  
  Language	File name
  Japanese	custom0011.lex
  Korean	includeKOR.txt
  Thai	custom001e.lex
  Chinese Simplified	custom0804.lex
  Chinese Traditional	custom0404.lex
Save the file in UTF-16 encoding (with a byte order mark (BOM)). Save the custom dictionaries in the <FASTSearchInstallationFolder>\lib\nlg\ folder on each server where FAST Search Server 2010 for SharePoint search or item processing is performed (<FASTSearchInstallationFolder> is the path of the folder where you have installed FAST Search Server 2010 for SharePoint, for example C:\FASTSearch).
On the Start menu, click All Programs.
Click Microsoft FAST Search Server 2010 for SharePoint.
Click Microsoft FAST Search Server 2010 for SharePoint shell.
At the Windows PowerShell command prompt, restart item processing and query processing by typing the following commands:
```
psctrl reset
nctrl restart qrserver
```
Re-crawl all content to enable the customization for the indexed content.
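
The file rules in the procedure above can be checked mechanically before the dictionary is deployed. The following Python sketch is one possible helper and is not part of FAST Search Server: the function name `write_custom_dictionary`, the validation behaviour, and the Windows line endings are assumptions. It enforces the #CUSTOMER_WB header, rejects entries that start with # or contain whitespace (slightly stricter than the "no spaces" rule), and relies on Python's `utf-16` codec, which writes a byte order mark at the start of the file.

```python
import codecs
import os
import tempfile
from typing import Iterable

HEADER = "#CUSTOMER_WB"


def write_custom_dictionary(path: str, entries: Iterable[str]) -> None:
    """Validate entries and write them as a UTF-16 file with a BOM."""
    checked = []
    for entry in entries:
        if not entry:
            raise ValueError("empty entry")
        if entry.startswith("#"):
            raise ValueError(f"entry must not start with '#': {entry!r}")
        if any(ch.isspace() for ch in entry):
            raise ValueError(f"entry must not contain whitespace: {entry!r}")
        checked.append(entry)
    # The "utf-16" codec emits a BOM followed by text in the platform byte order.
    # newline="\r\n" is an assumption: Windows line endings for a Windows server.
    with open(path, "w", encoding="utf-16", newline="\r\n") as handle:
        handle.write(HEADER + "\n")
        for entry in checked:
            handle.write(entry + "\n")


# Example: write a Japanese dictionary to a temporary folder. On a real server
# the target is the lib\nlg folder under the installation folder.
target = os.path.join(tempfile.mkdtemp(), "custom0011.lex")
write_custom_dictionary(target, ["朝鲜/民主/主義/人民/共和國"])
with open(target, "rb") as handle:
    raw = handle.read()
print(raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))  # True
```

Validation runs before the file is opened, so a rejected entry never leaves a partially written dictionary behind.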

Note

Words in the custom dictionaries are processed differently depending on the language.

For Chinese, Japanese and Thai, the words in the custom dictionary are given absolute priority during tokenization. If a word is found in the custom dictionary, the same word in a document is guaranteed to be tokenized as defined in the custom dictionary.

For Korean, the tokenization defined in the custom dictionary only increases the probability of the specific tokenization.

Configure substring tokenization

Use this procedure to disable linguistic tokenization and enable substring tokenization:

Verify that you meet the following minimum requirements: You are a member of the FASTSearchAdministrators local group on the computer where FAST Search Server 2010 for SharePoint is installed.
On the FAST Search Server 2010 for SharePoint administration server, open <FASTSearchInstallationFolder>\components\admin-services\web.config in a text editor (<FASTSearchInstallationFolder> is the path of the folder where you have installed FAST Search Server 2010 for SharePoint, for example C:\FASTSearch). Enable automatic re-indexing of all items by setting the value of AllowIndexPurgeOnSchemaUpdate to yes.
Save the file.
On the Start menu, click All Programs.
Click Microsoft FAST Search Server 2010 for SharePoint.
Click Microsoft FAST Search Server 2010 for SharePoint shell.
At the Windows PowerShell command prompt, type the following command:
```
$mp= Get-FASTSearchMetadataManagedProperty -name <ManagedPropertyName>
```
Where:
- <ManagedPropertyName> is the name of the managed property that you want to configure substring tokenization for, for example title.

View the properties of the managed property:

$mp

>> Name                                   :         Title
>> Description                            :         The title of the document
>> Type                                   :         Text
>> Queryable                              :         True
>> StemmingEnabled                        :         True
>> RefinementEnabled                      :         False
>> MergeCrawledPropertiesAuthorityWeight  :         False
>> SubstringEnabled                       :         False
>> DeleteDisallowed                       :         True
>> MappingDisallowed                      :         False
>> MaxIndexSize                           :         1024
>> MaxResultSize                          :         64
>> DecimalPlaces                          :         3
>> SortableType                           :         SortableDisabled
>> SummaryType                            :         Dynamic

Disable stemming:
```
$mp.StemmingEnabled=0
```
Enable substring:
```
$mp.SubstringEnabled=1
```
Update the managed property:
```
$mp.Update()
```

Verify that the changes were made:

$mp

>> Name                                   :         Title
>> Description                            :         The title of the document
>> Type                                   :         Text
>> Queryable                              :         True
>> StemmingEnabled                        :         False
>> RefinementEnabled                      :         False
>> MergeCrawledPropertiesAuthorityWeight  :         False
>> SubstringEnabled                       :         True
>> DeleteDisallowed                       :         True
>> MappingDisallowed                      :         False
>> MaxIndexSize                           :         1024
>> MaxResultSize                          :         64
>> DecimalPlaces                          :         3
>> SortableType                           :         SortableDisabled
>> SummaryType                            :         Dynamic

It can take several minutes before substring tokenization is enabled after you have run the Update() command. Wait several minutes before you feed any documents.

Re-crawl all content.

Configure a combination of linguistic tokenization and substring tokenization

To combine the two tokenization modes, you must create two full-text indexes and two sets of managed properties. One index/property pair will have substring tokenization enabled (SubstringEnabled=1) and one pair will have linguistic tokenization enabled (StemmingEnabled=1).

If you want the two managed properties to be assigned a different rank in the search results, the two managed properties must be mapped to different full-text indexes.

The following example procedure describes how to apply both substring tokenization and linguistic tokenization to a managed property:

Verify that you meet the following minimum requirements: You are a member of the FASTSearchAdministrators local group on the computer where FAST Search Server 2010 for SharePoint is installed.
On the FAST Search Server 2010 for SharePoint administration server, open <FASTSearchInstallationFolder>\components\admin-services\web.config in a text editor (<FASTSearchInstallationFolder> is the path of the folder where you have installed FAST Search Server 2010 for SharePoint, for example C:\FASTSearch). Enable automatic re-indexing of all items by setting the value of AllowIndexPurgeOnSchemaUpdate to yes.
Save the file.
On the Start menu, click All Programs.
Click Microsoft FAST Search Server 2010 for SharePoint.
Click Microsoft FAST Search Server 2010 for SharePoint shell.

At the Windows PowerShell command prompt, type the following command to view the current full-text index mapping of the managed property:

Get-FASTSearchMetadataFullTextIndexMapping -ManagedProperty (Get-FASTSearchMetadataManagedProperty -name <ManagedProperty>)

>> ImportanceLevel      :  7
>> ManagedProperty      :  Title
>> FullTextIndex        :  content

Where:

<ManagedProperty> is the name of the managed property that you want to configure linguistic and substring tokenization for, for example title.

Create a new managed property that has the same content as the managed property above:
```
$mpSubstring=New-FASTSearchMetadataManagedProperty -name <ManagedProperty>Substring -type 1
```
Where:
- <ManagedProperty> is the name of the managed property whose content you want to copy into the new managed property, for example title.

View the properties of the new managed property:

$mpSubstring

>> Name                                   :         titleSubstring
>> Description                            :         
>> Type                                   :         Text
>> Queryable                              :         True
>> StemmingEnabled                        :         False
>> RefinementEnabled                      :         False
>> MergeCrawledPropertiesAuthorityWeight  :         False
>> SubstringEnabled                       :         False
>> DeleteDisallowed                       :         False
>> MappingDisallowed                      :         False
>> MaxIndexSize                           :         1024
>> MaxResultSize                          :         64
>> DecimalPlaces                          :         3
>> SortableType                           :         SortableDisabled
>> SummaryType                            :         Static

Disable linguistic tokenization for the new managed property:
```
$mpSubstring.StemmingEnabled=0
```
Enable substring tokenization for the new managed property:
```
$mpSubstring.SubstringEnabled=1
```
Update the new managed property:
```
$mpSubstring.update()
```
To ensure that the content of the two managed properties is the same, map all the crawled properties of the first managed property to the newly created managed property:
```
Get-FASTSearchMetadataCrawledPropertyMapping -name <ManagedProperty>| ForEach-Object { New-FASTSearchMetadataCrawledPropertyMapping -managedproperty $mpSubstring -crawledproperty $_}
```
Where:
- <ManagedProperty> is the managed property that you used to create the new managed property, for example title.
Create a new full-text index with lowercase characters only for the name of the full-text index:
```
$mpSubstringIndex= New-FASTSearchMetadataFullTextIndex -Name <NewFullTextIndex> -Description "Title Substring"
```
Where:
- <NewFullTextIndex> is the name of the new full text index, for example titlesubstringindex.

View the new full text index:

$mpSubstringIndex

>> Name                :   titlesubstringindex
>> Description         :   Title Substring
>> StemmingEnabled     :   True
>> isDefault           :   False
>> DeleteDisallowed    :   False

Disable linguistic tokenization for the new full-text index:
```
$mpSubstringIndex.StemmingEnabled=0
```
Update the new full text index:
```
$mpSubstringIndex.Update()
```

Verify that the new full text index was updated:

$mpSubstringIndex

>> Name                :   titlesubstringindex
>> Description         :   Title Substring
>> StemmingEnabled     :   False
>> isDefault           :   False
>> DeleteDisallowed    :   False

Map the newly created substring-enabled managed property to the newly created substring-enabled full-text index:
```
New-FASTSearchMetadataFullTextIndexMapping -ManagedProperty $<NewManagedProperty> -FullTextIndex $<NewFullTextIndex> -ImportanceLevel 1
```
Where:
- $<NewManagedProperty> is the variable that holds the newly created managed property, for example $mpSubstring.
- $<NewFullTextIndex> is the variable that holds the newly created full-text index, for example $mpSubstringIndex.
Create a new rank profile to contain both indexes:
```
$DualRankProfile = New-FASTSearchMetadataRankProfile -Name <NewRankProfile>
```
Where:
- <NewRankProfile> is the name of the new rank profile, for example DualRankProfile.
Determine if your full-text index has a pre-defined rank component in the newly created rank profile:
```
$DualRankProfile.GetFullTextIndexRanks()

>> FullTextIndexReference    :  content
>> ProximityWeight           :  50
>> ContextWeight             :  50
```
In a default installation there is one pre-defined rank component only, for the full-text index named content.
Create an index rank component for each newly created full-text index that does not already have one:
```
$mpSubstringIndex = Get-FASTSearchMetadataFullTextIndex -Name <NewFullTextIndex>
$DualRankProfile.CreateFullTextIndexRankComponent($mpSubstringIndex)
```
Where:
- <NewFullTextIndex> is the name of the full text index you just created, for example titlesubstringindex.

Verify that both full-text indexes have a full-text index rank component in the rank profile:

$DualRankProfile.GetFullTextIndexRanks()

>> FullTextIndexReference    :  content
>> ProximityWeight           :  50
>> ContextWeight             :  50

>> FullTextIndexReference    :  titlesubstringindex
>> ProximityWeight           :  140
>> ContextWeight             :  50

The full-text index that contains the linguistically tokenized words (content) should be ranked higher than the full-text index with substring-tokenized words (titlesubstringindex).

To rank the full-text index that contains the linguistically tokenized words higher than the full-text index with substring-tokenized words, set the context weight of the substring-enabled full-text index lower than that of the linguistically tokenized full-text index:
```
$ranks = $DualRankProfile.GetFullTextIndexRanks()
$ranks|Where-Object -filterscript {$_.FullTextIndexReference.Name -eq "titlesubstringindex"}|ForEach-Object {$_.ContextWeight=<ContextWeight>; $_.Update()}
```
Where:
- <ContextWeight> is the context weight you want the new full text index (titlesubstringindex) to have relative to the default full text index (content), for example 30.

Verify that the context weight was updated:

$DualRankProfile.GetFullTextIndexRanks()

>> FullTextIndexReference    :  content
>> ProximityWeight           :  50
>> ContextWeight             :  50

>> FullTextIndexReference    :  titlesubstringindex
>> ProximityWeight           :  140
>> ContextWeight             :  30

Change history

Date	Description	Reason
February 10, 2011	2011/02/07	Content update
September 16, 2010	Initial publication
