Language Detection and Fallback Language for FAST Search for SharePoint 2010

As items pass through the document processing pipeline, FAST Search Server for SharePoint 2010 recognizes more than 80 languages in common encodings. In general, a document's metadata specifies the language of its text and the encoding used. If the metadata does not provide this information, the language is determined automatically by the language detector stage of the document processing pipeline. This language information is then used to select the appropriate language-specific dictionaries and algorithms during item processing.

In some cases, however, automatic detection cannot determine the language and character set of an incoming document, for example when the document does not contain enough characters to identify the language, or when it contains text in more than one language. To address this, Microsoft released the April 2012 Cumulative Update for FAST Search for SharePoint 2010, which lets you define a fallback language and character set to use when automatic detection of the language and encoding fails.

April 2012 Cumulative Update: https://support.microsoft.com/kb/2598329

By default, if automatic detection is not possible, the fallback language is set to "unknown" and the default character set is set to iso-8859-1. The April 2012 CU makes it possible to set the fallback language manually. When the fallback language is "unknown", you may find that expected results are not returned. For example, suppose a document containing Russian text is indexed with its language set to "unknown", and someone in the FAST Search Center queries for a Russian word that occurs in that document. The query server will not return the expected result, even though the document and its content have already been indexed. The reason is that the query cannot be matched against any document that both contains the search keyword and is labeled as containing Russian text.

As mentioned earlier, with the April 2012 CU you can set the default fallback language yourself when you know which language the content you plan to index in your FAST Search environment is written in.

Follow these steps to set the default fallback language manually:

1) On the FAST Administration node, create a new XML file called LanguageAndEncodingDetectionProperties.xml under %FASTSEARCH%\etc\config_data\DocumentProcessor\ containing the following:

    <properties>
      <language fallback="<YOUR_FALLBACK_LANGUAGE>" />
      <charset fallback="<YOUR_FALLBACK_ENCODING>" />
    </properties>

    For the fallback language, use the language code listed in "Linguistic features per language (FAST Search Server 2010 for SharePoint)". For the fallback character set, use the preferred MIME name for the encoding as listed in "Character Sets".

    Note: You must choose both a fallback language and a fallback character set.

2) Save your configuration file.

3) Once your new configuration is ready, issue the following command on every server hosting document processors:

      psctrl reset

4) Refeed your documents.
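
The configuration file from step 1 can also be generated with a short script instead of being typed by hand. The Python sketch below is only an illustration: the function names `build_fallback_config` and `write_fallback_config` are hypothetical, and the sample values `ru` and `utf-8` are assumptions rather than values taken from the product documentation. Use the language code and preferred MIME name from the lists referenced in step 1, and write to the file name and folder given there.

```python
import xml.etree.ElementTree as ET


def build_fallback_config(language, charset):
    """Return the <properties> XML text for the given fallback values."""
    if not language or not charset:
        # Step 1 notes that both values must be chosen together.
        raise ValueError("both a fallback language and a fallback charset are required")
    root = ET.Element("properties")
    ET.SubElement(root, "language", {"fallback": language})
    ET.SubElement(root, "charset", {"fallback": charset})
    return ET.tostring(root, encoding="unicode")


def write_fallback_config(path, language, charset):
    """Write the configuration to `path` (hypothetical helper, path chosen by the caller)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_fallback_config(language, charset))


if __name__ == "__main__":
    # Sample values only; replace them with the codes that apply to your content.
    print(build_fallback_config("ru", "utf-8"))
```

Building the file through an XML library rather than string concatenation guarantees plain straight quotes around the attribute values, which is easy to get wrong when the snippet is copied from a web page.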

To disable or update the fallback language and character set, the recommended procedure is as follows:

1. On your FAST administration node, update or delete %FASTSEARCH%\etc\config_data\DocumentProcessor\LanguageAndEncodingDetectorProperties.xml as appropriate.

2. Restart the configserver: nctrl restart configserver

3. This step and all subsequent steps must be run on each server running document (item) processors. Identify all running procservers by running the following: nctrl status

4. Stop all procservers: nctrl stop procserver_<id>
   Where <id> is an integer value. You must run this command for every procserver entry.

5. Delete the directory %FASTSEARCH%\var\configserver-cache\procserver.

6. Start all procservers: nctrl start procserver_<id>
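
Because steps 4 to 6 must be repeated for every procserver on every server, it can help to print the full command sequence before running anything. The following Python sketch only builds the command strings and executes nothing. The function name `per_server_plan` is hypothetical, the procserver ids must be read manually from the `nctrl status` output (the sketch does not parse it), and the `rmdir /s /q` line is an assumed way to delete the cache directory from a Windows command prompt.

```python
# Cache directory named in step 5; %FASTSEARCH% is expanded by the Windows shell.
CACHE_DIR = r"%FASTSEARCH%\var\configserver-cache\procserver"


def per_server_plan(procserver_ids):
    """Return the commands for steps 4 to 6, in order, for one server."""
    ids = sorted({int(i) for i in procserver_ids})
    if not ids:
        raise ValueError("at least one procserver id is required")
    plan = [f"nctrl stop procserver_{i}" for i in ids]    # step 4
    plan.append(f'rmdir /s /q "{CACHE_DIR}"')             # step 5 (assumed command)
    plan += [f"nctrl start procserver_{i}" for i in ids]  # step 6
    return plan


if __name__ == "__main__":
    # Ids 1 and 2 are sample values; use the ids reported on your own server.
    for command in per_server_plan([1, 2]):
        print(command)
```

Stopping every procserver before the cache is deleted, and starting them only afterwards, mirrors the order of the steps above.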