Extending SQL 2005 Fulltext Search

Extending SQL 2005 fulltext search

Author: Shajan Dasan (Development Lead, Microsoft Search)

Date: 1/24/2005

 

Fulltext indexing consists of extracting text out of documents, breaking the text into individual words and storing the words in an index for fast lookup. A filter (IFilter) is a COM component that extracts text out of documents. SQL 2005 (SQL Server 2005) ships with many filters out of the box – e.g. filters capable of extracting text out of Word, PowerPoint, html. WordBreakers (IWordBreaker) are language specific COM components capable of separating words from text. SQL Server 2005 supports WordBreakers for many languages out of the box – e.g. English, German, Japanese. It is possible to extend SQL Server 2005 fulltext search by implementing custom Filters and WordBreakers. The IFilter and IWordBreaker interfaces are documented in MSDN. This document describes authoring and installing Filters and Wordbreakers in SQL Server 2005.

Component loading model

SQL Server 2005 component (IFilter/IWordBreaker) loading model is different from previous versions of SQL Server. There is a backward compatible mode, which is described later in admin settings. Changes were made to better isolate different instances of SQL Server 2005 and improve the security of the indexing process. Different instances of SQL Server 2005 can be installed on the same machine. A change in components of one instance does not affect other instances. To improve the security of the indexing process, Authenticode signing of components is required by default. Components signature is verified before it is loaded.

Fulltext Search component installation involves (a) Installing the binaries and resources to the SQL Server 2005 instance folder and (b) Optionally installing the Authenticode certificates of the component publisher if the publisher is not trusted on the machine (c) Updating the instance specific registry keys.

Multi Instance isolation

Multiple instances of SQL can be installed on a machine. Each instance has its own copy of search components. Changes to components in one instance do not affect other instances. Each instance of SQL Server 2005 binaries are installed in a different folder. Each instance of SQL Server 2005 has a corresponding instance of fulltext search service. Fulltext search service instance has a registry root and install path. The search registry root is a node in the SQL instance registry and the search install path is the Binn directory of the SQL Server 2005 instance. Fulltext search instance information (registry root and install path) can also be found by expanding HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSFTESQLInstMap/<instance#>where instance# is the SQL Server 2005 instance number (1, 2, 3 …). The value of Path gives the path to search binaries, and RegRoot points to the registry root for the instance. The example below illustrates the registry layout for a multi instance SQL installation.

e.g.:

Fulltext Search instance map for instance #1, pointing to install path and registry root

Binn folder D:\MSSQL\MSSQL.1\MSSQL\Binn

Registry root HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSearch

 

 

Fulltext Search instance map for instance #2, pointing to install path and registry root

Binn folder D:\Program Files\Microsoft SQL Server\MSSQL.2\MSSQL\Binn

Registry root HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.2\MSSearch

Fulltext search registry root for instance 1, 2 inside SQL Server 2005 instance root

Registering Components

Components need to be installed for individual instances of SQL Server 2005. WordBreakers and Filters have slightly different installation procedure. First, install the binaries and configuration files (if any) to the SQL Server 2005 instance Binn folder, such as C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn. Follow the Registration procedure below.

WordBreaker registration:

1. Add the COM ClassID(s) as a key to <InstanceRoot>\MSSearch\CLSID. The default value of this key is the path to the component. If the component is installed in the instance Binn folder, the full path is optional. ThreadingModel can also be specified here. It is not required to register the ClassIDs as regular COM components.

2. Add a key to the <InstanceRoot>\MSSearch\Language for the installed language. Fill in the Locale number and StemmerClass, WBreakerClass ClassIDs. See IStemmer interface in MSDN for more information about stemming.

 https://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/html/wbrscenario_4ckl.asp

3. Configuration files include lexicons, noise word files and thesaurus. These are optional. Lexicons are configuration files commonly used by WordBreakers, and are internal to WordBreaker implementations. Noise files are used to eliminate common words from the index. A language specific noise word can be provided and its path specified in the NoiseFile field. (See below for a sample). Thesaurus file is also optional and can be specified. A thesaurus is used at query time to expand query phrases.

Component registration sample for English WordBreaker / Stemmer.

 

Filter registration:

1. Add the COM ClassID(s) as a key to <InstanceRoot>\MSSearch\CLSID. The default value of this key is the path to the component. If the component is installed in the instance Binn folder, the full path is optional. ThreadingModel can also be specified here. It is not required to register the ClassIDs as regular COM components.

2. Add a key to the <InstanceRoot>\MSSearch\Filters for the installed language. Fill in the Locale number and StemmerClass, WBreakerClass ClassIDs. See IStemmer interface in MSDN for more information about stemming.

Component registration sample for .aspx IFilter

 

Signature Validation

Component signature is validated before it is used. All components need to be signed, by the component publisher, and the publisher needs to be trusted on the machine. More information on code signing is available at MSDN. Signature verification can be disabled, this is not recommended, see admin settings below.

Admin Settings

Load OS Resources: By default, loading components (filters/word breakers) installed with the OS is disabled. This is the recommended setting, and helps isolate different instances of SQL Server 2005. This setting can be changed by the command

Exec Sp_fulltext_service ‘load_os_resources’, 1

Verify Signature: Signature verification is enabled by default. Disabling this is not recommended. This setting can be changed by the command

Exec Sp_fulltext_service ‘verify_signature’, 0

Apendix

For example, for the default SQL 2005 instance, the German word breaker registry key is at: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSearch\CLSID\{6DE705A5-6467-43DC-9CE3-CCFE08B3F645}

The stemmer registry key for the default instance is at:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSearch\CLSID\{9EA69E16-AD5E-43ED-B90E-73D58771D3F6}

In general it is better to register a new language for the specific wordbreaker. As long as the column that need to be indexed in that specific language, you still have this feature plus all of the regular language support. This will be a better solution rather than replacing the default language. If a column just needs regular word breaking, you can still use default language wordbreaker.

When registering a new LCID, the following TSQL command needs to execute in SQL Server 2005:

 

exec sp_fulltext_service 'update_languages'

 

It is possible to create a new LCID , for example LCID 1034, and register text parser, stemmer, ignored tokens file and thesaurus file for it. Then Fulltext Index can be created using TSQL command

 

CREATE FULLTEXT INDEX ON table1(column2 LANGUAGE 1034) KEY INDEX ncidx1 ON SpecialTCDB_Catalog

The TSQL code to create registry key and populate it:

 

exec master.dbo.xp_instance_regwrite 'HKEY_LOCAL_MACHINE', 'SOFTWARE\Microsoft\MSSQLSERVER\MSSearch\Language\LCID', 'Locale', 'REG_DWORD', 1034

exec master.dbo.xp_instance_regwrite 'HKEY_LOCAL_MACHINE', 'SOFTWARE\Microsoft\MSSQLSERVER\MSSearch\Language\LCID', 'NoiseFile', 'REG_SZ', 'noiseenu.txt'

exec master.dbo.xp_instance_regwrite 'HKEY_LOCAL_MACHINE', 'SOFTWARE\Microsoft\MSSQLSERVER\MSSearch\Language\LCID', 'StemmerClass', 'REG_SZ', '{EEED4C20-7F1B-11CE-BE57-00AA0051FE20}'

exec master.dbo.xp_instance_regwrite 'HKEY_LOCAL_MACHINE', 'SOFTWARE\Microsoft\MSSQLSERVER\MSSearch\Language\LCID', 'TSaurusFile', 'REG_SZ', 'TSENU.XML'

exec master.dbo.xp_instance_regwrite 'HKEY_LOCAL_MACHINE', 'SOFTWARE\Microsoft\MSSQLSERVER\MSSearch\Language\LCID', 'WBreakerClass', 'REG_SZ', '{188D6CC5-CB03-4C01-912E-47D21295D77E}'