Crawling case sensitive repositories using SharePoint Server 2010

Introduction

Many repositories are case sensitive. In such a repository, for example, the links:

 

https://myhost/CaseSensitivePage.htm

and

https://myhost/casesensitivepage.htm

 

represent different pages. While indexing such repositories, the crawler has to preserve the case of the links it discovers so that they stay valid. Examples of such repositories are:

- web sites deployed on Apache Server

- Linux file shares

- Business Data Catalogs

- Etc.

By default, the SharePoint Search crawler normalizes every link it discovers by converting it to lower case. This normalization prevents it from crawling case sensitive repositories correctly. SharePoint Server 2007 SP1 had limited support for crawling case sensitive repositories: it allowed administrators to set the following registry key:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Applications\<GUID>\Gather\Portal_Content\CaseSensitiveURLs

When set, this registry key will ensure that all the crawling operations preserve case. More details about this provision can be found in this KB article.
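To make the problem concrete, here is a minimal PowerShell sketch of what lower-case normalization does to the two example links from the introduction. It only imitates the normalization with the .NET ToLowerInvariant string method; it is an illustration, not the crawler's actual code.

```powershell
# Illustration only: imitate lower-case normalization with a plain string call.
$links = @(
    'https://myhost/CaseSensitivePage.htm',
    'https://myhost/casesensitivepage.htm'
)

# Lowercase every link, as the crawler does by default.
$normalized = $links | ForEach-Object { $_.ToLowerInvariant() }

# Select-Object -Unique compares strings case-sensitively,
# so it tells us how many distinct links remain.
$distinctBefore = @($links | Select-Object -Unique).Count
$distinctAfter  = @($normalized | Select-Object -Unique).Count

"Distinct links before normalization: $distinctBefore"
"Distinct links after normalization:  $distinctAfter"
```

After normalization the two links are indistinguishable, so on a case sensitive host at least one of the original pages can no longer be addressed.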

SharePoint Server 2010 solution

SharePoint Server 2010 extends this functionality by giving administrators more flexibility: they can explicitly specify patterns of repositories, hosts, or links that should be crawled with the case of links preserved. This is achieved by creating case sensitive crawl rules. A global case preservation flag can also be set via the object model (OM) and PowerShell (discussed at the end of this post).

Case sensitive crawl rules

SharePoint Server 2010 extends crawl rules with two major features: Regular Expressions and Case Preservation. Regular Expressions in crawl rules are discussed in this blog post, so we will not cover them in detail here and will focus instead on case preservation support. A crawl rule can be made case sensitive, and the links that match that particular rule then have their case preserved.

 

 

Creating case sensitive crawl rule via UI

A case sensitive crawl rule can be created through the crawl rule UI and its new “Match Case” checkbox. The screenshot below shows the checkbox for enabling case preservation in crawling.

 

 

 

Creating case sensitive crawl rule via PowerShell 

Case sensitive crawl rules can also be created using PowerShell. The script below creates a case sensitive crawl rule for a host called “mycasesensitivehost”:

# Get the search application object
$app = Get-SPEnterpriseSearchServiceApplication "<SSA Name>"

# Create an inclusion crawl rule (-Type 0 means inclusion)
$rule = New-SPEnterpriseSearchCrawlRule -SearchApplication $app -Path "https://mycasesensitivehost/*" -Type 0

# Set CaseSensitiveURL to true
$rule.CaseSensitiveURL = $true

# Update the rule to save the change
$rule.Update()
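As a quick check after running the script, the rule can be read back from the search application. This is a small sketch that assumes the $app variable from the script above is still in scope; it lists the existing crawl rules with the Get-SPEnterpriseSearchCrawlRule cmdlet and shows the CaseSensitiveURL property that was just set.

```powershell
# Assumes $app still holds the search service application from the script above.
# Lists every crawl rule with its path, type, and case-sensitivity flag.
Get-SPEnterpriseSearchCrawlRule -SearchApplication $app |
    Select-Object Path, Type, CaseSensitiveURL |
    Format-Table -AutoSize
```

The rule for https://mycasesensitivehost/* should appear in the list with CaseSensitiveURL set to True.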

 

How does it work?

If an admin wants a particular repository (for example: https://mycasesensitivehost/) to be crawled with the links’ case preserved, the admin needs to perform the following three steps:

1. Create an inclusion crawl rule that matches links belonging to the repository: https://mycasesensitivehost/*

2. Check the “Match Case” checkbox.

3. Create a content source for crawling with a start address in the repository (e.g. https://mycasesensitivehost/StartPage.htm) and start a full crawl. The crawler has to complete a full crawl on the first run in order for the case preservation setting to take effect.
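The same three steps can also be scripted end to end. The following is a sketch rather than a definitive procedure: it assumes a search service application named "<SSA Name>", and the content source name "Case Sensitive Host" is a made-up label chosen for this example.

```powershell
# Get the search application object ("<SSA Name>" is a placeholder).
$app = Get-SPEnterpriseSearchServiceApplication "<SSA Name>"

# Steps 1 and 2: create the inclusion rule and turn on case preservation
# (the scripted equivalent of checking "Match Case").
$rule = New-SPEnterpriseSearchCrawlRule -SearchApplication $app -Path "https://mycasesensitivehost/*" -Type 0
$rule.CaseSensitiveURL = $true
$rule.Update()

# Step 3: create a web content source for the repository.
# "Case Sensitive Host" is a hypothetical name used only in this example.
$source = New-SPEnterpriseSearchCrawlContentSource -SearchApplication $app -Type Web -Name "Case Sensitive Host" -StartAddresses "https://mycasesensitivehost/StartPage.htm"

# Start a full crawl so that the case preservation setting takes effect.
$source.StartFullCrawl()
```

Creating the rule before the first crawl matters here, because links picked up by an earlier crawl would already have been stored in lower case.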

Because the crawler converts links to lower case by default, any link that does not match a case sensitive crawl rule is still normalized to lower case.
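The per-link decision described above can be sketched in a few lines of PowerShell. This illustrates the behaviour only and is not the crawler's implementation: the rule objects and the Resolve-LinkCase function are hypothetical, and matching rule paths with the -like wildcard operator (which ignores case) is an assumption made for this example.

```powershell
# Hypothetical stand-ins for crawl rules: a wildcard path plus a case flag.
$rules = @(
    New-Object PSObject -Property @{ Path = 'https://mycasesensitivehost/*'; CaseSensitiveURL = $true }
    New-Object PSObject -Property @{ Path = 'https://mynormalhost/*';        CaseSensitiveURL = $false }
)

# Hypothetical helper: returns the link as the crawler would store it.
function Resolve-LinkCase {
    param([string]$Link, [object[]]$Rules)

    foreach ($rule in $Rules) {
        # A link that matches a case sensitive rule keeps its original case.
        if ($rule.CaseSensitiveURL -and ($Link -like $rule.Path)) {
            return $Link
        }
    }

    # No case sensitive rule matched: apply the default lower-case normalization.
    return $Link.ToLowerInvariant()
}

Resolve-LinkCase -Link 'https://mycasesensitivehost/StartPage.htm' -Rules $rules
Resolve-LinkCase -Link 'https://mynormalhost/StartPage.htm' -Rules $rules
```

The first call returns the link unchanged, while the second returns it in lower case, which mirrors the difference between the two hosts in the crawl log discussed later in this post.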

 

How does it appear in search results?

If configured to Match Case, the crawler preserves the case of links in search results, which matters when links that differ only in case point to different documents, as in the example below.

How does it appear in crawl logs?

Crawl logs show the links with preserved case. Please see the image below:

 

Notice that the links from “mycasesensitivehost” keep their case, while the ones from “mynormalhost” are all lower case.

Setting global case preservation via OM/PowerShell

The search application object exposes a property, “CaseSensitiveCrawling”, that can be read and set via the SearchApplication.GetProperty and SearchApplication.SetProperty methods. Below is an example of manipulating this property via PowerShell.

# Get the search application object

PS D:\> $searchApp = Get-SPEnterpriseSearchServiceApplication -local

 

# Check the property value; which is False by default.

PS D:\> $searchApp.GetProperty("CaseSensitiveCrawling")

False

 

# Now set the property to true

PS D:\> $searchApp.SetProperty("CaseSensitiveCrawling", 1)

 

# Check if property is set.

PS D:\> $searchApp.GetProperty("CaseSensitiveCrawling")

True

When the “CaseSensitiveCrawling” property is set to true, the crawler preserves the case of all links crawled by the given search application.

 

I hope this post sheds light on how to configure case sensitive crawls in SharePoint 2010. I welcome any questions and comments.

 

Syed Anas Hashmi

Software Design Engineer in Test

Microsoft Corp.