Limit or increase the quantity of content that is crawled (Search Server 2008)

Applies To: Microsoft Search Server 2008

 

Topic Last Modified: 2009-04-21

Note

Unless otherwise noted, the information in this article applies to both Microsoft Search Server 2008 and Microsoft Search Server 2008 Express.

During operations, you typically need to change the quantity of content that you are currently crawling. For example, you might want to:

  • Discontinue crawling some sites within a particular namespace that is defined by an existing content source.

  • Crawl sites at a different depth.

  • Change the number of file types to crawl — that is, start crawling file types that you have not crawled before, discontinue crawling of certain file types that you are currently crawling, or both.

As the needs of your organization change, you might also crawl entirely new sources of content. For more information about crawling entirely new sources of content, see About content sources (Search Server 2008).

You can increase or limit the quantity of content that is crawled by using:

  • Crawl settings in the content sources   For example, you can specify to crawl only the start addresses that are specified in a particular content source, or you can specify how many levels deep in the namespace (from those start addresses) to crawl and how many server hops to allow. Note that the options that are available within a content source for specifying the quantity of content that is crawled vary by content-source type.

  • File type inclusions   You can choose the file types that you want to crawl.

  • Crawl rules   You can use crawl rules to exclude all items in a given path from being crawled. This is a good way to ensure that subsites that you do not want to index are not crawled with a parent site that you are crawling. You can also use crawl rules to increase the amount of content that is crawled, for example, by crawling complex URLs for a given path.

Crawl settings

For each content source, you can select how extensively to crawl the start addresses in that content source. You also specify the behavior of the crawl, sometimes called the crawl settings. The options you can choose for a particular content source vary somewhat based on the content source type that you select. However, most options determine how many levels deep in the hierarchy from each start address listed in the content source are crawled. Note that this behavior is applied to all start addresses in a particular content source.

The following table describes the crawl settings options that are available for each content source type.

Content source type | Crawl settings options

SharePoint sites

  • Everything under the host name for each start address

  • Only the SharePoint site of each start address

Web sites

  • Only within the server of each start address

  • Only the first page of each start address

  • Custom — Specify page depth and number of server hops.

    Note

    The default setting for this option is unlimited page depth and unlimited server hops.

File shares

  • The folder and all subfolders of each start address

  • Only the folder of each start address

Exchange public folders

  • The folder and all subfolders of each start address

  • Only the folder of each start address

As the preceding table shows, search services administrators can use crawl setting options to limit or increase the quantity of content that is crawled.
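The effect of the Web site options can be illustrated with a small model. The following Python sketch is not part of Search Server 2008 and does not call any of its APIs; it only simulates a breadth-first crawl over a hypothetical in-memory link graph (`LINKS`), where `max_depth` and `max_hops` are assumed names for the page depth and server hop limits. It treats page depth as the number of links followed from the start address, which is a simplification of how a real crawler counts depth.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph: page URL -> URLs that the page links to.
LINKS = {
    "http://a.example/": ["http://a.example/p1", "http://b.example/"],
    "http://a.example/p1": ["http://a.example/p2"],
    "http://a.example/p2": [],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

def crawl(start, max_depth=None, max_hops=None, links=LINKS):
    """Return the URLs visited by a breadth-first crawl from start.

    max_depth: how many links deep to follow (None means unlimited).
    max_hops: how many server changes to allow (None means unlimited).
    """
    seen = {start}
    queue = deque([(start, 0, 0)])  # (url, depth, hops used so far)
    while queue:
        url, depth, hops = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # depth limit reached: do not follow links from this page
        for target in links.get(url, []):
            if target in seen:
                continue
            crossed = urlsplit(target).netloc != urlsplit(url).netloc
            target_hops = hops + 1 if crossed else hops
            if max_hops is not None and target_hops > max_hops:
                continue  # following this link would exceed the hop limit
            seen.add(target)
            queue.append((target, depth + 1, target_hops))
    return seen
```

In this model, `crawl(start, max_depth=0)` corresponds to crawling only the first page of each start address, `crawl(start, max_hops=0)` corresponds to crawling only within the server of each start address, and passing both limits corresponds to the Custom option. Leaving both as `None` mirrors the unlimited default.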

The following table describes best practices when configuring crawl setting options.

For this content source type | If this pertains | Use this crawl setting option

SharePoint sites

You want to crawl the content on a particular site collection on a different schedule than other site collections.

Crawl only the SharePoint site of each start address

Note

This option accepts any URL, but will start the crawl from the top-level site of the site collection that is specified in the URL you enter. For example, if you enter http://contoso/sites/sales/car but http://contoso/sites/sales is the top-level site of the site collection, the site collection http://contoso/sites/sales and all of its subsites are crawled.

SharePoint sites

You want to crawl all content in all site collections in a particular Web application on the same schedule.

Crawl everything under the host name of each start address

Note

This option accepts only host names as start addresses, such as http://contoso. You cannot enter the URL of a subsite, such as http://contoso/sites/sales when using this option.

Web sites

Content on the site itself is relevant.

-or-

Content available on linked sites is not likely to be relevant.

Crawl only within the server of each start address

Web sites

Relevant content is on only the first page.

Crawl only the first page of each start address

Web sites

You want to limit how deep to crawl the links on the start addresses.

Custom — Specify the number of pages deep and number of server hops to crawl

Note

We recommend that you start with a small number on a highly connected site, because specifying more than three pages deep or more than three server hops can cause the crawler to follow links across a very large part of the Internet.

Note

You can also use one or more crawl rules to specify what content to crawl. For more information, see Use crawl rules to determine what content gets crawled (Search Server 2008).

File shares

Exchange public folders

Content available in the subfolders is not likely to be relevant.

Crawl only the folder of each start address

File shares

Exchange public folders

Content in the subfolders is likely to be relevant.

Crawl the folder and all subfolders of each start address

File-type inclusions and IFilters

Content is crawled only if the relevant file name extension is included in the file-type inclusions list and an IFilter that supports that file type is installed on the index server. Several file types are included automatically during initial installation. By analyzing the query logs, you can discover which file types contain content that your end users want to query. You might discover the need to crawl a file type that you are not currently crawling, or you might want to exclude certain file types from being crawled.

When you add file types to the file type inclusions list, you must also ensure that you have an IFilter that can be used to parse the file type when crawled. If such an IFilter is not installed, the content in the files of that file type will not be indexed, and will not be searchable. However, the metadata of files of that particular file type will be crawled and will be searchable. For example, if you add PDF to the file type inclusions list but do not install an IFilter for the PDF file type, the content of PDF files will not be indexed, but the metadata of PDF files will.
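The interaction between the file type inclusions list and installed IFilters can be summarized as a small decision function. The following Python sketch is illustrative only: the function name `index_outcome` and the two sets passed to it are hypothetical and do not correspond to any Search Server 2008 API.

```python
def index_outcome(extension, included_types, ifilter_types):
    """Describe what is indexed for a file with the given extension.

    included_types: extensions on the file type inclusions list (assumed input).
    ifilter_types: extensions that an installed IFilter can parse (assumed input).
    """
    ext = extension.lower().lstrip(".")
    if ext not in included_types:
        return "not crawled"
    if ext not in ifilter_types:
        # Included, but no IFilter: only the file's metadata is searchable.
        return "metadata only"
    return "content and metadata"

included = {"pdf", "docx", "txt"}
ifilters = {"docx", "txt"}
print(index_outcome(".pdf", included, ifilters))   # metadata only
print(index_outcome("docx", included, ifilters))   # content and metadata
print(index_outcome("zip", included, ifilters))    # not crawled
```

The PDF case matches the example in the preceding paragraph: the extension is on the inclusions list, but without a PDF IFilter only the metadata is indexed.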

Microsoft Search Server 2008 provides several IFilters, and more are available from Microsoft and third-party vendors. If necessary, software developers can create IFilters for new file types. To install and register additional IFilters provided by Microsoft with Search Server 2008, see How to register Microsoft Filter Pack with SharePoint Server 2007 and with Search Server 2008 (https://go.microsoft.com/fwlink/?LinkId=110532). For more information about IFilters, including those from third-party vendors, see Filter Central (https://go.microsoft.com/fwlink/?LinkID=131255).

For a list of the file types that are supported by the IFilters that are installed by default and which file types are enabled for crawling by default, see Crawl more file types by installing IFilters (Search Server 2008).

Limit or exclude content by using crawl rules

You can edit existing crawl rules or create new crawl rules to exclude all items or include specific items for a particular path.

Note

When you add a start address to a content source and accept the default behavior, all subsites or folders below that start address are crawled unless you exclude them by using one or more crawl rules.

Crawl rules apply to a particular URL, or to a set of URLs represented by wildcards. (This URL is also referred to as the path affected by the rule.) You use crawl rules to do the following things:

  • Avoid crawling less relevant content by excluding one or more URLs. This also helps to reduce the use of server resources and network traffic, and to increase the relevance of search results.

  • Crawl links on the URL without crawling the URL itself. This option is useful for sites with links to relevant content when the page containing the links does not contain relevant information or should not be exposed to end users in search results pages.

  • Enable complex URLs to be crawled. This option crawls URLs that contain a query parameter specified with a question mark. Depending upon the site, these URLs might or might not include relevant content. Because complex URLs can often redirect to less relevant sites, it is a good idea to enable this option on only sites where the content available from complex URLs is known to be relevant.

    Note

    This option has no effect when crawling SharePoint sites, because Search Server 2008 enumerates all content when crawling SharePoint sites.
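To make the notion of a complex URL concrete, the following Python sketch checks whether a URL carries a query string after a question mark and skips such URLs unless an option enables them. The function names and the `crawl_complex_urls` flag are hypothetical; the sketch only illustrates the idea and is not how Search Server 2008 implements the setting.

```python
from urllib.parse import urlsplit

def is_complex_url(url):
    """Return True if the URL has a non-empty query string after a question mark."""
    return urlsplit(url).query != ""

def should_crawl(url, crawl_complex_urls=False):
    # Hypothetical decision: skip complex URLs unless the rule enables them.
    return crawl_complex_urls or not is_complex_url(url)
```

For example, `should_crawl("http://server1/list.aspx?id=7")` returns `False` in this sketch, and the same call with `crawl_complex_urls=True` returns `True`.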

Note

Crawl rules apply simultaneously to all content sources.

Often, most of the content for a particular site address is relevant, but not a specific subsite or range of sites below that site address. By selecting a focused combination of URLs for which to create crawl rules that exclude unneeded items, search services administrators can maximize the relevance of the content in the index while minimizing the impact on crawling performance and the size of search databases. Creating crawl rules to exclude URLs is particularly useful when planning start addresses for external content because the impact on resource usage is not under the control of people in your organization.

When creating a crawl rule, you can use standard wildcard characters in the path. For example:

  • http://server1/folder* matches all Web resources with a URL that starts with http://server1/folder.

  • *://*.txt matches every document with the .txt file name extension.
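The two wildcard paths above can be tried out with a short Python sketch. It assumes that `*` stands for any run of characters and that every other character is literal; the helper names `rule_matches` and `apply_exclusions` are hypothetical, matching here is case-sensitive, and the product's own matching rules might differ in detail.

```python
import re

def rule_matches(pattern, url):
    """Return True if url matches a crawl rule path in which * is a wildcard."""
    # Escape everything, then turn each escaped * back into "any characters".
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.fullmatch(regex, url) is not None

def apply_exclusions(urls, exclude_patterns):
    """Keep only the URLs that no exclusion pattern matches."""
    return [u for u in urls
            if not any(rule_matches(p, u) for p in exclude_patterns)]

urls = [
    "http://server1/folder/a.html",
    "http://server1/home.html",
    "http://server1/readme.txt",
]
print(apply_exclusions(urls, ["http://server1/folder*"]))
# ['http://server1/home.html', 'http://server1/readme.txt']
```

Translating the pattern to a regular expression, rather than using a general-purpose glob matcher, keeps characters such as `?` and `[` literal, which matters for URLs that contain query strings.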

Because crawling content consumes resources and bandwidth, it is better to start by including a smaller amount of content that you know is relevant. After the initial deployment, you can review the query and crawl logs, and then adjust content sources and crawl rules to improve relevance and include more content.

To limit or increase the quantity of content that is crawled, you can perform the following procedures: