About content sources (Search Server 2008)
Applies To: Microsoft Search Server 2008
Topic Last Modified: 2009-04-27
Note
Unless otherwise noted, the information in this article applies to both Microsoft Search Server 2008 and Microsoft Search Server 2008 Express.
Content is any item that can be crawled, such as a Web page, a Microsoft Office Word document, business data, or an e-mail message. Content resides in a content repository, such as a Web site, file share, or SharePoint site. A content source specifies settings that define how and on what schedule content is crawled. It includes one or more addresses of a content repository from which to start crawling, also called start addresses. These settings apply to all start addresses in the content source.
Default content source
If your organization needs to crawl only the content in its SharePoint sites, you might not have to create an additional content source. Search Server 2008 defines a default content source during its initial deployment. The default content source is named Local Office SharePoint Server sites. The start addresses of all Web applications in the server farm are automatically included in the default content source. By default, this content source is not crawled. To index the content in the default content source, you have to either start a crawl manually or schedule crawls for it.
Creating a new content source
When you create a content source, you specify settings that define the kind of content it crawls, when the content is crawled, and crawling behavior, such as how deep to crawl within the namespace of the start address or how many server hops to allow. If you have multiple kinds of content repositories that you want to crawl, or you want to crawl some content repositories on different schedules, you have to create additional content sources. Search Server has one Shared Services Provider (SSP), which supports up to 500 content sources. For more information, see the “Plan content sources” section of Plan to crawl content (Search Server 2008). For more information about how to configure crawling behavior, see Limit or increase the quantity of content that is crawled (Search Server 2008).
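To make the two crawling-behavior limits concrete, the following Python sketch shows one plausible way a crawler could apply a page-depth limit and a server-hop limit when it decides whether to follow a link. It is an illustration only: `CrawlLimits`, `should_follow`, and the field names are hypothetical and are not part of Search Server, and the exact way Search Server counts depth and hops is an assumption here.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class CrawlLimits:
    """Hypothetical settings; the names are illustrative, not Search Server's."""
    max_page_depth: int   # links followed within the same host
    max_server_hops: int  # times the crawl may move to a different host


def should_follow(current_url, link_url, page_depth, server_hops, limits):
    """Decide whether a link found on current_url stays within the limits.

    page_depth and server_hops describe how current_url itself was reached.
    """
    same_host = urlparse(current_url).hostname == urlparse(link_url).hostname
    if same_host:
        # Assumption: a link on the same host costs one level of depth.
        return page_depth + 1 <= limits.max_page_depth
    # Assumption: a link to another host costs one server hop.
    return server_hops + 1 <= limits.max_server_hops


# Example: stay on the start address's host, at most two links deep.
limits = CrawlLimits(max_page_depth=2, max_server_hops=0)
```

With these example limits, a link from a page at depth 1 to another page on the same host is followed, while any link to a different host is skipped.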
Types of content repositories
You can crawl only one kind of content per content source. That is, you can create a content source that contains URLs for SharePoint sites and another that contains URLs for file shares. But you cannot create a single content source that contains URLs for both SharePoint sites and file shares.
The following table lists the kinds of content that Search Server can crawl and index:
This kind of content source | Includes this kind of content |
---|---|
SharePoint sites | |
Web sites | |
File shares | |
Exchange public folders | |
Lotus Notes | |
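The rule of one kind of content per content source can be sketched as a simple planning step: given a mixed list of repositories, group them so that each planned content source holds start addresses of a single kind. This is a minimal sketch in Python, not a Search Server API. The function name, the generated content source names, and the example addresses are all hypothetical; only the kind names come from the table above.

```python
from collections import defaultdict

# Kinds of content source listed in the table above.
KINDS = {
    "SharePoint sites",
    "Web sites",
    "File shares",
    "Exchange public folders",
    "Lotus Notes",
}


def plan_content_sources(repositories):
    """Group (kind, start_address) pairs into one content source per kind.

    Returns a dict that maps a hypothetical content source name to its
    list of start addresses.
    """
    plan = defaultdict(list)
    for kind, address in repositories:
        if kind not in KINDS:
            raise ValueError(f"unknown kind of content: {kind!r}")
        plan[f"{kind} content source"].append(address)
    return dict(plan)


# Hypothetical repositories of two different kinds.
repositories = [
    ("SharePoint sites", "http://portal.example/sites/hr"),
    ("File shares", "file://fileserver.example/docs"),
    ("SharePoint sites", "http://portal.example/sites/it"),
]
plan = plan_content_sources(repositories)
```

Here the two SharePoint addresses end up in one planned content source and the file share in another. Repositories of the same kind that need different crawl schedules would still have to be split further by hand.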
Start address of content
Each content source maintains a list of start addresses that the crawler uses to connect to the repository of content. Each content source can contain up to 500 start addresses. You cannot crawl the same address using multiple content sources. For example, if you use a particular content source to crawl a site collection and all its subsites, you cannot use a different content source to crawl one of those subsites on a different schedule.
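Before you create content sources, it can help to check a planned set of start addresses against these two constraints. The Python sketch below does that for a dictionary of planned content sources. It is not a Search Server tool: the function names and example addresses are hypothetical, and treating "the same address" as a path-prefix match on the same scheme and host is an assumption based on the site collection and subsite example above.

```python
from urllib.parse import urlparse

MAX_START_ADDRESSES = 500  # per content source, as stated above


def _key(address):
    """Return (scheme, host, path segments) for a simple prefix comparison."""
    parts = urlparse(address.lower())
    segments = tuple(s for s in parts.path.split("/") if s)
    return parts.scheme, parts.netloc, segments


def covers(parent, child):
    """True if child is the same address as parent or lies beneath it."""
    p_scheme, p_host, p_path = _key(parent)
    c_scheme, c_host, c_path = _key(child)
    same_site = (p_scheme, p_host) == (c_scheme, c_host)
    return same_site and c_path[:len(p_path)] == p_path


def find_conflicts(content_sources):
    """Return a list of problems found in {name: [start addresses]}."""
    problems = []
    for name, addresses in content_sources.items():
        if len(addresses) > MAX_START_ADDRESSES:
            problems.append(f"{name}: more than {MAX_START_ADDRESSES} start addresses")
    names = list(content_sources)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            for a in content_sources[first]:
                for b in content_sources[second]:
                    if covers(a, b) or covers(b, a):
                        problems.append(f"{first} and {second} overlap: {a} / {b}")
    return problems


# Hypothetical plan: the second source targets a subsite of the first.
sources = {
    "Portal": ["http://portal.example/sites/hr"],
    "HR benefits": ["http://portal.example/sites/hr/benefits"],
    "Files": ["file://fileserver.example/docs"],
}
conflicts = find_conflicts(sources)
```

In this example the check reports one overlap, between "Portal" and "HR benefits", because the second start address lies beneath the first.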
Crawling content
You can use a content source to manually start a crawl or schedule when and how often the selected content source is crawled. If you want to crawl content in a part of your content source on a different schedule, you must create a separate content source for that content. For performance and manageability reasons, we recommend that you use as few content sources as possible. For more information about starting a crawl manually or scheduling a crawl, see Crawl content (Search Server 2008).
Authentication
When the crawler accesses the start addresses listed in a content source, the crawler must be authenticated by and granted access to the servers that host that content. The user account that is used by the crawler must have at least read permission to crawl content. By default, Search Server uses the default content access account and uses NTLM when authenticating with servers. For more information, see Configure how the crawler authenticates (Search Server 2008).
See Also
Concepts
Plan to crawl content (Search Server 2008)
Configure searches to return blog post results (Search Server 2008)
Configure client certificates for crawling an SSL site (Search Server 2008)