MOSS Enterprise Search - 16 things you might not know

Hello everybody \o/ - here are a few bits and pieces you might find useful when designing and deploying Enterprise Search based on Office SharePoint Server 2007. Most of these tips can be credited to Partner training delivered by Morten Schioldan.

Really interesting things

  • On document libraries which support approval, users might be able to search for and view abstracts of unpublished content which they would otherwise not be able to see. This happens if the crawler account has permission to view drafts, which is separate from any security trimming applied to the results. Using a crawler account with only reader permissions would prevent this, but conversely a user with rights to see draft content could then only search the last published version. Since the draft and published versions share a URL, it's either one or the other - a decision to be made by your customer.
  • Keyword search has moved from implicit OR to implicit AND as standard, although this can be configured on calls through the web service. Additionally, Nigel Bridport pointed out to me that if you use the Advanced Search, which in effect creates a SQL query, you can choose whether to AND or OR your terms, as well as use wildcards in your keywords, etc.
  • Only a full crawl will re-index ASPX files. Therefore, if you have removed documents from a document library, the library may still appear in the results when you search for a removed document's name after an incremental crawl. Graham Tyler spotted this one.
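
To make the implicit AND versus implicit OR point above concrete, here is a minimal Python sketch over a toy in-memory corpus. It is not how the MOSS query engine works internally: the `keyword_search` function and the sample documents are hypothetical, and the sketch only illustrates how the two defaults change the result set for the same keywords.

```python
def keyword_search(query, documents, implicit_operator="AND"):
    """Return the names of documents matching the query terms.

    documents: dict mapping a document name to its text.
    implicit_operator: "AND" (every term must appear) or "OR" (any term).
    """
    if implicit_operator not in ("AND", "OR"):
        raise ValueError("implicit_operator must be 'AND' or 'OR'")
    terms = query.lower().split()
    combine = all if implicit_operator == "AND" else any
    results = []
    for name, text in documents.items():
        words = set(text.lower().split())
        # An empty query matches nothing rather than everything.
        if terms and combine(term in words for term in terms):
            results.append(name)
    return sorted(results)


# Hypothetical sample corpus, for illustration only.
DOCS = {
    "budget.docx": "annual budget forecast",
    "forecast.xlsx": "sales forecast by region",
    "holiday.docx": "holiday rota",
}
```

With this corpus, `keyword_search("budget forecast", DOCS, "AND")` matches only the document containing both terms, while the same call with `"OR"` also returns the document that contains just one of them. That is the behaviour change users tend to notice when moving to implicit AND: fewer, tighter results for multi-word queries.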

Capacity

  • There is a tested recommended maximum of 4 SSPs per farm, and a hard limit of 20.
  • The tested recommended maximum is 50 million documents across all content sources in an index. Since you can have only one index server per Shared Services Provider, the recommended approach for more capacity is to add another SSP. There is no supported way to union the results from separate indexes. Don't forget that BDC items count towards your total.
  • There is a hard limit of 500 start addresses per content source. If the default content source is full, additional new sites will not have their start addresses registered (anywhere), so you'll need to add them manually to another content source, or change the default content source.
  • Information about the size of search indexes and search databases relative to corpus size is due to be released on TechNet shortly.
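
The capacity figures above lend themselves to some back-of-the-envelope arithmetic. The following Python sketch is only a planning aid built on the numbers quoted in this post; the function names are hypothetical, and it assumes the only reason for adding SSPs is index size.

```python
import math

DOCS_PER_INDEX = 50_000_000        # tested recommended maximum per index (one index per SSP)
RECOMMENDED_SSPS_PER_FARM = 4      # tested recommended maximum quoted above
HARD_LIMIT_SSPS_PER_FARM = 20      # hard limit quoted above
START_ADDRESSES_PER_SOURCE = 500   # hard limit per content source


def ssps_needed(total_items):
    """Number of SSPs needed so that no index exceeds the recommended size.

    total_items should include BDC items as well as documents.
    """
    return max(1, math.ceil(total_items / DOCS_PER_INDEX))


def farm_status(total_items):
    """Classify a corpus size against the per-farm SSP figures."""
    needed = ssps_needed(total_items)
    if needed > HARD_LIMIT_SSPS_PER_FARM:
        return "exceeds hard limit"
    if needed > RECOMMENDED_SSPS_PER_FARM:
        return "beyond tested recommendation"
    return "within tested recommendation"


def split_start_addresses(addresses):
    """Split start addresses into groups that each fit one content source."""
    return [
        addresses[i:i + START_ADDRESSES_PER_SOURCE]
        for i in range(0, len(addresses), START_ADDRESSES_PER_SOURCE)
    ]
```

For example, a corpus of 120 million items works out at three SSPs, and 1,201 start addresses need three content sources (500, 500 and 201). Bear in mind that each SSP's index is queried separately, so splitting a corpus this way also splits the search experience.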

Configuration and topology

  • Since the Indexer crawls a Web Front End for SharePoint content, in a load balanced scenario any of the Web Front End servers could get hit. You can override this and specify a single WFE for the Indexer to use. In this case, you could exclude that server from the load balancer and have it as a dedicated WFE for indexing.
  • Query and Index roles can run on the same server. However, to add extra Query servers, the Indexer must run alone. E.g. you would scale out from Query+Index on one server to Index on one server and Query on two servers.
  • Collapsing your WFE and Query roles onto the same server can improve query performance, especially when custom security trimmers are used on the WFEs to iterate over the results.
  • Architectures can be mixed between roles (e.g. 32-bit WFE, 64-bit Indexer) but not within the same role.
  • 64-bit is recommended for the Indexer, but most custom IFilters, like Adobe PDF, don't support 64-bit just yet.
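
Two of the topology rules above are mechanical enough to check in a few lines. This Python sketch is a hypothetical planning helper, not anything shipped with SharePoint: the `Server` type and `topology_problems` function are made up for illustration, and it reads "the Indexer must run alone" as "the Index server must not also host the Query role".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    roles: frozenset   # subset of {"WFE", "Query", "Index"}
    bits: int          # 32 or 64


def topology_problems(servers):
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    query_servers = [s for s in servers if "Query" in s.roles]
    index_servers = [s for s in servers if "Index" in s.roles]

    # Query and Index may share a server, but once there is more than one
    # Query server the Indexer must not also host the Query role.
    if len(query_servers) > 1:
        for s in index_servers:
            if "Query" in s.roles:
                problems.append(
                    f"{s.name}: Index must not share a server with Query "
                    "when there are multiple Query servers"
                )

    # Architectures may differ between roles, but not within a role.
    for role in ("WFE", "Query", "Index"):
        architectures = {s.bits for s in servers if role in s.roles}
        if len(architectures) > 1:
            problems.append(f"{role}: mixed 32-bit and 64-bit servers")
    return problems
```

A single Query+Index server passes, as does a 64-bit Indexer alongside two 32-bit WFE+Query servers. Adding a second Query server next to a combined Query+Index box is flagged, which mirrors the scale-out path described above.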

Crawling and Querying

  • The default maximum file size the Indexer will download and parse is 16 MB. This is a registry entry on the Indexer - one to update if customers deal with large PowerPoint slide decks etc.
  • Word stemming of search terms is off by default. You can enable this in the Core Search Results web part, but it may skew the new improved ranking mechanism.
  • Search usage reports gather data from client side asynchronous JavaScript when a user selects a result from the results page. Therefore, reports don't show search terms used through the API or the Query Web Service.
  • It's kind of obvious, but forms authentication to external websites is not supported for crawling.
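
Before touching the file size limit on the Indexer, it can help to know how much content is actually affected. Here is a small Python sketch for that, assuming you already have a mapping of file names to sizes in bytes from some inventory. The `oversized` function is hypothetical, and it assumes "16 MB" means 16 x 1024 x 1024 bytes and that a file exactly at the limit is still indexed.

```python
DEFAULT_MAX_DOWNLOAD_BYTES = 16 * 1024 * 1024  # assumption: 16 MB as binary megabytes


def oversized(files, limit_bytes=DEFAULT_MAX_DOWNLOAD_BYTES):
    """Return the names of files larger than the limit, largest first.

    files: dict mapping a file name to its size in bytes.
    """
    too_big = [(size, name) for name, size in files.items() if size > limit_bytes]
    return [name for size, name in sorted(too_big, reverse=True)]
```

If the returned list is empty, the default is probably fine; if it is full of slide decks, that is a good hint the limit needs raising.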