Changes in SharePoint 2013 Search
Changes in SharePoint 2013 crawling
Continuous crawl
In SharePoint 2013, you can configure crawl schedules so that crawls are performed continuously.
What are continuous crawls?
Continuous crawls are a new feature in SharePoint 2013 that helps keep the search index and search results as fresh as possible. Because of changes in how the index is created and stored, a document can appear in the index within seconds of passing through the content processing component; you no longer have to wait for long index merges before it shows up in results. You can also pick up the latest changes while a full crawl is still in progress, so results appear before the full crawl completes.
Note: Continuous crawls only work with content sources of type SharePoint sites.
How do they work?
In SharePoint 2010 we have full crawls and incremental crawls. Both can be scheduled, and the incremental crawl in particular is often scheduled to run frequently. An incremental crawl retrieves and indexes all content changed since the last crawl. But the results may not be what you expect: the index is not always as up to date as you hoped.
Let’s examine the following scenario:
You have scheduled an incremental crawl to run every 15 minutes.
In the first window (A), the crawler retrieves a content set and processes it within 15 minutes; no problem here. In the next window (B), the crawler retrieves a bigger content set that takes 20 minutes to process instead of 15. Now a problem arises: because only one crawl can run at a time, the crawl scheduled at the 30-minute mark is skipped, and we have to wait for the next window, (C), at 45 minutes. So in this case there were 30 minutes between two incremental crawls instead of the scheduled 15. The bigger the content set the crawler has to handle, the greater the risk of missing scheduled crawl windows, and the less up to date the search index becomes.
SharePoint 2013
With continuous crawls you no longer have to schedule anything. Crawls run in parallel, with a crawl picking up changes from the SharePoint sites every 15 minutes. So when a crawl needs more than 15 minutes to handle its content set, another crawl simply starts alongside it, regardless of how long the first one runs.
A second crawl may even start only a few minutes after the first. This way, newly added or changed data is available through search almost immediately.
By default, a new crawl starts every 15 minutes, but the SharePoint administrator can change this interval using the PowerShell cmdlet Set-SPEnterpriseSearchCrawlContentSource.
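Continuous crawls themselves are enabled per content source. As a minimal PowerShell sketch, assuming a content source named "Local SharePoint sites" (the name is illustrative):

    # Get the Search service application and the target content source.
    $ssa = Get-SPEnterpriseSearchServiceApplication
    $cs = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Local SharePoint sites"
    # Switch the content source from scheduled crawls to continuous crawls.
    Set-SPEnterpriseSearchCrawlContentSource -Identity $cs -SearchApplication $ssa -EnableContinuousCrawls $true

Passing $false to -EnableContinuousCrawls switches the content source back to scheduled crawls.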
When should I use them?
Use this feature for your search-driven applications: whenever it matters that new results appear right after the content has been created.
Enabling continuous crawl has the following advantages:
- The search results are very fresh, because the SharePoint content is crawled frequently to keep the search index up to date. For example, in social computing, a new feed item can appear in search results quickly.
- The search administrator does not have to monitor changing or seasonal demands for content freshness. Continuous crawls automatically adapt as necessary to the change rate of the SharePoint content.
This new feature is less important when you just use a document library to store and exchange documents. In that case the ‘old’ incremental crawl should be enough.
Now, a few things to bear in mind. Continuous crawls can only be enabled for content sources of type SharePoint sites. Also, running multiple crawls in parallel can be resource intensive, so enable continuous crawls only when your SharePoint servers have adequate hardware to take the load.
Summary
With continuous crawling, data is available for search almost immediately. There are no more issues with incremental crawl schedules and large content sets causing missed crawl windows. Using continuous crawls together with, for example, the Content by Search web part gives you lots of opportunities to build awesome search-driven solutions.
Removing items from the search index
In SharePoint Server 2010, Search application administrators could remove items from the index through Central Administration. In SharePoint 2013, you can remove items from the index only by using the crawl logs.
Discovering structure and entities in unstructured content
You can configure the crawler to look for "entities" in unstructured content, such as in the body text or the title of a document. These entities can be words or phrases, such as product names. To specify which entities to look for in the content, you can create and deploy your own dictionaries. For locations, you can use the pre-populated location extraction dictionary that SharePoint 2013 provides.
You can store these entities in the index as separate managed properties and later use those properties, for example in search refiners, to help users filter their search results.
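As a sketch of how such a dictionary might be deployed with the SharePoint 2013 cmdlet for custom extraction dictionaries (the file path and the choice of the first custom word dictionary slot are assumptions for illustration):

    # Deploy a CSV word list as a custom entity extraction dictionary.
    $ssa = Get-SPEnterpriseSearchServiceApplication
    Import-SPEnterpriseSearchCustomExtractionDictionary -SearchApplication $ssa -FileName "\\fileshare\dictionaries\productnames.csv" -DictionaryName Microsoft.UserDictionaries.EntityExtraction.Custom.Word.1

After the next crawl, entities matched by the dictionary can be mapped to a managed property and, for example, surfaced as a refiner.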
To improve search relevance, the document parsing functionality in the content processing component analyzes the structure of documents in addition to their contents. Document parsers both extract useful metadata and remove redundant information. For example, parsers extract headings and subheadings from Word documents, and titles, dates and authors from slides in PowerPoint presentations. For HTML content, redundant generic information such as menus, headers and footers is recognized as such and removed from document summaries in the search results.
Health Monitoring Reports
In SharePoint 2013, you access health monitoring reports (previously called Search Administration Reports) on the left navigation pane under Crawling or Queries & Results for individual Search service applications. Enhancements to the reports make them easier to interpret.
Query health monitoring reports include:
- Query Latency Report
- Query Latency Report for Index Engine
- Query Latency Report for SharePoint Default IMS Flow
- Query Latency Report for Federation
- Query Latency Report for Local SharePoint Search Results
- Query Latency Report for People Search Results
Crawl health monitoring reports include:
- Crawl Summary Report
- Crawl Rate Report
- Crawl Processing Per Document Report
- Crawl Queue Load Report
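The reports themselves are viewed in Central Administration, but as a related sketch, the overall state of the search components can also be checked from PowerShell:

    # Show the health status of each component in the search topology.
    $ssa = Get-SPEnterpriseSearchServiceApplication
    Get-SPEnterpriseSearchStatus -SearchApplication $ssa -Text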
New search architecture
Search uses a new, component based architecture that provides maximum flexibility in defining the topology to support search requirements for performance, availability and fault-tolerance.
The crawl component crawls content sources to collect crawled properties and metadata from crawled items. It sends this information to the content processing component.
The content processing component transforms the crawled items so that they can be included in the search index. The component also maps crawled properties to managed properties. In addition, the content processing component interacts with the analytics processing component.
The analytics processing component analyzes the crawled items and how users interact with their search results. The information is used to improve the search relevance, and to create search reports and recommendations.
The index component receives the processed items from the content processing component and writes them to the search index. The component also handles incoming queries, retrieves information from the search index and sends the results back to the query processing component.
The query processing component analyzes incoming queries to help optimize precision, recall (which items are returned in the results) and ranking (the order of those items). The query is then sent to the index component, which returns a set of search results for the query. The results can then be further processed before they are presented to the user as the search results for their query.
The search administration component runs the required search processes and adds and initializes new instances of search components.
The crawl database contains detailed tracking and historical information about crawled items such as documents and links. The database holds information such as the last crawl time, the last crawl ID and the type of update during the last crawl (add, update, delete).
The link database stores the information extracted by the content processing component and click-through information.
The analytics reporting database stores the results of search and usage analysis, such as the number of times an item has been viewed.
The search administration database stores settings for the Search service application, such as the topology, crawl rules, query rules and the mappings between crawled and managed properties.
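Because the architecture is component based, a topology can be defined entirely from PowerShell. The following is a minimal sketch that places one instance of each component on a single server; the server name "AppServer01" and the single-server layout are illustrative assumptions:

    # Start the search service instance on the target server.
    $instance = Get-SPEnterpriseSearchServiceInstance -Identity "AppServer01"
    Start-SPEnterpriseSearchServiceInstance -Identity $instance
    # Clone the active topology so changes can be prepared and then activated.
    $ssa = Get-SPEnterpriseSearchServiceApplication
    $active = Get-SPEnterpriseSearchTopology -SearchApplication $ssa -Active
    $clone = New-SPEnterpriseSearchTopology -SearchApplication $ssa -Clone -SearchTopology $active
    # Add one of each search component to the cloned topology.
    New-SPEnterpriseSearchAdminComponent -SearchTopology $clone -SearchServiceInstance $instance
    New-SPEnterpriseSearchCrawlComponent -SearchTopology $clone -SearchServiceInstance $instance
    New-SPEnterpriseSearchContentProcessingComponent -SearchTopology $clone -SearchServiceInstance $instance
    New-SPEnterpriseSearchAnalyticsProcessingComponent -SearchTopology $clone -SearchServiceInstance $instance
    New-SPEnterpriseSearchIndexComponent -SearchTopology $clone -SearchServiceInstance $instance -IndexPartition 0
    New-SPEnterpriseSearchQueryProcessingComponent -SearchTopology $clone -SearchServiceInstance $instance
    # Activate the new topology.
    Set-SPEnterpriseSearchTopology -Identity $clone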
Multi-tenant hosting
In SharePoint 2013, the search system supports multi-tenant hosting.
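As a hedged illustration, a Search service application intended for multi-tenant hosting is created in partitioned mode; the application pool and service application names below are assumptions:

    # Create a partitioned (multi-tenant) Search service application.
    $appPool = Get-SPServiceApplicationPool -Identity "SearchAppPool"
    New-SPEnterpriseSearchServiceApplication -Name "Tenant Search" -ApplicationPool $appPool -Partitioned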