Chapter 6: Search (Professional SharePoint 2010 Development)

Summary: Read an overview of the new Enterprise Search product line, tour the new architecture, learn patterns that are common to developing extensions and applications, examine how to customize all aspects of search (for example, user experience, social search, federation, connectors, and content processing), and follow examples that get you started on custom search projects.

Applies to: Business Connectivity Services | SharePoint Foundation 2010 | SharePoint Server 2010 | Visual Studio

This article is an excerpt from Professional SharePoint 2010 Development by Tom Rizzo, Reza Alirezaei, Paul J. Swider, Scot Hillier, Jeff Fried, and Kenneth Schaefer from Wrox Press (ISBN 978-0-470-52942-3, copyright Publisher Name 2010, all rights reserved). No part of these chapters may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, electrostatic, mechanical, photocopying, recording, or otherwise—without the prior written permission of the publisher, except in the case of brief quotations embodied in critical articles or reviews.

Contents

  • Introduction

  • Search Options with SharePoint 2010

  • Search User Experience

  • Search Architecture and Topologies

  • Developing with Enterprise Search

  • Customizing the Search User Experience

  • Search Connectors and Searching LOB Systems

  • Working with Federation

  • Working with the Query OM

  • Social Search

  • Content Enhancement

  • Extending Search Using the Administrative OM

  • Summary: Customizing Search with SharePoint 2010

  • Additional Resources

  • About the Authors


Introduction

Microsoft has been in the Enterprise Search business for a long time. The last two years have seen an increased focus in this area, including the introduction of Search Server 2008 and the acquisition of FAST Search and Transfer. Search is becoming strategic to many businesses, and Microsoft's investments reflect this.

Enterprise Search delivers content for the benefit of the employees, customers, partners, or affiliates of a single company or organization. Companies, government agencies, and other organizations maintain huge amounts of information in electronic form, including spreadsheets, policy manuals, and web pages, just to name a few. Contemporary private datasets can now exceed the size of the entire Internet in the 1990s, running into petabytes or even exabytes of information. This content might be stored in file shares, websites, content management systems, or databases, but without the ability to find this corporate knowledge, managing even a small company would be difficult.

Enterprise Search applications are found throughout most enterprises in obvious places such as intranet search, and in less visible ways; for example, search-driven applications often do not look like "search." Search supports all these applications, and also complements the other workloads in SharePoint 2010 (Insights, Social, Composites, and the like) in powerful ways.

Learning to develop great applications, including search, will serve you and your organization very well. You can build more flexible, more powerful applications that bridge different information silos while providing a natural, simple user experience.

This chapter provides an introduction to developing with search in SharePoint 2010. First, it covers the options, capabilities, and architecture of search. A section on the most common search customizations gives you a sense of what kind of development you are likely to run into. Next, it runs you through different areas of search: social search, indexing connectors, federation, content processing, ranking and relevance, the UI, and administration. In each of these areas, this chapter provides a deeper look at the capabilities, discusses how a developer can work with them, and includes an example. Finally, the summary gives an overview of the power of search and offers some ways to combine it with other workloads in SharePoint 2010.

Search Options with SharePoint 2010

With the 2010 wave, Microsoft has added new Enterprise Search products and updated existing ones, bringing in many new capabilities. Some of these are brand new, some are evolutions of the SharePoint 2007 search capabilities, and some are capabilities brought from FAST. The result is a set of options that lets you solve any search problem, but the number of options can also be confusing.

Figure 1 shows the Enterprise Search products in the 2010 wave. There are many options; in fact, there are nine offerings for Enterprise Search. This is evidence of the emphasis Microsoft is putting on search, and also a byproduct of the ongoing integration of the FAST acquisition.

Figure 1. Enterprise Search products in the 2010 wave


This lineup might seem confusing at first, and the sheer number of options is a bit daunting. As you will see, there is some method to this madness. For most purposes, you will be considering only one or two of these options.

Looking at the lineup from different perspectives helps in understanding it. There are three main dimensions to consider:

  • Tier (labeled along the right side of Figure 1): Microsoft adopted a three-tier approach in 2008 when it introduced Search Server 2008 Express and acquired FAST. These tiers are entry level, infrastructure, and high end. Search Server 2010 Express and the search in SharePoint Foundation 2010 are entry level; SharePoint Server and Search Server 2010 make up the infrastructure tier; and any option labeled "FAST" is high end.

  • Integration (labeled along the left side of Figure 1): Search options integrated with SharePoint have features, such as social search, that are built on other parts of SharePoint. Standalone search options do not require SharePoint, but they lack these features.

  • Application (labeled across the top of Figure 1): Applications are divided into Internet applications and Productivity applications. For the most part, the distinction between search applications inside the firewall (Productivity) and outside the firewall (Internet) is a pure licensing distinction. Inside the firewall, products are licensed by server and by client access license (CAL). Outside the firewall, it is not possible to license clients, so products are licensed by server. The media, documentation, support, and architecture are the same across these applications (that is, horizontally across Figure 1). There are a few minor feature differences, which are called out in this chapter where relevant.

There is another perspective useful in understanding this lineup: codebase. The acquisition of FAST brought a large codebase of high-end search code, different from the SharePoint search codebase. As the integration of FAST proceeds, ultimately all Enterprise Search options will be derived from a single common codebase.

At the moment, there are three separate codebases from which Enterprise Search products are derived. The first is the SharePoint 2010 search codebase, which is an evolution from the MOSS 2007 search code. Search options derived from this codebase are in medium gray boxes, as shown in Figure 1. The second is the FAST standalone codebase, which is a continuation of the code from FAST ESP, the flagship product provided by FAST up to this time. Search options derived from this codebase are shown in light gray boxes in Figure 1. The third is the FAST integrated codebase, which is a new one resulting from reworking the ESP code, integrating it with the SharePoint search architecture, and adding new elements. Search options derived from this codebase are shown in dark gray boxes in Figure 1.

The codebase perspective is useful for developers because it provides a sense of what to expect with APIs and system behavior. The FAST integrated codebase uses the same APIs as the SharePoint search codebase, but extends those APIs to expose additional capabilities. The FAST standalone codebase uses different APIs. Note that search products from the FAST standalone codebase are in a special status, licensed through FAST as a subsidiary and under different support programs. This book does not cover products from the FAST standalone codebase or the APIs specific to them.

If you consider the search options across application areas as the same, and disregard the FAST standalone codebase, you are left with five options in the Enterprise Search lineup, rather than nine. Look at each of these options and see where you might use each one. This chapter also introduces some shorter names and acronyms for each option to make the discussion simpler.

SharePoint Foundation

Microsoft SharePoint Foundation (also called SharePoint Foundation, or SPF) is a free, downloadable platform that includes search capabilities. The search is basic: it is limited to content within SharePoint, and it offers no search scopes and no refinement. SPF is in the entry-level tier and is integrated with SharePoint.

If you are using SharePoint Foundation and care about search (which is likely, because you are reading this chapter!), you should forget about the built-in search capability and use one of the other options. Most likely this will be Search Server Express, because it is also free.

Search Server 2010 Express

Microsoft Search Server 2010 Express (also called Search Server Express or MSSX) is a free, downloadable standalone search offering. It is intended for tactical, small-scale search applications (such as departmental sites) that must be delivered with little or no cost and IT effort. Microsoft Search Server 2008 Express was a very popular product; Microsoft reports that there have been over 100,000 downloads. A lot has been added with the 2010 wave: better connectivity, refinement, improved relevance, and much more.

Search Server Express is an entry-level standalone product. It is limited to one server with up to 300,000 documents. It lacks many of the capabilities of SharePoint Server, such as taxonomy, or people and expertise search, not to mention the capabilities of FAST. It can, however, be a good enough option for many departments that require a straightforward site search.

If you have little or no budget and an immediate, simple, tactical search need, use Search Server Express. It is quick to deploy, easy to manage, and free. You can always move to one of the other options later.

Search Server 2010

Microsoft Search Server 2010 (also called Search Server or MSS) has the same functional capabilities as MSS Express, but at full scale: up to about 10 million items per server and 100 million items in a system using multiple servers. It isn't free, but the per-server license cost is low. MSS is a great way to scale up for applications that start with MSS Express and grow (as they often do).

MSS is an infrastructure-tier standalone product. Both MSS and MSS Express lack some search capabilities that are available in SharePoint Server 2010, such as taxonomy support, people and expertise search, social tagging, and social search (where search results improve because of social behavior), to name a few. And, of course, MSS does not have any of the other SharePoint Server capabilities (BI, workflow, and so on) that are often mixed together with search in applications.
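
As a rough illustration of the item limits quoted above for the two standalone products, the following Python sketch picks a product and a lower bound on the number of servers from an item count. The function name and constants are hypothetical and simply restate the chapter's figures; real sizing also depends on query load, redundancy, and content characteristics, which this sketch ignores.

```python
import math

# Limits as quoted in this chapter (approximate, not a sizing guide).
EXPRESS_MAX_ITEMS = 300_000        # Search Server 2010 Express: one server
ITEMS_PER_SERVER = 10_000_000      # Search Server 2010: about 10 million items per server
SYSTEM_MAX_ITEMS = 100_000_000     # Search Server 2010: about 100 million items per system


def size_standalone_search(item_count):
    """Return (product, minimum_servers) for a given number of items.

    Returns (None, None) when the item count exceeds the quoted system limit.
    """
    if item_count <= EXPRESS_MAX_ITEMS:
        return ("Search Server 2010 Express", 1)
    if item_count <= SYSTEM_MAX_ITEMS:
        # Round up: a partially filled server is still a server.
        return ("Search Server 2010", math.ceil(item_count / ITEMS_PER_SERVER))
    return (None, None)


small_site = size_standalone_search(250_000)
large_intranet = size_standalone_search(25_000_000)
```

Under these assumptions, a departmental site with 250,000 documents fits on Search Server Express, while an intranet with 25 million items needs Search Server 2010 on at least three servers.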

If you have no other applications for SharePoint Server, and need general intranet search or site search, MSS can be a good choice. But in most cases, it makes more sense to use SharePoint Server 2010.

SharePoint Server 2010

Microsoft SharePoint Server 2010 (also called SharePoint Server or SP) includes a complete intranet search solution that provides a robust search capability out of the box. It has many significant improvements over its predecessor, Microsoft Office SharePoint Server 2007 (also called MOSS 2007) search. New capabilities include refinement, people and expertise search with phonetic matching, social tagging, social search, query suggestions, editing directly in a browser, and many more. Connectivity is much broader and simpler, both for indexing and federation. SharePoint Server 2010 also has markedly improved its scale-out architecture, providing flexibility for different performance, scale, and availability needs.

SharePoint Server has three license variants in the 2010 wave, all with precisely the same search functionality. With all of them, Enterprise Search is a component or "workload," not a separate license. SharePoint Server 2010 is licensed in a typical Microsoft server/CAL model: each server needs a server license, and each user needs a client access license (CAL). For applications where CALs do not apply (typically outside the firewall in customer-facing sites), there are SharePoint Server 2010 for Internet Sites, Standard (FIS-S) and SharePoint Server 2010 for Internet Sites, Enterprise (FIS-E).

For the rest of this chapter, these licensing variants will be ignored, and we will refer to all of them as SharePoint Server 2010 or SP. All of them are infrastructure-tier, integrated offerings.

SharePoint Server 2010 is a good choice for general intranet search, people search, and site search applications. It is a fully functional search solution and should cover the scale and connectivity needs of most organizations. However, it is no longer the best search offered with SharePoint, given the integration of FAST in this wave.

FAST Search Server 2010 for SharePoint

Microsoft FAST Search Server 2010 for SharePoint (also called FAST Search for SharePoint or FS4SP) is a brand-new product. It is a high-end Enterprise Search product, providing an excellent search experience out of the box and the flexibility to customize search for very diverse needs at essentially unlimited scale. FS4SP is notably simpler to deploy and operate than other high-end search offerings. It provides high-end search, integrated with SharePoint.

The frameworks and tools used by IT professionals and developers are common across the SharePoint search codebase and the FAST integrated codebase. FAST Search for SharePoint builds on SharePoint Server, and integrates into the SharePoint 2010 architecture using some of the new elements, such as the enhanced connector framework and the federation framework. This means that FAST Search for SharePoint shares the same object models and APIs for connectors, queries, and system management. In addition, the administrative and front-end frameworks are common: basically the same management console and the same Search Center web parts.

Figure 2 shows how FAST adds on to SharePoint Server. In operation, both SharePoint servers and FAST Search for SharePoint servers are used. SharePoint servers handle crawling, accept and federate queries, and serve up people search. FAST Search for SharePoint servers handle all content processing and core search. The result is a combination of SharePoint search and FAST search technology in a hybrid form, plus several new elements and capabilities.

Figure 2. FAST adds on to SharePoint Server


FAST Search for SharePoint provides significant enhancements to SharePoint’s Enterprise Search capabilities. This means that there are capabilities and extensions to APIs that are specific to FAST Search for SharePoint. For example, there are extensions to the Query Object Model (OM), to accommodate the additional capabilities of FAST such as FAST Query Language (FQL). The biggest differences are in the functionality available: a visual and "contextual" search experience; advanced content processing, including metadata extraction; multiple relevance profiles and sorting options available to users; more control of the user experience; and extreme scale capabilities.

Note

FAST Licensing Variants: Just like SharePoint Server, FS4SP has licensing variants for internal and external use. FS4SP is licensed per server, requires Enterprise CALs (e-CALs) for each user, and needs SharePoint Server 2010 as a prerequisite. FAST Search Server 2010 for SharePoint Internet Sites (FS4SP-IS) is for situations where CALs don't apply, typically Internet-facing sites with various search applications. In these situations, SP-FIS-E (enterprise) is a prerequisite, and SP-FIS-E server licenses can be used for either SP-FIS-E servers or FS4SP-IS servers. FS4SP and FS4SP-IS have essentially the same search functionality with a few exceptions. We will largely ignore these variants for the remainder of this chapter and refer to them both as FAST Search for SharePoint or FS4SP.

FAST Search for SharePoint handles general intranet search, people search, and site search applications, providing more capability than SharePoint Server does, including the ability to give different groups using the same site different experiences via user context. FS4SP is particularly well suited for high-value search applications such as those described below.

Choosing the Right Search Product

Most often, organizations that implement a Microsoft Enterprise Search product choose between SharePoint Server 2010's search capabilities and FAST Search for SharePoint. SharePoint Server's search has improved significantly since 2007, so it is worth a close look, especially if you are already running SharePoint 2007's search. FAST Search for SharePoint has many capabilities beyond SharePoint Server 2010's search, but it also carries additional licensing costs. By understanding the differences in features and the applications that can be addressed by each feature, you can determine whether you need the additional capabilities offered by FAST.

With Enterprise Search inside the firewall, there are two distinct types of search applications:

  • General-purpose search applications increase employee efficiency by connecting "everyone to everything," that is, a broad set of people to a broad set of information. Intranet search is the most common example of this type of search application.

  • Special-purpose search applications help a specific set of people make the most of a specific set of information. Common examples include product support applications, research portals ranging from market research to competitive analysis, knowledge centers, and customer-oriented sales and service applications. This kind of application is found in many places, with variants for essentially every role in an enterprise. These applications typically are the highest-value search applications, as they are tailored to a specific task that is usually essential to the users they serve. They are also typically the most rewarding for developers.

SharePoint Server 2010's built-in search is targeted at general-purpose search applications, and can be tailored to provide specific intranet search experiences for different organizations and situations. FAST Search for SharePoint can be used for general-purpose search applications, and can be an "upgrade" from SharePoint search to provide superior search in those applications. However, it is designed with special-purpose search applications in mind. So applications you identify as fitting the "special-purpose" category should be addressed with FAST Search for SharePoint.

Because SP and FS4SP share the connector framework (with a few exceptions covered later), you won't find big differences in connectors or security, which traditionally are areas where search engines have differentiated themselves. Instead, you see big differences in content processing, user experience, and advanced query capabilities. Examples of capabilities specific to FAST Search for SharePoint are:

  • Content-processing pipeline

  • Metadata extraction

  • Structured data search

  • Deep refinement

  • Visual search

  • Advanced linguistics

  • Visual best bets

  • Development platform flexibility

  • Ease of creating custom search experiences

  • Extreme scale and performance

Common Platform and APIs

There are more aspects in common between SharePoint Server 2010 Search and FAST Search for SharePoint than there are differences. The frameworks and tools for use by IT pros and developers are kept as common as possible across the product line, given the additional capabilities in FAST Search Server 2010 for SharePoint. In particular, the object models for content, queries, and federation are all the same, and the web parts are largely common. All of the products described above provide a unified Query Object Model. The result is that if you develop a custom solution that uses the Query Object Model for SharePoint Foundation 2010, for example, it will continue to work if you upgrade to SharePoint Server 2010, or if you migrate your code to FAST Search Server 2010 for SharePoint.

Figure 3 shows the "stack" involved with Enterprise Search, from the client down to the search cores.

Figure 3. Enterprise Search stack


For the rest of this chapter, we will describe one set of capabilities and OMs and call out any specific differences within the product line where relevant.

Search User Experience

Information workers typically start searches from the Simple Search box or by browsing to a site based on a Search Center site template. Figure 4 shows the Simple Search box that is available by default on all site pages. By default, this search box issues queries that are scoped to the current site, because users often navigate to sites that they know contain the information they want before they perform a search.

Figure 4. Simple search box


Search Center

Figure 5 shows a search site based on the Enterprise Search Center template. Information workers use Search Center sites to search across all crawled and federated content.

Figure 5. Search site based on Enterprise Search Center template


By default, the Search Center includes two search tabs: All Sites and People. The Search Center includes an Advanced Search Box that provides links to the current user's search preferences and advanced search options.

Figure 6 shows the default view for performing an advanced search, with access to phrase search features, language filters, result type filters, and property filters.

Figure 6. Advanced search site


All of the search user interfaces are intuitive and easy to use, so information workers can start searches in a straightforward way. When an information worker performs a search, the results are displayed on a results page, as shown in Figure 7. The SharePoint Server 2010 Search core results page accepts simple, familiar keyword queries and presents results in a rich, easy-to-navigate layout.

Figure 7. Search results page


Visual Cues in Search Results with FAST

FAST Search for SharePoint adds visual cues into the search experience. These provide an engaging, useful, and efficient way for information workers to interact with search results. People find information faster when they recognize documents visually. A search result from FAST Search for SharePoint is shown in Figure 8.

Figure 8. Search results from FAST search


Thumbnails and Previews

Word documents and PowerPoint presentations can be recognized directly in search results. A thumbnail image is displayed along with the search results to provide rapid recognition of information and, thereby, faster information finding. This feature is part of the Search Core Results web part for FAST Search Server 2010 for SharePoint, and can be configured in that web part.

In addition to the thumbnail, a scrolling preview is available for PowerPoint documents, enabling an information worker to browse the actual slides in a presentation. People are often looking for a particular slide, or remember a presentation on the basis of a couple of slides. This preview helps them recognize what they’re looking for quickly, without having to open the document.

Visual Best Bets

SharePoint Server 2010 Search keywords can have definitions, synonyms, and Best Bets associated with them. A Best Bet is a particular document set up to appear whenever someone searches for a keyword. It appears along with a star icon and the definition of that keyword. FAST Search Server 2010 for SharePoint adds the ability for you to define Visual Best Bets for keywords. This Visual Best Bet might be anything that you can identify with a URI; for example, an image, video, or application. It provides a simple, powerful, and very effective way to guide people’s search experiences.

These visual search elements are unique to FAST Search Server 2010 for SharePoint and are not provided in SharePoint Server 2010 Search.

Exploration and Refinement

SharePoint Server 2010 also provides a new way to explore information via search refinements, as shown on the left of Figure 9. These refinements are displayed down the left side of the page in the core search results. They provide self-service drill-down capabilities for filtering the search results returned. The refinements are automatically determined by SharePoint Server 2010, using tags and metadata in the search results. Such refinements include searching by the type of content (web page, document, spreadsheet, presentation, and so on), location, author, last modified date, and metadata tags. Administrators can extend the refinement panel easily to include refinements based on any managed property.

Refinement with FAST Search Server 2010 for SharePoint is considerably more powerful than refinement in SharePoint Server 2010. SharePoint Server 2010 automatically generates shallow refinement for search results, which enables a user to apply additional filters based on the values returned by the query. Shallow refinement is based on the managed properties returned from the first 50 results of the original query.

In contrast, FAST Search Server 2010 for SharePoint offers the option of deep refinement, which is based on aggregation of managed property values within the entire result set. These are shown in Figure 9 both in out-of-the-box form and as custom visual refiners.

Figure 9. Refining search results


Using deep refinement, you can find the "needle in the haystack," such as a person who has written a document about a specific subject, even if this document would otherwise appear farther down the result list. Deep refinement can also display counts and lets the user see the number of results in each refinement category. You can also use the statistical data returned for numeric refinements in other types of analysis.
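
The difference between shallow and deep refinement can be illustrated with a small Python sketch. This is not how SharePoint or FAST computes refiners internally; the result list, the author values, and the refiner_counts helper are hypothetical, and only the 50-result window comes from the description above.

```python
from collections import Counter


def refiner_counts(results, prop, depth=None):
    """Count the values of one managed property.

    depth=50 mimics shallow refinement (first 50 results only);
    depth=None mimics deep refinement (the entire result set).
    """
    window = results if depth is None else results[:depth]
    return Counter(hit[prop] for hit in window if prop in hit)


# Hypothetical ranked result set of 60 documents.
# Only the lowest-ranked document was written by "Dana".
results = [
    {"title": f"Doc {i}", "author": "Alex" if i % 2 else "Sam"}
    for i in range(59)
]
results.append({"title": "Doc 59", "author": "Dana"})

shallow = refiner_counts(results, "author", depth=50)
deep = refiner_counts(results, "author")
```

In this sketch, "Dana" never appears as a shallow refiner because her document ranks below the first 50 results, whereas the deep refiner lists her with a count of one. This is the "needle in the haystack" case described above.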

Search is more than "find"; it is also "explore." In many situations, the quickest and most effective way to find or explore is through a dialogue with the machine: a "conversation" that enables the user to respond to results and steer to the answer or insight. The conversational search capabilities in FAST Search for SharePoint provide ways for information workers to interact with and refine their search results, so that they can quickly find the information they require.

Sort Results on Managed Properties

By default, SharePoint Server 2010 sorts results on each document's relevance rank. Information workers can re-sort the results by date modified, but these are the only two sort options in SharePoint Server 2010. With FAST Search Server 2010 for SharePoint, users can sort results on any managed properties, such as sorting by Author, Document Size, or Title. Relevance ranking profiles can also be surfaced as sorting criteria, which enables users to pick different relevance rankings as desired.

This sorting is considerably more powerful than sorting in SharePoint Server 2010 Search.

Similar Results

With FAST Search Server 2010 for SharePoint, results returned by a query include links to "Similar Results." When a user clicks on the link, the search is redefined and rerun to include documents that are similar to the result in question.

Result Collapsing

FAST Search Server 2010 for SharePoint provides a result collapsing capability which can be used for de-duplication and also for result roll-up. Documents that have the same value stored in a field in the index will be collapsed as one document in the search result. If that field is a managed property such as "author," all documents matching a given query with the same author can be rolled up in the result, and expanded by the user as desired. If that field is a checksum or other unique signature of the document's content, collapsing provides duplicate detection. This means that documents stored in multiple locations in a source system will be displayed only once during a search using the collapse search parameter. The collapsed results include links to "Duplicates." When a user clicks on the link, the search result displays all versions of that document. Similar results and result collapsing are unique to FAST Search Server 2010 for SharePoint and are not provided in SharePoint Server 2010 Search.
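
The two uses of collapsing described above, roll-up and de-duplication, differ only in the field that is collapsed on. The following Python sketch shows the idea under stated assumptions: the collapse function, the hit dictionaries, and the field names are hypothetical and do not represent the FAST collapse search parameter or its API.

```python
def collapse(results, field):
    """Collapse ranked results that share the same value in `field`.

    The highest-ranked hit for each value stays visible as "top"; later
    hits with the same value are attached to it as "collapsed". Hits
    without the field are never collapsed.
    """
    groups = []      # keeps the rank order of the first hit per value
    by_value = {}    # field value -> its group
    for hit in results:
        value = hit.get(field)
        group = by_value.get(value) if value is not None else None
        if group is None:
            group = {"top": hit, "collapsed": []}
            groups.append(group)
            if value is not None:
                by_value[value] = group
        else:
            group["collapsed"].append(hit)
    return groups


# Hypothetical hits: the first two are copies of the same file.
hits = [
    {"url": "http://a/spec.docx", "author": "Alex", "checksum": "c1"},
    {"url": "http://b/spec-copy.docx", "author": "Alex", "checksum": "c1"},
    {"url": "http://a/plan.docx", "author": "Alex", "checksum": "c2"},
    {"url": "http://c/notes.docx", "author": "Sam", "checksum": "c3"},
]

deduplicated = collapse(hits, "checksum")   # duplicate detection
by_author = collapse(hits, "author")        # roll-up by managed property
```

Collapsing on the checksum hides the copy behind the first document, which corresponds to the "Duplicates" link. Collapsing on the author rolls the three documents by Alex into one expandable entry.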

Contextual Search Capabilities

FAST Search Server 2010 for SharePoint enables you to associate Best Bets, Visual Best Bets, document promotions, document demotions, site promotions, and site demotions with defined user contexts in order to personalize the experience for information workers. You can use the FAST Search User Context link in the Site Collection Settings pages to define user contexts for these associations.

Relevancy Tuning by Document or Site Promotions

SharePoint Server 2010 enables you to identify varying levels of authoritative pages that help you tune relevancy ranking by site. FAST Search Server 2010 for SharePoint adds the ability for you to specify individual documents within a site for promotion or demotion and, furthermore, enables you to associate each promotion or demotion with user contexts.

Synonyms

SharePoint Server 2010 keywords can have one-way synonyms associated with them. FAST Search Server 2010 for SharePoint extends this concept by enabling you to implement both two-way and one-way synonyms. With a two-way synonym set of, for example, {auto car}, a search for "auto" would be translated into a search for "auto OR car," and a search for "car" would be translated into a search for "car OR auto." With a one-way synonym set of, for example, {car coupe}, a search for "car" would translate into a search for "car OR coupe," but a search for "coupe" would remain just "coupe."
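
The expansion rules in the examples above can be written down as a short Python sketch. The expand_query function and its parameters are hypothetical and only model the behavior described in this section for single-term queries; they are not part of the SharePoint or FAST keyword management API.

```python
def expand_query(term, two_way_sets=(), one_way=None):
    """Expand a single-term query into an OR query using synonyms.

    two_way_sets: sequences of terms that are synonyms of each other.
    one_way: dict mapping a term to the synonyms added only when that
             term is searched (never the reverse).
    """
    one_way = one_way or {}
    alternatives = [term]
    for synonym_set in two_way_sets:
        if term in synonym_set:
            alternatives += [t for t in synonym_set if t not in alternatives]
    alternatives += [t for t in one_way.get(term, []) if t not in alternatives]
    return " OR ".join(alternatives)


two_way = [("auto", "car")]
one_way = {"car": ["coupe"]}

auto_query = expand_query("auto", two_way_sets=two_way)
car_query = expand_query("car", two_way_sets=two_way)
car_one_way_query = expand_query("car", one_way=one_way)
coupe_query = expand_query("coupe", one_way=one_way)
```

With the two-way set, both "auto" and "car" expand to include the other term. With the one-way set, only "car" is expanded and "coupe" is left unchanged.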

People Search

SharePoint Server 2010 provides an address-book-style name lookup experience with name and expertise matching, making it easy to find people by name, title, expertise, and organizational structure. This includes phonetic name matching that will return names that sound similar to what the user has typed in a query. It will also return all variations of common names, including nicknames.

The refiners provided for the core search results are also provided with people search results; exploring results via name, title, and various fields in a user's profile enables quick browsing and selection of people. People search results also include real-time presence through Office Communications Server, making it easy to immediately connect with people once they are found through search.

Figure 10. People search results page


The people and expertise finding capabilities with SharePoint Server 2010 are a dramatic enhancement over MOSS 2007. They are remarkably innovative and effective, and tie in nicely to the social computing capabilities covered in Chapter 5. The exact same capabilities are available with FAST Search for SharePoint.

Search Architecture and Topologies

The search architecture has been significantly enhanced with SharePoint Server 2010. The new architecture provides fault-tolerance options and scaling well beyond the limits of MOSS 2007 search (to 100 million documents). Adding FAST provides even more flexibility and scale. Of course, these capabilities and flexibility add complexity. Understanding how search fits together architecturally will help you build applications that scale well and perform quickly.

SharePoint Search Key Components

Figure 11 provides an overview of the logical architecture for the Enterprise Search components in SharePoint Server 2010.

Figure 11. Architecture of the Enterprise Search components

As shown in Figure 11, there are four main components that deliver the Enterprise Search features of SharePoint Server 2010:

  • Crawler: This component invokes connectors that are capable of communicating with content sources. Because SharePoint Server 2010 can crawl different types of content sources (such as SharePoint sites, other websites, file shares, Lotus Notes databases, and data exposed by Business Connectivity Services), a specific connector is used to communicate with each type of source. The crawler then uses the connectors to connect to and traverse the content sources, according to crawl rules that an administrator can define. For example, the crawler uses the file connector to connect to file shares by using the FILE:// protocol, and then traverses the folder structure in that content source to retrieve file content and metadata. Similarly, the crawler uses the web connector to connect to external websites by using the HTTP:// or HTTPS:// protocols and then traverses the web pages in that content source by following hyperlinks to retrieve web page content and metadata. Connectors load specific IFilters to read the actual data contained in files. Refer to the "Connector Framework" section later in this chapter for more information about connectors.

  • Indexer: This component receives streams of data from the crawler and determines how to store that information in a physical, file-based index. For example, the indexer optimizes the storage space requirements for words that have already been indexed, manages word breaking and stemming in certain circumstances, removes noise words, and determines how to store data in specific index partitions if you have multiple query servers and partitioned indexes. Together with the crawler and its connectors, the indexing engine meets the business requirement of ensuring that enterprise data from multiple systems can be indexed. This includes collaborative data stored in SharePoint sites, files in file shares, and data in custom business solutions, such as customer relationship management (CRM) databases, enterprise resource planning (ERP) solutions, and so on.

  • Query Server: Indexed data that is generated by the indexing engine is propagated to query servers in the SharePoint farm, where it is stored in one or more index files. This process is known as "continuous propagation"; that is, while indexed data is being generated or updated during the crawl process, the changes are propagated to query servers, where they are applied to the index file (or files). As a result, the indexes on query servers lag behind the crawl by only a very short time: when new data has been indexed (or existing data in the index has been updated), those changes are applied to the index files on query servers in just a few seconds. A server that is performing the query server role responds to searches from users by searching its own index files, so it is important that latency be kept to a minimum. SharePoint Server 2010 ensures this automatically. The query server is responsible for retrieving results from the index in response to a query received via the Query Object Model. The query server is also responsible for the word breaking, noise-word removal, and stemming (if stemming is enabled) for the search terms provided by the Query Object Model.

  • Query Object Model: As mentioned earlier, searches are formed and issued to query servers by the Query Object Model. This is typically done in response to a user performing a search from the user interface in a SharePoint site, but it might also be in response to a search from a custom solution (hosted in or out of SharePoint Server 2010). Furthermore, the search might have been issued by custom code, for example, from a workflow or from a custom navigation component. In any case, the Query Object Model parses the search terms and issues the query to a query server in the SharePoint farm. The results of the query are returned from the query server to the Query Object Model, and the object model provides those results to the user interface components (or other components that might have issued the query).
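
The crawler's choice of connector by protocol, described in the first bullet above, can be pictured as a lookup keyed on the URL scheme of a content source's start address. The following is a minimal sketch under that assumption; the ConnectorSelector class, its table, and the connector labels are hypothetical and are not part of the SharePoint object model.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of scheme-based connector selection.
public static class ConnectorSelector
{
    // Illustrative scheme-to-connector table; a real farm registers many more.
    private static readonly Dictionary<string, string> ConnectorsByScheme =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "file", "file connector" },
            { "http", "web connector" },
            { "https", "web connector" }
        };

    // Returns the connector label for a start address, or null if none is registered.
    public static string SelectConnector(string startAddress)
    {
        Uri uri = new Uri(startAddress);
        string connector;
        return ConnectorsByScheme.TryGetValue(uri.Scheme, out connector)
            ? connector
            : null;
    }
}
```

A start address such as file://fileserver/share maps to the file connector, while an http:// or https:// address maps to the web connector.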

Figure 12 shows a process view of SharePoint Server Search. The Shared Service Application (SSA), new to SharePoint 2010, is used to provide a shareable and scalable service. A search SSA can work across multiple SharePoint farms, and is administered on a service level.

Figure 12. Process view of SharePoint Server search

Search Topologies, Scaling, and High Availability

SharePoint Server 2010 enables you to add multiple instances of each of the crawler, indexing, and query components. This level of flexibility means that you can scale your SharePoint farms. (Previous versions of SharePoint Server did not allow you to scale the indexing components.)

Enterprise Search in SharePoint Server 2010 is designed to provide sub-second query latencies for all queries, regardless of the size of your farm, and to remove bottlenecks that were present in previous versions of SharePoint Server. Unlike those versions, SharePoint Server 2010 lets you scale out every logical component in your search architecture.

As shown in Figure 13, the architecture provides scaling at multiple levels. You can add multiple crawlers to your farm to provide availability and to scale to achieve high performance for the indexing process. You can also add multiple query servers to provide availability and to scale to achieve high query performance. All components, including administration, can be fault tolerant and can take advantage of the mirroring capabilities of the underlying databases.

Figure 13. Scaling at multiple levels

The crawlers handle indexing as well. Each crawler can crawl a discrete set of content sources, so not all indexers need to index the entire corpus. This is a new capability for SharePoint Server 2010. Crawlers are now stateless, so that one can take over the activity of another if it fails, and they use the crawl database to coordinate the activity of multiple crawlers. Indexers no longer store full copies of the index; they simply propagate the indexes to query servers. Crawling and indexing are I/O and CPU intensive; adding more machines increases the crawl/index throughput linearly. Since content freshness is determined by crawl frequency, adding resources to crawling can provide fresher content, too.

When you add multiple query servers, you are really implementing index partitioning; each query server maintains a subset of the entire logical index and, therefore, does not need to query the entire index (which could be a very large file) for every query. The partitions are maintained automatically by SharePoint Server 2010, which uses a hash of each crawled document’s ID to determine in which partition a document belongs. The indexed data is then propagated to the appropriate query server.
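
Hash-based partitioning of this kind can be sketched as follows. The text does not say which hash function SharePoint Server 2010 uses, so this sketch substitutes FNV-1a purely as a stand-in; the IndexPartitioner class and its members are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical illustration of assigning documents to index partitions by hash.
public static class IndexPartitioner
{
    // FNV-1a (32-bit). An assumption standing in for the product's actual hash.
    public static uint Fnv1a(string text)
    {
        unchecked
        {
            uint hash = 2166136261u;
            foreach (byte b in Encoding.UTF8.GetBytes(text))
            {
                hash ^= b;
                hash *= 16777619u;
            }
            return hash;
        }
    }

    // Maps a document ID to a partition number in the range [0, partitionCount).
    public static int PartitionFor(string documentId, int partitionCount)
    {
        if (partitionCount <= 0)
        {
            throw new ArgumentOutOfRangeException("partitionCount");
        }

        return (int)(Fnv1a(documentId) % (uint)partitionCount);
    }
}
```

Because the partition depends only on the document ID and the partition count, the same document always lands in the same partition, and a reasonable hash spreads documents roughly evenly across query servers.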

Another new feature is that property databases are also propagated to query servers so that retrieving managed properties and security descriptors is much more efficient than in Microsoft Office SharePoint Server 2007.

High Availability and Resiliency

Each search component fulfills high-availability requirements by supporting mirroring. Figure 14 shows a scaled-out and mirrored architecture, sized for 100M documents. SQL Server mirroring is used to keep multiple instances synchronized across geographic boundaries. In this example, each of the six query processing servers serves results from a partition of the index and also acts as a failover for another partition. The two crawler servers provide throughput (multiple crawlers) as well as high availability; if either crawler server fails, the crawls continue.

Figure 14. Scaled out and mirrored architecture

As with any multi-tier system, understanding the level of performance and resiliency you need is the starting point. You can then engineer for as much capacity and safety as you need.

FAST Architecture and Topology

FAST Search for SharePoint shares many architectural features of SharePoint Server 2010 search. It uses the same basic layers (crawl, index, query) architecturally. It uses the same crawler and query handlers, and the same people and expertise search. It uses the same OMs and the same administrative framework.

However, there are some major differences. FAST Search for SharePoint adds on to SharePoint Server in a hybrid architecture (see Figure 2). This means that processing from multiple farms is used to make a single system. Understanding what processing happens in what farm can be confusing; remembering the hybrid approach, with common crawlers and Query OM but separate people and content search, is key to understanding the system configuration. Figure 15 shows a high-level mapping of processing to farms. Light grey represents the SharePoint farm, medium grey represents the FAST backend farm, and dark grey represents other systems, such as the System Center Operations Manager (SCOM).

Figure 15. High-level mapping of processing to farms

SharePoint 2010 provides shared service applications (SSAs) to serve common functions across multiple site collections and farms. SharePoint Server 2010 search uses one SSA (see Figure 12). FAST Search for SharePoint uses two SSAs: the FAST Query SSA and the FAST Content SSA. This is a result of the hybrid architecture (shown in Figure 2); SharePoint servers provide people search and FAST servers provide content search. Both SSAs run on SharePoint farms and are administered from the SharePoint 2010 central administration console.

The FAST Query SSA handles all queries and also serves people search. If the queries are for content search, it routes them to a FAST Query Service (which resides on a FAST farm). Routing uses the default service provider property, or overrides this if you explicitly set a provider on the query request. The FAST Query SSA also handles crawling for people search content.

The FAST Content SSA (also called the FAST Connector SSA) handles all the content crawling that goes through the SharePoint connectors or connector framework. It feeds all content as crawled properties through to the FAST farm (specifically a FAST content distributor), using extended connector properties. The FAST Content SSA includes indexing connectors that can retrieve content from any source, including SharePoint farms, internal/external web servers, Exchange public folders, line-of-business data, and file shares.

The FAST farm (also called the FAST backend) includes a Query Service, document processors that provide advanced content processing, and FAST-specific indexing connectors used for advanced content retrieval. Configuration of the additional indexing connectors is performed via XML files and through Windows PowerShell cmdlets or command-line operations, and is not visible via SharePoint Central Administration.

Figure 16. Overview of SSAs in the search architecture

The use of multiple SSAs to provide one FAST Search for SharePoint system is probably the most awkward aspect of FAST Search for SharePoint and the area of the most confusion. In practice, it is fairly straightforward, but you need to understand the hybrid architecture and keep it in mind when you are architecting or administering a system. As a developer, you also have to remember this when you are using the Administrative OM.

Scale-Out with FAST

FAST Search for SharePoint is built on a highly modular architecture where the services can be scaled individually to achieve the desired performance. The architecture of FAST Search for SharePoint uses a row and column approach for core system scaling, as shown in Figure 17.

Figure 17. FAST Search for SharePoint core system scaling

This architecture provides both extreme scale and fault tolerance with respect to:

  • Amount of indexed content: Each column handles a partition of the index, which is kept as a file on the file system (unlike SharePoint Server search index partitions, which are held in a database). By adding columns, the system can scale linearly to billions of documents.

  • Query load: Each row handles a set of queries; multiple rows provide both fault tolerance and capacity. An extra row provides full fault tolerance, so if an application required four rows for query handling, a fifth row would provide fault tolerance. (For most inside-the-firewall implementations, a single row provides plenty of query capacity.)

  • Freshness (indexing latency): FAST Search for SharePoint enables you to optimize for low latency from the moment a document is changed in the source repository to the moment it is searchable. This can be done by proper dimensioning of the crawling, item processing, and indexing to fulfill your requirements. These three parts of the system can be scaled independently through the modular architecture.
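
The row and column arithmetic in the first two bullets can be sketched as below. This is a back-of-the-envelope illustration only: the FastTopology class is hypothetical, and the per-column document capacity is an input you would have to supply from your own sizing work, because the text does not give one.

```csharp
using System;

// Hypothetical sizing helper for a row-column search grid; not a FAST API.
public static class FastTopology
{
    // Columns needed to hold the corpus, given an assumed capacity per column.
    public static int Columns(long documentCount, long documentsPerColumn)
    {
        if (documentsPerColumn <= 0)
        {
            throw new ArgumentOutOfRangeException("documentsPerColumn");
        }

        // Integer ceiling division.
        return (int)((documentCount + documentsPerColumn - 1) / documentsPerColumn);
    }

    // Rows needed for query load, plus one extra row when fault tolerance is required.
    public static int Rows(int rowsNeededForQueryLoad, bool faultTolerant)
    {
        return rowsNeededForQueryLoad + (faultTolerant ? 1 : 0);
    }

    // Total cells in the row-column grid.
    public static int GridCells(int columns, int rows)
    {
        return columns * rows;
    }
}
```

For the example in the text, an application that needs four rows for query handling ends up with five rows once fault tolerance is added.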

Figure 18 shows an example of a FAST Search for SharePoint topology, with full fault tolerance, sized for roughly sixty million documents.

Figure 18. Example FAST Search for SharePoint topology

This example includes both the SharePoint Server farm and the FAST Search backend farm. Because the connector framework is the same, crawling scale out and redundancy are the same as with SharePoint Server 2010 Search unless FAST-specific connectors are in use. The query-mirroring approach is the same as with SharePoint Server Search, except that content queries are processed very lightly before handing off to FAST, so query capacity per machine or VM is much higher for the SharePoint servers. The center layer is a farm of FAST Search servers, in a row-column architecture, which provides both scaling and fault tolerance.

How Architecture Meets Applications

Capacity planning, scaling, and sizing are usually the domain of the IT pro; as a developer, you need only be aware that the architecture supports a much broader range of performance and availability than MOSS 2007. You can tackle the largest, most demanding applications without worrying that your application will not be available at extreme scale.

Architecture is also important for applications that control configuration and performance. You might want to set up a specific recommended configuration or implement self-adjusting performance based on the current topology, load, and performance. The architecture supports adding new processing on the fly; in fact, the central administration console makes it easy to do so. This means that your applications can scale broadly, ensure good performance, and meet a broad range of market needs.

Developing with Enterprise Search

Developing search-powered applications has been a difficult task. Even though search is simple on the outside, it is complicated on the inside. With SharePoint 2010, developers have a development platform that is much more powerful and simpler to work with than MOSS 2007. That fact extends to search-based applications as well. Through a combination of improvements to the ways in which developers can collect data from repositories, query that data from the search index, and display the results of those queries, SharePoint Server 2010 offers a variety of possibilities for more powerful and flexible search applications that access data from a wide array of locations and repositories.

There are many areas where development has become simpler; for example, where you can cover with configuration what you used to do with code, or where you can do more with search. The new connector framework provides a flexible standard for connecting to data repositories through managed code. This reduces the amount of time and work required to build and maintain code that connects to various content sources. Enhanced keyword query syntax makes it easier to build complex queries by using standard logical operators, and the newly public Federated Search runtime object model provides a standard way of invoking those queries across all relevant search locations and repositories. The changes enable a large number of more complex interactions among Search web parts and applications, and ultimately a richer set of tools for building search result pages and search-driven features.

Range of Customization

Customization of search falls into three main categories, as shown in Figure 19:

  • Configure: Using configuration parameters alone, you can set up a tailored search system. Usually, you are working with web part configuration, XML, and PowerShell. Most of the operations are similar to what IT pros use in administering search, but packaged ahead of time by you as a developer.

  • Extend: Using the SharePoint Designer, XSLT, and other "light" development, you can create vertical and role-specific search applications. Tooling built into SPD lets you build new UIs and new connectors without code.

  • Create: Search can do amazing things in countless scenarios when controlled and integrated using custom code. Visual Studio 2010 has tooling built in, which makes developing applications with SharePoint much easier. In many of these scenarios, search is one of many components in the overall application.

Figure 19. Search customization categories

There are no hard rules here. General-purpose search applications, such as intranet search, can benefit from custom code and might be highly customized in some situations, even though intranet search works with no customization at all. However, most customization tends to be done on special-purpose applications with a well-identified set of users and a specific set of tasks they are trying to accomplish. Usually, these are the most valuable applications as well. Customization is well worth it for these cases.

Top Customization Scenarios

Although there are no hard rules, there are common patterns found when customizing Enterprise Search. The most common customization scenarios are:

  • Modify the end user experience: To create a specific experience and/or surface specific information. Examples: add new refinement category, show results from federated location, modify the look and feel of the OOB end user experience, enable sorting by custom metadata, add a visual Best Bet for upcoming sales event, configure different rankings for the human resources and engineering departments.

  • Create a new vertical search application: For a specific industry or role. Examples: reaching and indexing specific new content, designing a custom search experience, adding Audio/Video/Image search.

  • Create new visual elements: Add to the standard search. Examples: show location refinement on charts/maps, show tags in a tag cloud, enable "export results to a spreadsheet," summarize financial information from customers in graphs.

  • Query and Result pipeline plug-ins: Used to process questions and answers in more sophisticated ways. Example: create a new "single view of the customer" application that includes customer contact details, customer project details, customer correspondence, internal experts, and customer-related documents.

  • Query and Indexing shims: Add terms and custom information to the search experience. Examples: expand query terms based on synonyms defined in the term store, augment customer results with project information, show popular people in line with search results, or show people results from other sources. Both the Query OM and the connector framework provide a way to write "shims," which are simple extensions of the .NET assembly where a developer can easily add custom data sources and/or do data mash-ups.

  • Create new search-driven sites and applications: Create customized content exploration experiences. Examples: show email results from personal mailbox on Exchange Server through Exchange Web Services (EWS), index content from custom repositories like Siebel, create content-processing plug-ins to generate new metadata.

Search-Driven Applications

Search is generally not well understood or fully used by developers who are building significant applications. SharePoint 2010 will, hopefully, change all that. By making it easier to own and use high-end search capabilities, and by including tooling and hooks specifically for application development, Microsoft has taken a big step forward in helping developers do more with search.

Figure 20 lists some examples of search-driven applications, and shows a screenshot of one of them. These are applications like any other, except that they take advantage of search technology, in addition to other elements of SharePoint, to create flexible and powerful user experiences.

Figure 20. Search can drive an application

The rest of this chapter covers different aspects of search with SharePoint 2010, highlighting how you can customize them and how you can include them in search-driven applications.

Customizing the Search User Experience

While the out-of-the-box user interface is very intuitive and useful for information workers, power users can create their own search experiences. SharePoint Server 2010 includes many search-related web parts for power users to create customized search experiences, including Best Bets, refinement panel extensions, featured content, and predefined queries.

Figure 21. Standard Search web parts

IT pros or developers can configure the built-in search web parts to tailor the search experience. As a developer, you can also extend the built-in web parts to change their behavior on search results pages. Instead of building new web parts, you can build onto the functionality of existing ones.

In addition, query logging is now available from customized search web parts, and from any use of the Query object to query the Search Service.

Example: New Core Results Web Part

Let's walk through the creation of a new search web part in Visual Studio 2010. (The full code is included with Code Project 6-P-1, and is courtesy of Steve Peschka.) This web part inherits from the CoreResultsWebPart class and displays data from a custom source. The standard CoreResultsWebPart class includes a constructor and then two methods that we will modify in this example.

The first step is to create a new WebPart class. Create a new web part project that inherits from the CoreResultsWebPart class. Override CreateChildControls to add any controls necessary for your interface, and then override CreateDataSource. This is where you get access to the "guts" of the query. In the override, you create an instance of a custom datasource class that you build.

using Microsoft.Office.Server.Search.WebControls;

class MSDNSample : CoreResultsWebPart
{
    public MSDNSample()
    {
        //default constructor
    }

    protected override void CreateChildControls()
    {
        base.CreateChildControls();

        //add any additional controls needed for your UI here
    }

    protected override void CreateDataSource()
    {
        //base.CreateDataSource() is not called; use the custom datasource instead
        this.DataSource = new MyCoreResultsDataSource(this);
    }
}

The second step is to create a new CoreResultsDatasource class. In the override for CreateDataSource, set the DataSource property to a new class that inherits from CoreResultsDatasource. In the constructor of that class, create an instance of a custom datasource view class that you will build. No other overrides are necessary.

public class MyCoreResultsDataSource : CoreResultsDatasource
{
    public MyCoreResultsDataSource(CoreResultsWebPart ParentWebpart)
        : base(ParentWebpart)
    {
        //to reference the properties or methods of the web part
        //use the ParentWebpart parameter

        //create the View that will be used with this datasource
        this.View = new MyCoreResultsDataSourceView(this, "MyCoreResults");
    }
}

The third step is to create a new CoreResultsDatasourceView class. Set the View property for your CoreResultsDatasource to a new class that inherits from CoreResultsDatasourceView. In the CoreResultsDatasourceView constructor, get a reference to the CoreResultsDatasource so that you can refer back to the web part. Then, set the QueryManager property to the shared query manager used in the page.

public class MyCoreResultsDataSourceView : CoreResultsDatasourceView
{
    public MyCoreResultsDataSourceView(SearchResultsBaseDatasource DataSourceOwner, string ViewName)
        : base(DataSourceOwner, ViewName)
    {
        //make sure we have a value for the datasource
        if (DataSourceOwner == null)
        {
            throw new ArgumentNullException("DataSourceOwner");
        }

        //get a typed reference to our datasource
        MyCoreResultsDataSource ds = this.DataSourceOwner as MyCoreResultsDataSource;

        //configure the query manager for this View
        this.QueryManager = SharedQueryManager.GetInstance(ds.ParentWebpart.Page).QueryManager;
    }
}

You now have a functional custom web part displaying data from your custom source. In the next example, we take things one step further to provide some custom query processing.

Example: Adding Sorting to Your New Web Part

The CoreResultsDatasourceView class lets you modify virtually any aspect of the query. The primary way to do that is in an override of AddSortOrder. The override receives a SharePointSearchRuntime instance, which includes KeywordQueryObject, Location, and RefinementManager.

The following code example adds sorting by overriding AddSortOrder. (The full code is included with Code Project 6-P-1, courtesy of Steve Peschka.)

public override void AddSortOrder(SharePointSearchRuntime runtime)
{
    #region Ensure Runtime
    //make sure our runtime has been properly instantiated
    if (runtime.KeywordQueryObject == null)
    {
        return;
    }
    #endregion

    //remove any other sorted fields we might have had
    runtime.KeywordQueryObject.SortList.Clear();

    //get the datasource so we can get to the web part
    //and retrieve the sort fields the user selected
    SearchResultsPart wp = this.DataSourceOwner.ParentWebpart as SearchResultsPart;
    string sortField = wp.SortFields;

    //check to see if any sort fields have been provided
    if (!string.IsNullOrEmpty(sortField))
    {
        //if posting back, then use the value from the sort drop-down
        if (wp.Page.IsPostBack)
        {
            //get the sort direction that was selected
            SortDirection dir =
                (wp.Page.Request.Form[SearchResultsPart.mFormSortDirection] == "ASC" ?
                SortDirection.Ascending : SortDirection.Descending);

            //configure the sort list with sort field and direction
            runtime.KeywordQueryObject.SortList.Add(
                wp.Page.Request.Form[SearchResultsPart.mFormSortField], dir);
        }
        else
        {
            //split the value out from its delimiter and
            //take the first item in descending order
            string[] values = sortField.Split(";".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
            runtime.KeywordQueryObject.SortList.Add(values[0], SortDirection.Descending);
        }
    }
    else
    {
        //no sort fields provided so use the default sort order
        base.AddSortOrder(runtime);
    }
}

This scenario works through KeywordQueryObject, which provides access to key query properties like the following:

  • EnableFQL

  • EnableNicknames

  • EnablePhonetic

  • EnableStemming

  • Filter

  • QueryInfo

  • QueryText

  • Refiners

  • RowLimit

  • SearchTerms

  • SelectProperties

  • SortList

  • StartRow

  • SummaryLength

  • TrimDuplicates

  • ...and many more

To change the sort order in your web part, first remove the default sort order. Next, get a reference to the web part, which exposes the configured sort fields through its SortFields property. If the page has been posted back, use the sort field and direction the user selected. Otherwise, use the first configured sort field, in descending order. Finally, add the sort field to the SortList property.

To allow sorting, you also must provide fields on which to sort. Ordering can be done with DateTime fields, Numeric fields, or Text fields where:

HasMultipleValues = false, IsInDocProps = true, and MaxCharactersInPropertyStoreIndex > 0.

You can limit the user to only selected fields by creating a custom web part property editor. This uses the same process as in SharePoint 2007: inherit from EditorPart and implement IWebEditable. The custom version of EditorPart in this example web part uses a standard LINQ query against the search schema to find properties.
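
A LINQ filter of this kind might look like the following sketch. It does not call the real search schema API: SchemaProperty and FieldKind are hypothetical stand-ins that only borrow the property names from the rule above. It also assumes one reading of that rule, namely that DateTime and Numeric fields always qualify and that the three conditions apply to Text fields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for schema entries; not the SharePoint schema classes.
public enum FieldKind { DateTime, Numeric, Text, Other }

public class SchemaProperty
{
    public string Name { get; set; }
    public FieldKind Kind { get; set; }
    public bool HasMultipleValues { get; set; }
    public bool IsInDocProps { get; set; }
    public int MaxCharactersInPropertyStoreIndex { get; set; }
}

public static class SortableProperties
{
    // Returns the names of properties that can be offered as sort fields.
    public static List<string> Find(IEnumerable<SchemaProperty> properties)
    {
        var query =
            from p in properties
            where p.Kind == FieldKind.DateTime
               || p.Kind == FieldKind.Numeric
               || (p.Kind == FieldKind.Text
                   && !p.HasMultipleValues
                   && p.IsInDocProps
                   && p.MaxCharactersInPropertyStoreIndex > 0)
            orderby p.Name
            select p.Name;

        return query.ToList();
    }
}
```

The resulting list could then populate the drop-down in the custom EditorPart so that users can pick only fields that actually support sorting.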

Web Parts with FAST

SharePoint search and FAST Search for SharePoint share the same UI framework. When you install FAST Search for SharePoint, the same Search Centers and Small Search Box web parts apply; the main Result web part and Refiner web part are replaced with FAST-specific versions, and a Search Visual Best Bets web part is added. Otherwise, the web parts (like the Related Queries web part or Federated Results web part) remain the same.

Because of the added capabilities of FAST, there are some additional configuration options. For example, the core results web part allows for configuration of thumbnails and scrolling previews; that is, whether to show them or not, how many to render, and so forth. The search Action Links web part provides configuration of the sorting pull-down (which can also be used to expose multiple ranking profiles to the user). The Refinement web part has additional options, and counts are returned with refiners (since they are deep refiners over the whole result set).

The different web parts provided with FAST Search for SharePoint and the additional configuration options are fairly self-evident when you look at the web parts and their documentation. Since most web parts are now public with SharePoint 2010, you can look at them directly and see the available configuration options within Visual Studio.

Search Connectors and Searching LOB Systems

Acquiring content is essential for search; if it is not crawled, you cannot find it! Typical enterprises have hundreds of repositories of dozens of different types. Bridging content silos in an intuitive UI is one of the primary values of search applications. SharePoint 2010 supports this through a set of pre-created connectors, plus a framework and set of tools that make it much easier to create and administer connectivity to whatever source you like. There is already a rich set of partner-built connectors to choose from, and as a developer, you can easily leverage these or add to them.

SharePoint Server 2010 will support existing protocol handlers (custom interfaces written in unmanaged C++ code) used with SharePoint Portal Server 2003 and MOSS 2007. However, indexing connectors are now the primary way to create interfaces to data repositories. The Connector Framework uses .NET assemblies, and supports the Business Connectivity Services (BCS) declarative methodology for creating and expressing connections. It also enables connector authoring by means of managed code. This increased flexibility, with enhanced APIs and a seamless end-to-end experience for creating, deploying, and managing connectors, makes the job of collecting and indexing data considerably easier.

A number of productized connectors included with SharePoint Server 2010 provide built-in access to some of the most popular types of data repositories (including SharePoint sites, websites, file shares, Exchange public folders, Documentum instances, and Lotus Notes databases). The same connectors can be configured to work with a wide range of custom databases and web services (via BCS). For complex repositories, custom code lets you access line-of-business data and make it searchable.

Search leverages Business Connectivity Services (BCS) heavily in this wave. (See Chapter 11 for more information about BCS.) BCS is a set of services and features that provide a way to connect SharePoint solutions to sources of external data and to define External Content Types that are based on that external data. External Content Types allow the presentation of and interaction with external data in SharePoint lists (known as external lists), web parts, Microsoft Outlook 2010, Microsoft SharePoint Workspace 2010, and Microsoft Word 2010 clients. External systems that Microsoft Business Connectivity Services can connect to include SQL Server databases, SAP applications, web services (including Windows Communication Foundation web services), custom applications, and websites based on SharePoint. By using Microsoft Business Connectivity Services, you can design and build solutions that extend SharePoint collaboration capabilities and the Office user experience to include external business data and the processes that are associated with that data.

Microsoft Business Connectivity Services solutions use a set of standardized interfaces to provide access to business data. As a result, solution developers do not have to learn programming practices that apply to a specific system or adapter for each external data source. Microsoft Business Connectivity Services also provides the runtime environment in which solutions that include external data are loaded, integrated, and executed in supported Office client applications and on the web server. Enterprise Search uses these same practices and framework, and connectors can surface information in SharePoint that is synchronized with the external line-of-business system, including writing back any changes. Search connectors can use other BCS features, such as external lists.

New Connector Framework Features

The connector framework, shown in Figure 22, provides improvements over the protocol handlers in previous versions of SharePoint Server. For example, connectors can now crawl attachments, as well as the content, in email messages. Also, item-level security descriptors can now be retrieved for external data exposed by Business Connectivity Services. Furthermore, when crawling a Business Connectivity Services entity, additional entities can be crawled via its entity relationships. Connectors also perform better than previous versions of protocol handlers, by implementing concepts such as inline caching and batching.

Figure 22. Connector framework

Connectors support richer crawl options than the protocol handlers in previous versions of SharePoint Server. For example, they support the full crawl mode that was implemented in previous versions, and they support timestamp-based incremental crawls. In addition, they support change log crawls, which can remove items that have been deleted since the last crawl.
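The difference between the three crawl modes can be sketched in a few lines. The following is a minimal, language-neutral illustration written in Python, not connector framework code: the Item type, the in-memory index, and the change-log tuples are all assumptions made for the example. It shows why only a change log crawl can remove deleted items: a timestamp-based crawl never sees an item that no longer exists.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    modified: int  # last-modified timestamp, an integer for simplicity


def full_crawl(repository):
    """Re-read every item, whether or not it changed."""
    return {item.item_id: item for item in repository}


def incremental_crawl(index, repository, last_crawl_time):
    """Pick up only items modified after the previous crawl.

    Deletions are invisible here: a deleted item simply never shows up,
    so its stale entry stays in the index.
    """
    for item in repository:
        if item.modified > last_crawl_time:
            index[item.item_id] = item
    return index


def change_log_crawl(index, change_log):
    """Replay (operation, item) entries, including explicit deletions."""
    for operation, item in change_log:
        if operation == "delete":
            index.pop(item.item_id, None)
        else:  # "add" or "update"
            index[item.item_id] = item
    return index


# Hypothetical repository: "b" is deleted and "c" is added after the first crawl.
index = full_crawl([Item("a", 1), Item("b", 2)])
after_incremental = incremental_crawl(dict(index), [Item("a", 1), Item("c", 5)], 2)
after_change_log = change_log_crawl(
    dict(index), [("delete", Item("b", 2)), ("add", Item("c", 5))]
)
```

In this sketch the incremental crawl leaves the stale entry "b" in the index, while the change log crawl removes it.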

Creating Indexing Connectors

In previous versions of SharePoint Server, it was very difficult to create protocol handlers for new types of external systems. Protocol handlers were required to be coded in unmanaged C++ code and typically took a long time to test and stabilize.

With SharePoint Server 2010, you have many more options for crawling external systems. You can choose to do the following:

  • Use SharePoint Designer 2010 to create external content types and entities for databases or web services and then simply crawl those entities.

  • Use Visual Studio 2010 to create external content types and entities for databases or web services, and then simply crawl those entities.

  • Use Visual Studio 2010 to create .NET types for Business Connectivity Services (typically for backend systems that implement dynamic data models, such as document management systems), and then use either SharePoint Designer 2010 or Visual Studio 2010 to create external content types and entities for the .NET type.

Note

You can still create protocol handlers (as in previous versions of SharePoint Server) if you want to. However, it is better to use the new connector framework instead.

Model Files

Every indexing connector needs two things: a model file (also called an application definition file) that expresses connection information and the structure of the backend, and a BCS connector (also called a "shim") that contains the code to execute when accessing the backend. The model file tells the search indexer what information from the repository to index, and it identifies any custom managed code that developers determine they must write (after consulting with their IT and database architects). The connector might require, for example, special methods for authenticating to a given repository and other methods for periodically picking up changes to the repository.

You can use OOB shims with the model file or write a custom shim. Either way, the deployment and connector management framework makes it easy. Crawling content is no longer an obscure art. SharePoint 2010 also has great tooling support for connectors.

Tooling in SPD and VS2010

Both SharePoint Designer 2010 and Visual Studio 2010 have tooling for authoring connectors. You can use SharePoint Designer to create model files for out-of-box BCS connectors (such as a database), to import and export model files between BCS service applications, and to enable other SharePoint workloads, such as External Lists. Use Visual Studio 2010 to implement methods for the .NET shim or to write custom shims for your repository.

When you create a model file through SharePoint Designer, it is automatically configured for full-fidelity, high-performance crawling. This takes advantage of features of the new connector framework, including inline caching (which makes the crawler a better citizen toward the backend system) and timestamp-based incremental crawling. You can specify the search click-through URL to go to the profile page, so that content includes writeback, integrated security, and other benefits of BCS. Crawl management is automatically enabled through the Search Management console.

Figure 23 shows the relationships between the elements that are most commonly changed when creating a new connector using SharePoint Designer and OOB shims.

Figure 23. Relationships between elements

Figure 24. Configuration panel in SharePoint Designer

Writing Custom Connectors

Now let's walk through creating an example of a connector with a custom shim. Assume that you have a product catalog in an external system and want to make it searchable. Code Project 6-P-2 shows the catalog schema and walks through this example step by step.

There are two types of custom connectors: a managed .NET Assembly BCS connector and a custom BCS connector. In this case, we use the .NET BCS connector approach. We need to create only two things: the URL parsing classes, and a model file.

The code is written with .NET classes and compiled into a Dynamic Link Library (DLL). Each entity maps to a class in the DLL, and each BDC operation in that entity maps to a method inside that class. Once the code is done and the model file is uploaded, you can register the new connector either by adding the DLLs to the global assembly cache (GAC) or by using PowerShell cmdlets to register the BCS connector and model file. Configuration of the connector is then available through the standard UI; the content sources, crawl rules, managed properties, crawl schedule, and crawl logs work as they do for any other repository.
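The entity-to-class and operation-to-method mapping can be illustrated with a small sketch. It is written in Python rather than .NET purely to keep it short and self-contained, and everything in it is hypothetical: the Product class, its catalog data, the MODEL dictionary that stands in for the model file, and the execute() dispatcher that stands in for the crawler. Only the operation type names Finder (enumerate items) and SpecificFinder (read one item) come from BCS.

```python
class Product:
    """Hypothetical entity class: one method per operation in the model."""

    _CATALOG = {
        1: {"ProductId": 1, "Name": "Road Bike"},
        2: {"ProductId": 2, "Name": "Helmet"},
    }

    @classmethod
    def read_list(cls):
        """Finder-style operation: enumerate every item to crawl."""
        return list(cls._CATALOG.values())

    @classmethod
    def read_item(cls, product_id):
        """SpecificFinder-style operation: fetch one item by its identifier."""
        return cls._CATALOG[product_id]


# Stand-in for the model file: entity name -> class, operation -> method name.
MODEL = {
    "Product": {
        "class": Product,
        "operations": {"Finder": "read_list", "SpecificFinder": "read_item"},
    }
}


def execute(entity_name, operation, *args):
    """Resolve an operation through the model and call the mapped method."""
    entity = MODEL[entity_name]
    method = getattr(entity["class"], entity["operations"][operation])
    return method(*args)


all_products = execute("Product", "Finder")
helmet = execute("Product", "SpecificFinder", 2)
```

The point of the indirection is that the crawler only needs the model to find the right code; the backend-specific logic stays inside the entity class.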

If you build a custom BCS connector, you implement the ISystemUtility interface for connectivity. For URL mapping, you implement the ILobUri and INamingContainer interfaces. Compile the code into a DLL and add the DLL to the GAC, author a model file for the custom backend, register the connector by using PowerShell, and you are done! The SharePoint crawler invokes the Execute() method of your ISystemUtility implementation (the custom shim), so you can put your special magic into this method.

A Few More Tips

The new connector framework takes care of a lot of things for you. There are a couple more new capabilities you might want to take advantage of:

  • To create item-level security, implement the GetSecurityDescriptor() method. For each entity, add a method instance property that names the field holding the security descriptor.

    <Property Name="WindowsSecurityDescriptorField" Type="System.Byte[]">Field name</Property>

  • To crawl through entity associations (that is, association navigators or foreign key relationships), add the following property.

    <Property Name="DirectoryLink" Type="System.String">NotUsed</Property>

Deploying Connectors

Developers and administrators use the SharePoint Foundation 2010 solutions framework to deploy connectors. After authoring a connector, the developer packages the model file (application definition file), the solution code, connection information, and other resources into a solution package: a CAB-format (.wsp) file that is described by a manifest file. An administrator then adds the package to the server farm's solution store, which places it in the farm's configuration database, by using the Stsadm command-line tool or Windows PowerShell. Finally, the administrator deploys the solution from the solution management interface. This step also registers the solution and puts its DLLs in the global assembly cache of all the index servers.

After the connector is installed, the associated repository can be managed and crawled via the Content Source type list in the administration UI.

FAST-Specific Indexing Connectors

The connector framework and all of the productized connectors work with FAST Search for SharePoint as well as SharePoint Server search. FAST also has two additional connectors.

The Enterprise crawler provides web crawling at high performance with more sophisticated capabilities than the default web crawler. It is good for large-scale crawling across multiple nodes and supports dynamic data, including JavaScript.

The Java Database Connectivity (JDBC) connector brings in content from any JDBC-compliant source. This connector supports simple configuration using inline SQL commands (joins, selects, and so on). It supports push-based crawling, so that a source can force an item to be indexed immediately. The JDBC connector also supports change detection through checksums, and high-throughput performance.

These two connectors don't use the connector framework and cannot be used with SharePoint Server 2010 Search. They are FAST-specific and provide high-end capabilities. You don't have to use them if you are creating applications for FAST Search for SharePoint, but it is worth seeing if they apply to your situation.
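Checksum-based change detection, mentioned above for the JDBC connector, is a general technique that is easy to sketch. The following Python example is only an illustration of the idea: it does not reproduce how the FAST connector computes or stores its checksums, and the row layout, the "id" key column, and the function names are assumptions. Each row is hashed, and the hashes from the previous crawl are compared with the current ones to find added, updated, and deleted rows.

```python
import hashlib


def row_checksum(row):
    """Hash a row's column values in a stable (sorted) column order."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_changes(previous_checksums, rows, key="id"):
    """Compare current rows with the checksums stored by the previous crawl.

    Returns (added, updated, deleted, new_checksums); the first three are
    sorted lists of row keys.
    """
    new_checksums = {row[key]: row_checksum(row) for row in rows}
    added = sorted(k for k in new_checksums if k not in previous_checksums)
    updated = sorted(
        k
        for k, digest in new_checksums.items()
        if k in previous_checksums and previous_checksums[k] != digest
    )
    deleted = sorted(k for k in previous_checksums if k not in new_checksums)
    return added, updated, deleted, new_checksums


# Hypothetical product rows from two successive crawls.
first = [{"id": 1, "name": "Road Bike"}, {"id": 2, "name": "Helmet"}]
second = [
    {"id": 1, "name": "Road Bike"},
    {"id": 2, "name": "Helmet XL"},
    {"id": 3, "name": "Gloves"},
]
_, _, _, stored = detect_changes({}, first)
added, updated, deleted, _ = detect_changes(stored, second)
```

Only the rows reported as added or updated need to be sent to the indexer again, which is what makes this approach attractive for sources that have no reliable last-modified column.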

Customizing Connectivity in Summary

Using OOB shims (Database/WCF/.NET) is very straightforward with SharePoint 2010. This is recommended if the backend structure is static.

  • Create/deploy the model file using SPD and use the search UI to configure crawls.

  • Create/deploy .NET classes using Visual Studio and use the search UI to configure crawls.

Writing a custom shim and a model file is the best approach for cases with dynamic backend structures. Exchange public folders are one example. This approach also provides a cleaner integration with the search user interface.

Working with Federation

In addition to indexing information, search can present information to the user via federation. This is a "scatter-gather" approach: the same query is sent to a variety of different places, and the results are displayed together on the same page. Federation is not a replacement for indexing, but it is an essential tool for situations in which indexing is impossible (web search engines have the whole web covered; you don't have the storage or computing power to keep up with that) or impractical (you have an existing vertical search application that you don't want to touch). Federation can also be a great mechanism for migration. Figure 25 shows some of the situations where you might use indexing and federation. Microsoft has embraced federation wholeheartedly, in particular the OpenSearch standard.

Figure 25. When to use indexing and when to use federation

Microsoft began supporting OpenSearch in 2008 with the introduction of Search Server 2008. Now all of Microsoft's Enterprise Search products support OpenSearch, and all of them have implemented comprehensive support for federation with out-of-the-box federation connectors to a range of search interfaces. Federation is built to be easy to set up, taking less than five minutes for an administrator to add a federated connector and see federated results appear in search queries. Further flexibility and control over the use of federated connectors come from triggering, presentation, and security features. Enterprise Search offerings can act as OpenSearch providers, OpenSearch clients, or both.

OpenSearch is a standard for search federation, originally developed by Amazon.com for syndicating and aggregating search queries and results. The operation of OpenSearch is shown in Figure 26. It is a standard that is used throughout the industry. The basic operation involves a search Client, which could be a desktop (Windows 7), a browser (Internet Explorer 8), or a server (SharePoint 2010). It also involves a Search Provider, which is any server with a searchable RSS feed, meaning that it accepts a query as a URL parameter and returns results in RSS/Atom.

Figure 26. OpenSearch operation

OpenSearch is now supported by a broad community (see Opensearch.org) and is in common use among online information service providers (such as Bing, Yahoo!, Wikipedia, and Dow Jones-Factiva). It is becoming more and more common in business applications. Following Microsoft's introduction of OpenSearch into its Enterprise Search products, partners built OpenSearch connectors to applications such as EMC Documentum, IBM FileNet, and OpenText Hummingbird.

Microsoft Search Server 2008 supported OpenSearch and Local Index Federation. It included a federation administration UI and several Federation web parts, but federation was a bit of a side capability. The main Results web part, for example, couldn't be configured to work with federation.

With SharePoint Server 2010, all web parts are built on the Federation OM. Connections to Windows 7, Bing, IE8, and third-party clients are built in. FAST Search for SharePoint supports federation in the same way, and the Federation OM is now public — so you can create your own type of federated connector!

Customization Examples Using Federation

The following code example shows a custom OpenSearch provider (the full code is included with Code Project 6-P-2). This code creates a simple RSS feed from the result of a database query.

// Escape the user's query text so that characters such as & and < cannot break the feed XML.
string safeQueryTerm = System.Security.SecurityElement.Escape(queryTerm);

// Open the feed and declare the custom and Media RSS namespaces.
resultsXML.Append("<rss version=\"2.0\"" +
   " xmlns:advworks=\"http://schemas.adventureworks.com/Products/Search/RSS\"" +
   " xmlns:media=\"http://search.yahoo.com/mrss/\">");
resultsXML.Append("<channel>");
resultsXML.AppendFormat("<title>Adventure Works: {0}</title>", safeQueryTerm);
resultsXML.AppendFormat("<link>{1}?q={0}</link>", safeQueryTerm, RSSPage);
resultsXML.Append("<description>Searches Products in the Adventure Works database.</description>");

// Emit one <item> per row returned by the database query.
while (sqlReader.Read())
{
   ...
   resultsXML.Append("<item>");
   resultsXML.AppendFormat("<title>{0}</title>", sqlReader[0]);
   resultsXML.AppendFormat("<link>{1}?v={0}&amp;q={2}</link>", sqlReader[1],
     RSSPage, query);
   resultsXML.AppendFormat("<description>{0} ({1}) has {2} units of inventory " +
     "and will need to order more at {3} units.</description>",
     sqlReader[0], sqlReader[1], sqlReader[2], sqlReader[4]);
   ...
   resultsXML.Append("</item>");
}

// Close the channel and the feed.
resultsXML.Append("</channel></rss>");

The behavior of this is described in an OSDX file, which is shown below. An OSDX file is simple XML, and clients like Windows 7 can incorporate this with one click. Of course, SharePoint 2010 also acts as an OpenSearch client (as well as an OpenSearch provider).

<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription
   xmlns="http://a9.com/-/spec/opensearch/1.1/"
   xmlns:ms-ose="http://schemas.microsoft.com/opensearchext/2009/">
  <ShortName>ProductsSearch</ShortName>
  <Description>Searches the Adventure Works Products database.</Description>
  <Url type="text/html" template="http://demo/sites/advsearch-prod/Pages/productresults.aspx?k={searchTerms}"/>
  <Url type="application/rss+xml" template="http://demo/_layouts/adventureworks/productsearch.aspx?q={searchTerms}"/>
</OpenSearchDescription>
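To make the client side of this concrete, the following Python sketch shows roughly what an OpenSearch client does with such a description file: it picks the Url element for the result format it wants and substitutes the URL-encoded query for the {searchTerms} placeholder. The embedded description is a trimmed copy of the example above, and the search_url() helper is a hypothetical name; real clients also handle optional template parameters such as paging, which are omitted here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

OPENSEARCH_NS = "{http://a9.com/-/spec/opensearch/1.1/}"

# Trimmed copy of the description file shown above.
OSDX = """<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>ProductsSearch</ShortName>
  <Url type="text/html" template="http://demo/sites/advsearch-prod/Pages/productresults.aspx?k={searchTerms}"/>
  <Url type="application/rss+xml" template="http://demo/_layouts/adventureworks/productsearch.aspx?q={searchTerms}"/>
</OpenSearchDescription>"""


def search_url(osdx_text, query, result_type="application/rss+xml"):
    """Return the provider URL for `query`, using the template of the given type."""
    root = ET.fromstring(osdx_text.encode("utf-8"))
    for url in root.findall(OPENSEARCH_NS + "Url"):
        if url.get("type") == result_type:
            # URL-encode the query so spaces and "&" cannot corrupt the URL.
            return url.get("template").replace("{searchTerms}", quote(query, safe=""))
    raise ValueError(f"no Url element of type {result_type!r}")


rss_url = search_url(OSDX, "mountain bike")
html_url = search_url(OSDX, "helmet", result_type="text/html")
```

The client would then issue an HTTP GET for rss_url and parse the RSS items it receives, which is exactly the feed that the provider code earlier in this section produces.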

Further Considerations in Federation

There are a number of additional things to remember when using federation. First, ranking is up to the provider, so mixing results is not as dependable as you might think. Simple mixers that use round-robin results presentation are okay for situations in which all the sources are of the same type and strong overall relevance ranking is not crucial. Second, OpenSearch does not support refinement OOB — use custom runtime code and OpenSearch extensions to pass refiners if you need to. You might want to translate the query syntax to match a given source system. Use a custom web part or runtime code for that. Security also needs special handling with federation; there is nothing built into OpenSearch. Microsoft has provided extensions to OpenSearch and a framework that handles security on a wide range of authentication protocols. Implementing this, however, requires you to be aware of the security environments your application will run in.
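A round-robin mixer of the kind described above takes only a few lines. This Python sketch is an illustration rather than anything shipped with SharePoint; the function name and the sample result labels are assumptions. It takes the first result from each source, then the second from each, and so on, skipping sources that have run out.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for sources that have no more results


def round_robin(*result_lists, limit=None):
    """Interleave result lists: first hit of each source, then second, and so on."""
    mixed = []
    for tier in zip_longest(*result_lists, fillvalue=_MISSING):
        mixed.extend(result for result in tier if result is not _MISSING)
    return mixed if limit is None else mixed[:limit]


# Hypothetical result lists from three federated locations.
mixed = round_robin(["a1", "a2", "a3"], ["b1"], ["c1", "c2"])
top_four = round_robin(["a1", "a2", "a3"], ["b1"], ["c1", "c2"], limit=4)
```

Note that the mixer never compares scores across sources, which is why it is only reasonable when the sources are of the same type and strong overall relevance ranking is not crucial.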

When you design an application using federation, plan out synchronous and asynchronous federation approaches. If the federation is synchronous, it is only as strong as its weakest link; results will be returned only when the slowest system comes back, and relevance ranking will be worse than the worst system involved. If federation is asynchronous, pay careful attention to the number of different result sets and how they are laid out on the UI. If you want to make your solution available via desktop search, this is easy with Windows 7, and it works out of the box with standard SharePoint or FAST Search. You do this by creating an OpenSearch Description (.osdx) file, which can then be deployed to Windows 7 via Group Policy if you like.
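The trade-off between synchronous and asynchronous federation can be sketched with simulated sources. In the following Python example, the source names and latencies are invented, and query_source() merely sleeps instead of calling a real search provider. The synchronous version returns nothing until the slowest source has answered, while the asynchronous version hands each result set to a render callback as soon as it arrives.

```python
import asyncio


async def query_source(name, delay, results):
    """Stand-in for one federated location; `delay` simulates its latency."""
    await asyncio.sleep(delay)
    return name, results


async def federate_synchronously(sources):
    """Wait for every source, so the page is as slow as the slowest one."""
    tasks = [query_source(*source) for source in sources]
    return await asyncio.gather(*tasks)  # results in the order of `sources`


async def federate_asynchronously(sources, render):
    """Hand each result set to `render` as soon as its source answers."""
    tasks = [asyncio.ensure_future(query_source(*source)) for source in sources]
    for finished in asyncio.as_completed(tasks):
        name, results = await finished
        render(name, results)


# Hypothetical locations: (name, simulated latency in seconds, results).
SOURCES = [
    ("archive", 0.2, ["old report"]),
    ("local index", 0.0, ["team site"]),
    ("web", 0.1, ["news article"]),
]

together = asyncio.run(federate_synchronously(SOURCES))

arrival_order = []
asyncio.run(
    federate_asynchronously(SOURCES, lambda name, results: arrival_order.append(name))
)
```

In the asynchronous case, the layout question raised above becomes visible in code: each call to render corresponds to a separate region of the page that fills in at its own pace.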

We have noted a few common federation design patterns. The federation-based search vertical application would focus on using federation with core results to provide a complete results experience. A lightweight preview of results, in contrast, would show a few (~three) results to preview a source. "Instant answer across multiple sources" is supported by the top Federated Results web part, which is useful for finding an exact match or quick factoid. Last, a custom application using the Federation OM might use query alteration, refinement, and query steering across multiple sources.

Federation is a powerful tool in your arsenal, and SharePoint 2010 has made it easy to use it. It is not a panacea; if you can pragmatically index content, this is nearly always better. However, using the Federation OM and building OpenSearch providers can help in many situations.

Working with the Query OM

Query processing is an essential part of search. Since effective search depends on getting good queries from the user, query processing is often used to improve the queries, by adding context or doing pre-processing. An example is location-aware searches, where the user is looking for results within a preferred distance of a particular location, and the location might be taken from the user's context (such as a GPS coordinate in a mobile phone). Query-side processing can be used to examine search results as they return and trigger more searches based on their contents. There is a huge range of things that you can do using the SharePoint Query OM, and some very exciting applications you can build with it.

Query-Side APIs and OMs

Figures 27 and 28 show the "stack" with query-side APIs and OMs, for SharePoint search and for FAST Search for SharePoint, respectively. In these figures, light grey components are on SharePoint Server or FAST Search backend farms, and dark grey components are on other servers. Content flow is also shown in these figures, so that you can see how the whole system fits together.

Figure 27. Stack for SharePoint search

Figure 28. Stack for FAST search

It is important to understand the different ways you can access queries and results, so these next sections go through each of the query-side OMs.

The Federation Object Model (OM)

This is a new search object model in SharePoint 2010. It provides a unified interface for querying against different locations (search providers) and giving developers of search-driven web parts a way to implement end-user experiences that are independent of the underlying search engine. The object model also allows for combining and merging results from different search providers. Out-of-box web parts in SharePoint 2010 are based on this OM, and SharePoint 2010 ships with three different types of locations: SharePoint Search, FAST Search, and OpenSearch. The Federation OM is also extensible, should you want or need to implement a custom search location outside of the supported types.

The Federated Search runtime object model is now public, enabling developers to build custom web parts that search any federated location. This change, combined with richer keyword query syntax, provides a common and flexible interface for querying internal and external locations. The Federated Search Object Model now provides a consistent way to perform all queries from custom code, making it easier to write clean, reusable code.

An important enhancement of the Federated Search Object Model is the public QueryManager class, which makes it possible to customize the query pipeline. For example, developers can build a web part that passes search results from a given location or repository to other web parts. A single query can, therefore, serve multiple web parts.

The Query Web Service

This is the integration point for applications outside your SharePoint environment, such as standalone, non-web-based applications or Silverlight applications running in a browser. The Query web service is a SOAP-based web service and supports a number of operations, including the following.

  • Querying and getting search results

  • Getting query suggestions

  • Getting metadata (managed properties)

The same schema is shared by SharePoint Search and FAST Search, and both products support the same operations. For querying, clients can easily switch the search provider by setting a ResultsProvider element in the request XML. A number of extensions are available for FAST Search, for example, refinement results, advanced sorting using a formula, and issuing queries using the FAST Query Language.

The Query RSS Feed

Certain scenarios, such as simple mash-ups, may need only a simple search result list. The RSS feed is an alternative, lightweight integration point for supplying applications outside of SharePoint with a simple RSS result list. The Search Center, the default search frontend in SharePoint 2010, includes a link to a query-based RSS feed. Switching the engine to the RSS format is done simply by setting a URL provider. Because the feed was designed to be simple, there are some limitations to what can be returned and customized. For more advanced applications, use the object models or the Query web service instead.

The Query Object Model

This is the lowest-level object model, used by the Federation Object Model, the Query web service, and the Query RSS feed. Both SharePoint Search and FAST Search support the KeywordQuery object in this object model. While the Federation OM returns XML (to web parts), the Query OM returns data types.

Figure 29 shows the newly customizable pipeline for queries that originate from SharePoint Server 2010. Every object in the figure can be customized except the rightmost one, Query Processing.

Figure 29. Customizing SharePoint Server 2010 queries

Query Syntax

The Federation OM and the Query OM are the mechanisms for submitting queries. The queries themselves are strings that you construct and pass to the Search Service. A query request from a query client normally contains the following main parts:

  • The user query: This consists of the query terms that the user types into a query box found on the user interface. In most cases, the user simply types one or more words, but the user query may also include special characters like "+" and "-". The user query will normally be treated as a string that is passed transparently by the query client on the interface.

  • Property filters: These are additional constraints on the query that are added by the query client to limit the result set. These may include filters limiting the results by creation date, file type, written language, or any other metadata associated with the indexed items.

  • Query features and options: These are additional query parameters that specify how a query is executed and how the query result is to be returned. This includes linguistic options, refinement options, and relevancy options.

Search in SharePoint supports four types of search syntax for building search queries:

  • KQL (Keyword Query Language) syntax (search terms are passed directly to the Search Service)

  • SQL syntax (extension of SQL syntax for querying databases), for SharePoint search only

  • FQL (FAST-specific Query Language syntax), for FAST only

  • URL syntax (search parameters are encoded in URL and posted directly to the search page)

KQL is the only syntax that end users would typically see. For developers, this syntax is simpler to use than the SQL search syntax because you do not have to parse search terms to build a SQL statement; you pass the search terms directly to the Search Service. KQL also works across both SharePoint search and FAST Search, whereas SQL and FQL are codebase-specific. You can pass two types of terms in a keyword query: keywords (the actual query words for the search request) and property filters (the property constraints for the search request). KQL has been enhanced with SharePoint 2010 to include parametric search, so there should be very little need for SQL.

Keywords can be a word, a phrase, or a prefix. Each keyword can be simple (contributes to the search as an OR), included (must be present; that is, AND, denoted by "+"), or excluded (must not be present; that is, AND NOT, denoted by "-").

Property filters provide you with a way to narrow the focus of the keyword search based on managed properties. These are used for parametric search, which enables users to formulate queries by specifying a set of constraints on the managed property values. For example, searching for a wine with the parameters {Varietal: Red, Region: France, Rating: >90, Price: <$10} is easy to achieve with property filters, and easy to explore using refiners.

KQL supports using multiple property filters within the same query. You can use multiple instances of the same property filter or different property filters. Multiple instances of the same filter are combined with OR; for example, author:"Charles Dickens" author:"Emily Bronte" returns results from either author. Different property filters are combined with AND; for example, author:"Isaac Asimov" title:"Foundation*" returns only results that match both. Property filters also let you work with collapsed duplicates; for example, duplicate:http://<displayUrl> requests the duplicate items for the specified URL (which would otherwise be collapsed).
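The keyword and property filter rules above can be captured in a small query builder. The following Python sketch is an illustration only: build_kql() and matches() are hypothetical helpers, not part of any SharePoint API, and matches() uses exact string comparison purely to demonstrate the OR-within-a-property and AND-across-properties semantics. The builder does not escape double quotes inside values, which a production version would need to handle.

```python
from collections import defaultdict


def quote_phrase(term):
    """Wrap multi-word keywords in double quotes so they are treated as phrases."""
    return f'"{term}"' if " " in term else term


def build_kql(keywords=(), included=(), excluded=(), filters=()):
    """Assemble a keyword query string.

    `filters` is a sequence of (property, value) pairs; repeating a property
    yields OR semantics, and different properties yield AND semantics.
    """
    parts = [quote_phrase(term) for term in keywords]
    parts += ["+" + quote_phrase(term) for term in included]
    parts += ["-" + quote_phrase(term) for term in excluded]
    parts += [f'{name}:"{value}"' for name, value in filters]
    return " ".join(parts)


def matches(item, filters):
    """Exact-match model of filter semantics: OR within a property, AND across."""
    allowed = defaultdict(set)
    for name, value in filters:
        allowed[name].add(value)
    return all(item.get(name) in values for name, values in allowed.items())


either_author = [("author", "Charles Dickens"), ("author", "Emily Bronte")]
query = build_kql(keywords=["novel"], excluded=["poetry"], filters=either_author)
```

Here query is the string novel -poetry author:"Charles Dickens" author:"Emily Bronte", and an item by either author satisfies matches() for that filter list.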

With SharePoint Server 2010, enhancements to keyword query syntax enable more complex search queries that in the past were supported only by the SQL query syntax. These enhancements include support for wildcard suffix matching, grouping of query terms, parentheses, and logical operators, such as AND, OR, NOT, and NEAR. Improved operators now support regular expressions, case-sensitivity, and content source prioritization. KQL can express essentially anything you can say with SQL. The Advanced Search page, for example, now creates KQL rather than SQL.

FAST Query Language (FQL)

FAST Search has a number of extensions beyond the standard SharePoint search that are available on both the Federation and Query Object Models, and also on the Query web service. The following list contains some examples.

  • The FAST Query Language, which supports advanced query operators, such as XRANK for dynamic (query-time) term weighting and ranking

  • Deep refiners over the whole results set and the possibility of adding refiners over any managed property

  • Advanced sorting using managed properties or a query-time sort formula

  • Advanced duplicate trimming, with the ability to specify a custom property on which to base duplicate comparisons

  • "Similar documents" matching

  • The FAST Search Admin Object Model for promoting documents or assigning visual Best Bets to query keywords/phrases

The FAST Query Language (FQL) is intended for programmatic creation of queries. It is a structured language and not intended to be exposed to the end users. The FAST Query Language can be used only with FAST Search for SharePoint. Certain FAST Search for SharePoint features can only be accessed by using this query language; for example:

  • Detailed control of ranking at query time, using RANK/XRANK operators, query term weighting, and switching on/off ranking for parts of a query

  • Advanced proximity operators (ordered/unordered NEAR operators)

  • Advanced sorting, using SORT/SORTFORMULA operators

  • Complex combinations of query operators, such as nesting of Boolean operators

FQL opens a whole world of search operations to the developer. The full set of capabilities is too long to cover in this book, but the reference documentation is available on MSDN.

Examples Using Query Customization

Now let's see how all of this works together, using some examples. First, imagine that you are building an application that helps users research information about companies. Perhaps you want to bring back additional information about a company whenever it is mentioned. If a query uses a company name of a publicly listed company, let's bring back stock information; if a query uses a ticker symbol, let's put that information on top. Finally, if a query brings back information that is tagged with companies, let's show the current stock price as metadata in the main Results web part.

Code Project 6-P-4 walks through this example. It processes the query and adds synonyms and parameters, in addition to using federation, if there is a ticker symbol to be found. On the content side, it uses ticker symbol metadata to look up the current price for the top results returned.

Another example is location-aware search. Now we will use a FAST-specific operator, SORTFORMULA, to sort results in order by distance. We can also cut out results beyond a threshold. Figure 30 shows how this works, and Code Project 6-P-5 walks through how to do this.
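The computation behind a distance sort is worth seeing in isolation. The following Python sketch does not show FQL or the SORTFORMULA syntax (Code Project 6-P-5 covers that); it only illustrates what a location-aware sort has to compute: a great-circle distance per result, a cutoff, and an ascending sort. The result dictionaries with lat and lon keys, the sample coordinates, and the function names are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_results(results, user_lat, user_lon, max_km):
    """Drop results farther than `max_km`, then sort the rest by distance."""
    scored = [
        (haversine_km(user_lat, user_lon, r["lat"], r["lon"]), r) for r in results
    ]
    nearby = [(distance, r) for distance, r in scored if distance <= max_km]
    nearby.sort(key=lambda pair: pair[0])
    return [dict(r, distance_km=round(distance, 1)) for distance, r in nearby]


# Hypothetical store locations; the user is assumed to be in Seattle.
stores = [
    {"title": "Portland store", "lat": 45.5152, "lon": -122.6784},
    {"title": "Tacoma store", "lat": 47.2529, "lon": -122.4443},
    {"title": "Redmond store", "lat": 47.6740, "lon": -122.1215},
]
nearby_stores = nearest_results(stores, 47.6062, -122.3321, max_km=100)
```

With a 100 km threshold, the Portland entry is cut and the remaining two are returned nearest first. In a real deployment, the sort is pushed to the search engine so that it applies to the whole result set rather than to one page of results.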

Figure 30. Sorting results in order by distance
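To make the idea concrete outside of FAST, the following Python sketch sorts results by great-circle distance from the user and drops anything beyond a threshold. It does not use SORTFORMULA and is not the code from Code Project 6-P-5; the function names and the "lat"/"lon" result fields are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to guard against rounding slightly above 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def sort_by_distance(results, origin, max_km=None):
    """Return results nearest-first, optionally cutting off at max_km."""
    scored = [
        (haversine_km(origin[0], origin[1], r["lat"], r["lon"]), r)
        for r in results
    ]
    if max_km is not None:
        scored = [(d, r) for d, r in scored if d <= max_km]
    scored.sort(key=lambda pair: pair[0])
    return [r for _, r in scored]

offices = [
    {"title": "New York", "lat": 40.71, "lon": -74.01},
    {"title": "Portland", "lat": 45.52, "lon": -122.68},
    {"title": "Redmond", "lat": 47.67, "lon": -122.12},
]
seattle = (47.61, -122.33)
print([r["title"] for r in sort_by_distance(offices, seattle, max_km=500)])
```

Sorting inside the engine is preferable when it is available, because distance is then computed across the full result set rather than only over the page of results returned to the application.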

Social Search

A significant aspect of an individual's work in an organization is interacting with other people and finding the right people to connect with, those who have specific skills and talents. This can be a daunting challenge in a large organization. SharePoint Server 2010 addresses this challenge through search, and connects this search to the social capabilities in SharePoint Server 2010. A People Search Center provides specific capabilities for connecting with people.

End-User-Visible Functionality

We touched on the people search function at the beginning of this chapter (see Figure 10). Now let's run through a few other aspects of social search that are visible to the end user, so that you can see how to use this in your application.

Mining and Discovering Expertise

Users can manually submit or automatically generate a list of colleagues mined from Outlook. Using automatically generated lists of colleagues is a way of rapidly inferring social relationships throughout the organization, which speeds the adoption and usefulness of people search results. SharePoint Server 2010 also infers expertise by automatically suggesting topics mined from the user's Outlook inbox and suggesting additions to her expertise profile in her My Site. This makes it easy to populate My Site profiles and means that more people have well-populated profiles and get the benefits of this in both search and communities.

Improving Search Based on Social Behavior

For many organizations, SharePoint sites have become gathering places where people create, share, and interact with information. Social behavior is taken into account in order to provide high-quality search results in several ways. First, the relevance ranking for people search takes social distance into account: a direct colleague will appear before someone three degrees removed (for example, a friend-of-a-friend-of-a-friend). Second, SharePoint Server 2010 supports the social tagging of content, and this feedback can influence the relevance of content in search results. People's day-to-day usage of information in SharePoint Server 2010 and Microsoft Office can have a measurable effect on search relevance, thereby helping the organization harness the collective wisdom of its people.
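The notion of social distance can be illustrated with a breadth-first search over a colleague graph. The Python sketch below is only a model of the concept: it is not SharePoint's ranking algorithm, which combines social distance with other relevance signals, and the function names and sample names are hypothetical.

```python
from collections import deque

def social_distances(colleagues, start):
    """Breadth-first search: number of colleague hops from `start`."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for other in colleagues.get(person, ()):
            if other not in distances:
                distances[other] = distances[person] + 1
                queue.append(other)
    return distances

def rank_people(matches, colleagues, searcher):
    """Order matching people by social distance; unreachable people last."""
    distances = social_distances(colleagues, searcher)
    unreachable = float("inf")
    return sorted(matches, key=lambda p: (distances.get(p, unreachable), p))

colleagues = {
    "ana": ["ben"],
    "ben": ["ana", "chi"],
    "chi": ["ben", "dev"],
    "dev": ["chi"],
}
print(rank_people(["dev", "zed", "ben", "chi"], colleagues, "ana"))
```

Here a direct colleague sorts ahead of someone three hops away, and a person with no connection to the searcher sorts last.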

Social Search Architecture and Operations

Social search capabilities work directly out of the box. In most cases, you will not need to change them; you can just use them as part of your application. However, understanding the architecture and good practices for operations is useful, regardless of whether you plan to extend social search capabilities.

Architecture and Key SSAs

Three Shared Service Applications (SSAs) are critical to a SharePoint 2010 farm tuned for social search. The User Profile SSA is the data source, which can draw from Active Directory (AD), LDAP, or other repositories that store data about employees. The Managed Metadata SSA provides a way to store relationships between metadata values and enable some control over the health of the data in the profile store. Features of the Search SSA tune results, refinement, and ranking to take advantage of the data coming from the user profile application and the managed metadata application.

Figure 31 shows these components and how they relate to each other in SharePoint 2010.

Figure 31. Search SSA components

Managing User Profiles

Because social search is based in large part on user profiles, there are some basic techniques organizations should use to help keep these profiles fresh and high quality. These include encouraging users to use photos and update profile information. Turning on "knowledge mining" and encouraging users to publish suggested keywords are also possible techniques. All of these use out-of-the-box features, without any extensions.

SharePoint 2010 provides out-of-the-box properties such as Responsibilities, Interests, Skills, and Schools, but as a developer, you might want to add new properties. This involves setting up a connection to the Managed Metadata SSA, defining the custom profile properties, and then adding each new property to the profile store.

You might also want to extend user profiles and social search in other ways, for example, by bringing in external profile information, generating and normalizing profile information, and so forth. You can also extend the colleague and expertise suggestion capabilities.

Social Tags

Social tags are indexed as part of the people content source. The tag is stored with the person, not the item, until it gets to the search system. This is important because it means that end users can tag external content, and anything with a URL. Tagging is useful for many purposes, including content management and social computing.

Social tags affect the ranking of search results for SharePoint Server 2010 Search but not for FAST Search for SharePoint. To provide this for FAST, you would extend the standard crawl and/or provide application logic to collect and pre-process social tags.
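Because tags live with the person until they reach the search system, making them useful for item ranking requires inverting that relationship at some point. The Python sketch below shows such an inversion in its simplest form; the data layout and the tags_by_url function are assumptions for illustration and do not reflect how SharePoint stores or indexes tags internally.

```python
from collections import defaultdict

def tags_by_url(profiles):
    """Invert person -> [(url, tag)] into url -> {tag: distinct tagger count}."""
    taggers = defaultdict(lambda: defaultdict(set))
    for person, taggings in profiles.items():
        for url, tag in taggings:
            # Normalize case so "Search" and "search" count as one tag.
            taggers[url][tag.lower()].add(person)
    return {
        url: {tag: len(people) for tag, people in tags.items()}
        for url, tags in taggers.items()
    }

profiles = {
    "ana": [("http://example.com/spec", "Search"),
            ("http://intranet/plan", "budget")],
    "ben": [("http://example.com/spec", "search")],
}
print(tags_by_url(profiles))
```

An application that feeds social tags to FAST Search for SharePoint would need a step of this kind, for example in custom crawl or pre-processing logic, before the tags could be attached to items as metadata.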

Note

Extending Social Search. People search can be extended in the same ways that we described for content search; that is, customized web parts, federation, and query processing. It can also be extended via user profiles as described earlier.

Content Enhancement

The old adage "garbage in, garbage out" springs to mind when considering content quality. Content preparation means selecting the right content, appropriately transforming and tagging it, cleansing or normalizing the content (making content regular and consistent, using appropriate spelling or style), and reducing the complexity of disparate data types. Information sources may include everything under the sun: web/intranet (HTML, XML, multimedia), file and content management systems (DOC, XLS, PDF, text, XML, CAD, and so on), email (email text, attachments), and databases (structured records). Findability is enhanced significantly by content enhancements and linguistic techniques.

Metadata and linguistics are essential to search. Understanding how they work and how you can extend them enables you to build great search-driven applications.

Crawled Properties, Managed Properties, and Schemas

Search is performed on managed properties. Managed properties contain the content that will be indexed, including metadata associated with the items. Mapping from crawled properties to managed properties is controlled as part of the search configuration. A typical approach is to first perform a crawled property discovery based on an initial set of crawled items. Based on the results, you can change the mapping to managed properties.
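The discover-then-map workflow can be sketched in a few lines of Python. This is a conceptual model only: the function names are hypothetical, the crawled property names are illustrative, and the "first non-empty crawled property wins" rule is one possible mapping policy chosen for the sketch rather than a statement of how the product resolves multiple mappings.

```python
from collections import Counter

def discover_crawled_properties(items):
    """Count how often each crawled property appears in a sample crawl."""
    counts = Counter()
    for item in items:
        counts.update(item.keys())
    return counts

def apply_mapping(item, mapping):
    """Build managed properties from an item's crawled properties.

    `mapping` maps a managed property name to an ordered list of crawled
    property names; the first non-empty value is used (sketch assumption).
    """
    managed = {}
    for managed_name, crawled_names in mapping.items():
        for crawled_name in crawled_names:
            value = item.get(crawled_name)
            if value not in (None, ""):
                managed[managed_name] = value
                break
    return managed

sample = [
    {"ows_Title": "Plan", "ows_Author": "", "meta_author": "J. Doe"},
    {"ows_Title": "Budget", "ows_Author": "A. Smith"},
]
mapping = {"Title": ["ows_Title"], "Author": ["ows_Author", "meta_author"]}

print(discover_crawled_properties(sample))
print([apply_mapping(item, mapping) for item in sample])
```

Running discovery over an initial sample shows which crawled properties actually occur, which in turn informs which mappings are worth defining.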

Managed properties can be used for ranking, sorting, and navigation (refiners). Assessing which managed properties to use as metadata in your application is one of the most important aspects of creating great findability. Search will find information anywhere via a full-text search on the body of documents, but using metadata makes the search quality better, as well as enables sorting and navigation. You can add additional managed properties at any point in the development and deployment process, but having a good core set to start makes development and testing much easier.

In the search world, linguistics is defined as the use of information about the structure and variation of languages so that users can more easily find relevant information. This is important for properly using search tools in various "natural" languages; think of the structural differences between English and Chinese. It is also important to industry-specific language usage; for example, the English used in an American pharmaceutical company versus that used in a Hong Kong-based investment bank.

As you plan your application, get a sense of the number of languages involved, both in the content and among the user population. Find out what vocabularies exist in the organization; these may be formal taxonomies, dictionaries purchased from specialist firms, or informal lists from glossaries and team sites.

The Problem of Missing Metadata

Missing or incorrect metadata is a significant problem. An anecdotal example illustrates this point: A census performed on one company's PowerPoint presentations a year ago showed that nearly a quarter were authored by the CEO himself. This result was, doubtless, because, as the founder, he created early presentations that have since been edited, copied, and modified many times over. As the saying goes: "This is my grandfather's axe; my father changed the blade, and I changed the handle." Lacking automatic systems that update metadata, the "history" of the document can incorrectly categorize the current version.

Advanced Content Processing with FAST

FAST Search for SharePoint includes a scalable, fault-tolerant, and extensible content-processing pipeline, based on technology from FAST.

Figure 32. High-level structure of the content processing pipeline

The content-processing pipeline is a framework for refining content and preparing it for indexing, with the following characteristics:

  • A content-processing pipeline is composed of small simple stages, each of which does one thing. Stages identify attributes such as language, parse the structure of the document encodings (document format, language morphology, and syntax), extract metadata, manipulate properties and attributes, and so forth.

  • A wide range of file formats (over 400) are understood and made available for indexing by the pipeline.

  • A wide range of human languages are detected and supported by the pipeline (82 languages detected, 45 languages with advanced linguistics features). These features include spell-checking and synonyms (which improve the search experience) and lemmatization (which provides higher precision and recall than standard techniques like stemming).

  • Property extraction creates and improves metadata by identifying words and phrases of particular types. Pre-built extractors include Person, Location, Company, E-mail, Date, and Time. A unique offensive-content-filtering capability is also included.
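The "small, simple stages" design can be illustrated with a toy pipeline in Python. Nothing here corresponds to the actual FAST stage implementations or their names: the three stages, the stopword-based language guess, and the document dictionary layout are all assumptions chosen to show how single-purpose stages compose.

```python
import re

EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Tiny stopword lists; a real language detector is far more sophisticated.
STOPWORDS = {"en": {"the", "and", "of"}, "de": {"der", "die", "und"}}

def normalize_whitespace(doc):
    """Stage 1: collapse runs of whitespace in the body."""
    return {**doc, "body": " ".join(doc["body"].split())}

def detect_language(doc):
    """Stage 2: guess the language by counting known stopwords."""
    tokens = re.findall(r"[a-z]+", doc["body"].lower())
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return {**doc, "language": best if scores[best] > 0 else "unknown"}

def extract_emails(doc):
    """Stage 3: pull e-mail addresses out of the body as metadata."""
    return {**doc, "emails": sorted(set(EMAIL_PATTERN.findall(doc["body"])))}

def run_pipeline(doc, stages):
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc

STAGES = [normalize_whitespace, detect_language, extract_emails]
raw = {"body": "Contact  the team\nand the owner at jane@example.com"}
print(run_pipeline(raw, STAGES))
```

Because each stage takes a document and returns a document, stages can be added, removed, or reordered without touching the others, which is the property that makes a pipeline of this shape extensible.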

Content Pipeline Configuration

The content-processing pipeline in FAST Search for SharePoint can be configured and extended. This is made available in a structured fashion, which makes it simpler, more robust, and less error-prone than the equivalent in FAST ESP. Configuration of each stage is done via GUI or XML configuration, and is available via PowerShell. In the pipeline, content is mapped into "crawled properties" (whatever is found in the content) and then into "managed properties" (mapped into a schema and made available for searching, sorting, and navigation). This schema is accessible via GUI or PowerShell.

Content Pipeline Extensibility

There are several ways for developers and partners to add value in content processing, including the following examples.

  • Configure connectors, pipeline configurations, and the index schema to support specific search applications.

  • Apply optional pipeline stages such as using the XML properties mapper, the Offensive Content Filter, and field collapsing (which enables grouping or folding results together).

  • Create custom verbatim extractors (dictionary-driven identification of terms and phrases), for example, to identify all product names or project names and extract these as managed properties for each document.

  • Create custom connectors, using BCS (or other APIs) to bring in and index data from specific systems and applications.

  • Process content prior to crawling; for some applications pre-processing content prior to crawling is useful (such as separating large reports into separate documents). This can be done externally to search or within a connector shim.

  • Extend the pipeline. By creating code that is called immediately before the PropertiesMapper stage, you can apply specialized classifiers, entity extractors, or other processing elements to support specialized scenarios.
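As an illustration of the dictionary-driven extraction mentioned in the list above, the following Python sketch matches a fixed list of names against document text and returns the canonical form of each hit. It shows the technique only: it is not the FAST extractor format or the pipeline extensibility interface, and the build_extractor function and sample names are hypothetical.

```python
import re

def build_extractor(phrases):
    """Return a function that finds dictionary phrases in text.

    Matching is case-insensitive and on word boundaries. Longer phrases
    are tried first so "Project Apollo" wins over "Apollo" at the same
    position. Phrases that start or end with punctuation are not handled.
    """
    ordered = sorted(set(phrases), key=lambda p: (-len(p), p))
    if not ordered:
        return lambda text: []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    canonical = {p.lower(): p for p in ordered}

    def extract(text):
        found = []
        for match in pattern.finditer(text):
            name = canonical[match.group(0).lower()]
            if name not in found:   # keep first-seen order, no duplicates
                found.append(name)
        return found

    return extract

extract_projects = build_extractor(["Project Apollo", "Apollo", "Contoso Widget"])
text = "The project apollo team shipped the Contoso Widget; Apollo is next."
print(extract_projects(text))
```

The returned list is the kind of value that would be written to a crawled property and then mapped to a managed property, so that project or product names become available for refinement and sorting.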

Multilingual Search

If your organization is truly global, then the need for multilingual search is clear. But even if you initially think that all of your organization's search needs are English only, it is fairly common to discover that some percentage of users and content are non-English.

You should think carefully about the language-specific features of your search function. If people search only for content in their own language, or if there is wide variation in the language types used (English, Polish, and Chinese, for example), then it will help to have users specify their language in the query interface. Where there are common linguistic roots (for example, on an e-commerce site that features English and Dutch content), it might be easier to handle everything in the most common language, in this case, English.

A full description of linguistics and how you can use it to improve search is beyond the scope of this book. But there are a few things that you should know about linguistics, including the following:

  • Better use of linguistics will improve precision and recall

  • Industry and user knowledge are needed to optimize search systems

  • Linguistic choices can affect hardware and performance

  • Some sites should favor language independence

  • Bad queries can be turned into good queries with the proper linguistic tools

For many search applications, the out-of-the-box search configuration is all you need. User language choices are set in the Preferences panel of the Search Center and, by default, are determined from the browser. But be aware that linguistic processing can provide a lot of power in multilingual situations or in situations that demand particularly tuned recall and precision.

Extending Search Using the Administrative OM

SharePoint Server 2010 provides an extensible Search Health Monitoring Object Model. This object model enables administrators and developers to customize the administrative dashboards and pages that provide snapshots of the overall health of the search system, and to provide ways to troubleshoot and identify the underlying causes of any problems. The Search Health Monitoring user interface provides tools for monitoring the health of functional search subsystems (for example, crawling and indexing), search content sources, and key components (for example, databases) of the search system's topology.

Authentication and Security

Security in search is both a simple and a deep subject. Simply put, search uses the user's credentials and the entitlements on any content that has been indexed to ensure that users can see only content they are entitled to read. For OOB connectors and straightforward security environments, this just works. As you build custom connectors and work in heterogeneous and complex security environments, you also have the responsibility to extend security for search.
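The basic trimming rule can be expressed compactly. The Python sketch below uses a deliberately simplified model, with plain string principals and per-item allow and deny lists, whereas real entitlements are security descriptors or claims evaluated by the platform; the security_trim function and the field names are hypothetical.

```python
def security_trim(results, user_principals):
    """Keep only results the user is entitled to read.

    Each result carries "allow" and "deny" lists of principals (user or
    group identifiers). Deny takes precedence over allow, and an item
    with no matching allow entry is hidden.
    """
    principals = set(user_principals)
    visible = []
    for result in results:
        if principals & set(result.get("deny", ())):
            continue
        if principals & set(result.get("allow", ())):
            visible.append(result)
    return visible

results = [
    {"title": "Budget", "allow": ["group:finance"]},
    {"title": "Handbook", "allow": ["group:everyone"]},
    {"title": "Board minutes", "allow": ["group:everyone"], "deny": ["alice"]},
    {"title": "HR case", "allow": ["group:hr"]},
]
alice = {"alice", "group:finance", "group:everyone"}
print([r["title"] for r in security_trim(results, alice)])
```

A custom connector's responsibility, in these terms, is to supply accurate allow and deny information for each item at crawl time so that trimming at query time has something correct to evaluate.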

There are two major new security enhancements with SharePoint 2010. First, item-level security descriptors can now be retrieved for external data exposed by Business Connectivity Services. This means that search security is straightforward when building new connectors with BCS. Second, claims authentication (see Chapter 11) provides a wide range of security options for heterogeneous environments. Search benefits from these significantly, because search is often used as a "bridge" to look across information from many different systems.

Search Reports

The object model supports a reporting system that you can easily customize and extend. You can modify default alert rules and thresholds, for example, by changing the alert rules XML file. You can also upload new reporting applications developed by third parties to a standard search administration document library. The reports generated by these reporting applications are XML files in the standard Report Definition Language Client-Side (RDLC) format. For more information, see the Report Definition Language Specification, which is available on Microsoft TechNet.

Summary: Customizing Search with SharePoint 2010

Building powerful search applications is easier than ever in SharePoint 2010. You can create a wide range of applications based on search, at various levels of customization. You can also combine search with other parts of SharePoint (Insights, Social, Composites, Sites, and Content) to create compelling solutions.

FAST Search is now integrated into the SharePoint platform, and developers of search-driven solutions and applications can leverage a common platform and common APIs for both SharePoint Server 2010 search and FAST Search for SharePoint. This means you can build applications to support both search engines and then extend them if and when desired to take advantage of the more advanced features available with FAST Search, such as dynamic ranking, flexible sort formulae, or deep refiners for insight into the full result set. FAST Search for SharePoint web parts use the same unified object model as SharePoint Server 2010 search. As a result, if you develop a custom solution that uses the Query Object Model for SharePoint Server 2010, for example, it will continue to work after you migrate your code to FAST Search Server 2010 for SharePoint.

Additional Resources

For more information, see the following resources:

About the Authors

Tom Rizzo is a senior director in the Microsoft SharePoint product management team.

Reza Alirezaei is an independent consultant and a Microsoft MVP who is focused on designing custom applications with SharePoint, Office, and Microsoft Business Intelligence products and technologies. Reza has helped many development teams architect and build large-scale, mission-critical applications. In addition to consulting, Reza is an instructor and speaker who presents at many local and international conferences. For complete information about Reza, please see his blog.

Paul J. Swider is a consultant, the Enterprise SharePoint strategist for OnClick Solutions, and president of the Charleston SharePoint Users Group.

Scot Hillier is an independent consultant and Microsoft SharePoint Most Valuable Professional (MVP) focused on creating solutions for Information Workers with SharePoint, Office, and related .NET Framework technologies. He is the author/coauthor of 15 books and DVDs on Microsoft technologies, including Inside Microsoft SharePoint 2010 and Professional Business Connectivity Services. Scot splits his time between consulting on SharePoint projects, speaking at SharePoint events like Tech-Ed, and delivering training for SharePoint developers. Scot is a former U.S. Navy submarine officer and graduate of the Virginia Military Institute. Scot can be reached at scot@shillier.com.

Jeff Fried is a senior product manager at Microsoft and author of more than 50 technical papers.

Kenneth Schaefer is an independent developer and designer focusing on SharePoint and web-based solutions.