Evaluating and Customizing Search Relevance in SharePoint Server 2007
Summary: Learn about settings you can adjust to improve relevance in Enterprise Search in Microsoft Office SharePoint Server 2007, and how to implement an evaluation process to determine the impact your adjustments make on relevance. (16 printed pages)
Dmitriy Meyerzon, Microsoft Corporation
Avi Schmueli, Microsoft Corporation
Jo-Anne West, Microsoft Corporation
June 2007
Applies to: Microsoft Office SharePoint Server 2007, Microsoft Office SharePoint Server 2007 for Search
Contents
Introduction to Evaluating Relevance in Enterprise Search Implementations
Scenarios and Challenges to Evaluation
Side-by-Side UI Comparisons
Statistical Evaluation Process
Tuning Relevance
Conclusion
Additional Resources
Introduction to Evaluating Relevance in Enterprise Search Implementations
Multiple factors contribute to user satisfaction with search. These factors include precision (finding the right answers), recall (finding all the answers), visual design, usability, speed, and so on. So how would IT professionals evaluate the effectiveness of an enterprise search implementation? What alternative implementations of enterprise search would they consider to create greater user satisfaction? Can you determine and improve the precision of the search function objectively? How do you measure, compare, and improve different aspects of the search experience?
The Microsoft development team for SharePoint Search examined these issues during development of Enterprise Search in Microsoft Office SharePoint Server 2007 and in Search in Windows SharePoint Services. This article discusses the team's approaches and findings. It focuses especially on the robust methodology that guided SharePoint Search in choosing and investing in the most suitable algorithms, tuning them to best fit corporate environments, and measuring progress. This methodology that served the team during development is offered as an example for IT professionals to evaluate alternative search products, and even to gauge progress as they modify the user experience and customize their SharePoint Search deployments for their organizations.
Scenarios and Challenges to Evaluation
Different needs can motivate an evaluation of a search implementation. The design of the evaluation is determined by what you are trying to find out.
A simple scenario evaluates competing search products to select the best available implementation. The effectiveness of the search product candidates depends on a review of the full set of their features. End-user satisfaction is affected by index coverage, ranking and finding algorithms, and results presentation—from the placement, size, and color of the results, to the type of properties that can be displayed and the level of interaction that is available. The search product that has the best combination of features would be "best."
But an organization can rarely deploy a search product "as is" and meet its specific needs. Real-world enterprise search implementations require much more. Consider Microsoft Office SharePoint Server 2007. Office SharePoint Server 2007 is a server environment that is easy to install and configure, and Enterprise Search in Microsoft Office SharePoint Server 2007 requires minimal configuration to make it run. But the implementations for most organizations must customize the end-user search experience for the context of the organization, and the implementation's appearance and behavior ("look and feel") must fit into the organization's broader environment. Entry points and search results must suit existing navigation schemes. Search scopes must reflect existing categorization. Advanced search must present local result types and properties, reflecting local work processes. Each enterprise environment has its own collection and taxonomy of keywords, acronyms, and professional jargon, with associated resources and editorial content, to integrate into the search experience. Finally, the document and content types and storage mechanisms differ, and they benefit from custom tuning of ranking parameters to achieve optimal results relevance.
In this context, evaluation is the tool that must provide usable information about which approaches to take to implement search effectively. This was the approach the SharePoint Search development team took to evaluate its progress during early deployments within the Microsoft corporate intranet.
MSWeb, the Microsoft internal corporate portal, serves more than 66,000 unique visitors per month and was the earliest adopter of Enterprise Search in Microsoft Office SharePoint Server 2007. Each page in the MSWeb portal offers a search box; the portal handles about 25,000 queries per day. MSWeb provides a focal search experience for all Microsoft employees, offering document, people, and business data search tabs and an advanced property-based search. MSWeb consumes search results from a centrally managed Shared Services Provider (SSP). The SSP's search index, holding 20 million documents, also serves thousands of departmental sites, team sites, and individual My Sites.
In this article, we discuss our experiences with evaluating and tuning search ranking on MSWeb. We focus in particular on the statistical evaluation of the relevance of search results, and on how you can apply the same approach at customer sites. We collected the most quantitative, usable information from statistical evaluation; however, we also discuss other valuable approaches we used.
Side-by-Side UI Comparisons
To conduct a simple evaluation of implementation A versus implementation B, a side-by-side comparison of the user interfaces (UI) is most suitable. We used this type of UI comparison to help us evaluate our current Microsoft search implementation against previous releases, and compare it with other search products.
The Enterprise Search UI provides a single search box. The user enters the query or selects from a set of popular user queries. Returned results are displayed in a results page that shows the results from two different search implementations, displayed side by side. The user then chooses the more satisfying page of results: right or left. The user can also provide additional feedback.
You must be careful to run the comparison in equivalent environments. Both search experiences should draw on a similar collection of documents. To elicit meaningful judgments from users, the collection should ideally be comprehensive, so that it contains the documents that would satisfy typical, real queries.
During development, the team employed two versions of this comparison UI. The first, shown in Figure 1, embeds two search results pages in side-by-side frames.
Figure 1. User interface for side-by-side comparison
One advantage of this approach is that it reflects the search experience with maximum fidelity: each competing side shows its full product UI to the user. The only way it fails to replicate the full user experience with each provider is that it forces each UI into half of the screen space; as a result, it can hide certain page elements from the user, force the user to navigate differently than in real-world use, and detract from the experience.
Full fidelity in showing the UI is also a hindrance when the comparison seeks more granular insights into the strengths and weaknesses of each side. For example, presenting result titles in a 10-point rather than an 11-point font might have the beneficial effect of fitting more results on the screen, but it is difficult to isolate that effect when font size is only one of many variables on which the two sides differ. More generally, with this approach you cannot distinguish the effects of the UI from the effects of the relevance ranking on the users' experience.
To make this distinction clear, the development team for SharePoint Search used a variant of this UI comparison that tries to represent an "anonymous" view of the competing "sides." Figure 2 shows this variant.
Figure 2. Side-by-side comparison with generic user interface
The code behind the pages in this variant submits the queries to the search providers via their respective APIs and displays the list of results in a neutral presentation. This removes any UI-specific effects.
By using these comparison pages, you can easily determine which ranking approach the users preferred. In addition, interleaving both approaches can give sharper insights into how presentation and raw ranking algorithms contributed to the results relevance. For example, if users typically prefer side 1, but then vote for side 2 when the UI is stripped from the page, this would indicate that presentation elements in the latter have a positive effect.
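The following C# sketch illustrates the idea behind the anonymized comparison page. The ISearchProvider interface, its implementations, and the rendering method are hypothetical and are not part of the Office SharePoint Server 2007 object model; a real page would call each engine's own query API behind such an abstraction.

using System;
using System.Collections.Generic;

// Hypothetical abstraction over two competing search implementations.
// Real code would wrap each engine's own query API behind this interface.
public interface ISearchProvider
{
    IList<string> GetResultTitles(string query, int count);
}

public class SideBySideComparison
{
    private readonly Random random = new Random();

    // Renders the two result lists in a neutral, text-only presentation and
    // randomly assigns them to the left or right column so that evaluators
    // cannot tell which implementation produced which list.
    public void RenderAnonymously(string query, ISearchProvider providerA, ISearchProvider providerB)
    {
        IList<string> resultsA = providerA.GetResultTitles(query, 10);
        IList<string> resultsB = providerB.GetResultTitles(query, 10);

        bool swap = random.Next(2) == 0;
        IList<string> left = swap ? resultsB : resultsA;
        IList<string> right = swap ? resultsA : resultsB;

        Console.WriteLine("Query: " + query);
        for (int i = 0; i < Math.Max(left.Count, right.Count); i++)
        {
            string leftTitle = i < left.Count ? left[i] : string.Empty;
            string rightTitle = i < right.Count ? right[i] : string.Empty;
            Console.WriteLine("{0,2}. {1,-40} | {2}", i + 1, leftTitle, rightTitle);
        }
        // The evaluator's vote ("left" or "right") is recorded together with the
        // hidden mapping so that preferences can be attributed to the correct side.
    }
}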
Both side-by-side evaluation approaches are useful as barometers on the quality of retrieval and as a source of insights into improvement opportunities. But both approaches have limitations, as follows:
They offer no robust means of identifying, at a granular level and with large-scale data, where the opportunities for improvement are or which ingredients to invest in.
They do not indicate how to build a better search engine (during product development) or how to deploy (when implementing search for a given enterprise environment).
For this, we require a more precise evaluation methodology.
Experience shows that intuition, insight, and instincts are poor guides for making modifications to search algorithms. A common scenario could be as follows: a small set of frequent or sensitive queries returns inadequate results. For example, queries for the CEO's favorite project or for a particular product name return few or no matches. Feedback streams to the IT department; analysis suggests an obvious solution. It could be, "The result we expected should appear on top because the project name is mentioned in the first line." Or, "The top results are all wrong because the query terms do not appear in the title." The fix works well for the problem query and is applied. But dozens of other queries immediately suffer reduced relevance, and what seemed like a sensible solution becomes a bigger problem.
We need a method that automatically evaluates the effectiveness of proposed approaches, and we need a way to test the insights and instincts efficiently and objectively. We discuss the statistical evaluation process we adopted in the following section.
Statistical Evaluation Process
The statistical evaluation process has the following basic steps:
Define the metrics to use.
Set up a collection of queries.
Annotate queries with user intents.
Generate "candidate" results for evaluations.
Harvest evaluator judgments.
Examine the metrics.
Tune parameters to customize or improve ranking.
We discuss each step in detail in the following sections.
Define the Metrics to Use
To the development team for SharePoint Search, improving relevance meant better end-user satisfaction with search. To keep the metrics true to this goal, we followed these simple guidelines:
A "relevant" result is a real result that satisfies a real user's query.
When measuring relevance, we take a sample of real users—not subject matter experts, taxonomists, or search experts. We let them review a sample of real users' queries, culled from real query logs. We then give them a sample of real documents. We ask them to grade the documents on a scale to determine whether results are good for the specific queries. The query sample is a balanced representation of different query classes, ensuring optimal performance across popular and unique queries.
Metrics should closely reflect users' needs.
We keep our metrics close to the model of an end user examining search results. The essential metrics are as follows:
Precision at 10: The fraction of relevant results in the typical first page of returned results (the first 10 results). For example:
Two relevant results in the first page: 2/10 = 0.2
Three relevant results in the first page: 3/10 = 0.3
Precision at 5: The fraction of relevant results in the typically visible part of the first page of returned results, that is, the results that appear "above the fold" (the first five results). For example:
Two relevant results in the first five: 2/5 = 0.4
One relevant result in the first five: 0.2
Reciprocal rank: Reciprocal of the position of the first relevant result in the set. For example:
If the first relevant result is found at the top, 1/1 = 1.0
If the first relevant result is found at the third position: 1/3 = 0.33
Normalized Discounted Cumulative Gain (NDCG): A cumulative metric that allows for multilevel judgments; averaged over the query set, it is the overall accuracy number for the ranking engine. The engine is penalized for returning bad results with a higher rank and good results with a lower rank.
For a given query Qi, the following formula computes the NDCG, where N is typically 3 or 10:
Figure 3. Formula for computing NDCG
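The formula figure is not reproduced here. A commonly used form of the computation, consistent with the description above (multilevel gains discounted by position and normalized so that a perfect ordering scores 1.0), is the following; treat it as an illustration rather than the exact Office SharePoint Server 2007 formula:

NDCG_N(Q_i) = Z_N \sum_{j=1}^{N} \frac{2^{r(j)} - 1}{\log_2(1 + j)}

where r(j) is the judged rating of the result at position j and Z_N is a normalization constant chosen so that a perfect ordering of the results yields a value of 1. Averaging the per-query values over the query set gives the overall accuracy number.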
We defined four "quality levels" for relevant results:
**Excellent** For example, truly canonical resources, official copies, or any result that deserves the first spot.
**Good** Not as good as excellent results; might link to the excellent result, or provide additional information. Good results should usually appear in the top 10 results returned.
**Fair** Might provide some information. Fair results can possibly be good results for some user intents.
**Bad** The result is not relevant to the query at all.
We also defined a special case for results that were broken links.
We could track each of our metrics against each quality level. Using varied search environments and query collections, the numbers help guide the development efforts to address a wide variety of search circumstances.
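As an illustration of how these metrics can be computed from graded judgments, the following C# sketch calculates precision at K, reciprocal rank, and NDCG for a single query. The gain values assigned to each quality level, the relevance threshold (Good or better), and the exact NDCG formula are assumptions chosen for illustration, not the precise values used by the product.

using System;
using System.Collections.Generic;

public enum Judgment { Bad = 0, Fair = 1, Good = 2, Excellent = 3 }

public static class RelevanceMetrics
{
    // Precision at K: fraction of the top K results judged Good or better.
    public static double PrecisionAtK(IList<Judgment> results, int k)
    {
        int relevant = 0;
        for (int i = 0; i < k && i < results.Count; i++)
        {
            if (results[i] >= Judgment.Good) relevant++;
        }
        return (double)relevant / k;
    }

    // Reciprocal rank: 1 / position of the first relevant result, or 0 if none.
    public static double ReciprocalRank(IList<Judgment> results)
    {
        for (int i = 0; i < results.Count; i++)
        {
            if (results[i] >= Judgment.Good) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    // NDCG at N with (2^rating - 1) gains and log2(1 + position) discounts,
    // normalized against the ideal ordering of the same judgments.
    public static double NdcgAtN(IList<Judgment> results, int n)
    {
        List<Judgment> ideal = new List<Judgment>(results);
        ideal.Sort();
        ideal.Reverse();
        double idealDcg = Dcg(ideal, n);
        return idealDcg == 0.0 ? 0.0 : Dcg(results, n) / idealDcg;
    }

    private static double Dcg(IList<Judgment> results, int n)
    {
        double dcg = 0.0;
        for (int i = 0; i < n && i < results.Count; i++)
        {
            dcg += (Math.Pow(2, (int)results[i]) - 1) / (Math.Log(i + 2) / Math.Log(2));
        }
        return dcg;
    }
}

For example, a first page whose top five results are judged Excellent, Bad, Good, Fair, Bad yields a precision at 5 of 2/5 = 0.4 and a reciprocal rank of 1.0.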
Set Up a Collection of Queries
Full-text search queries, as with language in general, are known to follow the Zipf distribution curve. This is the "80/20" pattern visible in every query log: a small number of queries are repeated many times, making up the largest number of log entries. The remainder of the log entries consists of a very long trail of rare queries, with unique queries at the very end of the log.
The frequent queries are typically shorter. Search engines behave differently when processing single-term and multiterm queries, so search metrics should include queries from both ends of this curve. It is tempting to assume that focusing on the most common queries is sufficient. Taken to the extreme, that assumption would make automatic search unnecessary: a typical organization could employ a small staff of writers to construct the best results for the top 100 or 200 unique queries. They could craft ideal results, review them on a regular schedule, and guarantee optimal results nine out of ten times that a query is submitted.
The drawback is in the trailing end: although nine out of ten entries in the log are frequent queries, doing poorly on rare queries means failing nine out of ten unique queries, and every user submits some of them. Users quickly realize that the search engine can handle only a small set of trivial requests. Good metrics must reflect the full frequency spectrum: if you need search software at all, you need software that does well on both rare and common queries.
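One practical way to build such a balanced collection is to stratify the query log by frequency and sample from each stratum. The following C# sketch is a minimal illustration of that idea under stated assumptions: the log contains one query per line, and the file path, thresholds, and sample sizes are placeholders.

using System;
using System.Collections.Generic;
using System.IO;

public static class QuerySampler
{
    // Builds a balanced evaluation set: up to headCount of the most frequent
    // queries plus a random sample of tailCount rare (low-frequency) queries.
    public static List<string> BuildBalancedSample(string logPath, int headCount, int tailCount)
    {
        // Count query frequencies; the log is assumed to contain one query per line.
        Dictionary<string, int> frequency = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string line in File.ReadAllLines(logPath))
        {
            string query = line.Trim().ToLowerInvariant();
            if (query.Length == 0) continue;
            int count;
            frequency.TryGetValue(query, out count);
            frequency[query] = count + 1;
        }

        // Sort unique queries by descending frequency (the Zipf "head" first).
        List<KeyValuePair<string, int>> sorted = new List<KeyValuePair<string, int>>(frequency);
        sorted.Sort(delegate(KeyValuePair<string, int> a, KeyValuePair<string, int> b)
        {
            return b.Value.CompareTo(a.Value);
        });

        List<string> sample = new List<string>();
        for (int i = 0; i < headCount && i < sorted.Count; i++)
        {
            sample.Add(sorted[i].Key);
        }

        // Randomly sample from the long tail (queries seen only once or twice).
        List<string> tail = new List<string>();
        foreach (KeyValuePair<string, int> pair in sorted)
        {
            if (pair.Value <= 2) tail.Add(pair.Key);
        }
        Random random = new Random();
        for (int i = 0; i < tailCount && tail.Count > 0; i++)
        {
            int index = random.Next(tail.Count);
            sample.Add(tail[index]);
            tail.RemoveAt(index);
        }
        return sample;
    }
}

In practice, you would also stratify by the other query classes described in the next section.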
Other Query Classes
Query frequency and query length are just two aspects of query classes that you must recognize and balance in the collection of queries. Each organization has additional categories and classes of queries that it might want to measure specifically. The following are a few examples of query classes that can have specialized characteristics and that you should represent in a well-balanced evaluation collection of queries. In addition, when evaluations are repeated to track improvements over time, you must review the query set periodically to handle queries whose meaning is time sensitive or age sensitive.
**Time-sensitive queries** On the Microsoft intranet, queries frequently reflect code names, project names, or team names. Code names follow certain products until their public release, at which time the products assume new, official names. Typically, the code name is quickly forgotten. Like products, projects and even teams and whole organizations go through naming changes and periodic reorganization. It is important to track this class of common queries and retire them before they become obsolete and contaminate the collected relevance judgment data.
Another class of time-sensitive queries relates to periodic, cyclical events, such as annual holidays or conventions, employee or shareholder meetings, or performance reviews. The number of queries related to such events rises and falls with the calendar. References to the 2007 company meeting might be excellent results in 2007, but the same page in the archive might be a frustrating result the following year.
**Context-sensitive queries** Many query intents are context-sensitive, meaning that "what is a good result" depends on "who are you, where do you work, and what do you do?" Good results can vary widely, depending on the person's location, job description, and so on; yet many such queries are submitted to central portals. Tracking these as a class enables evaluation of improvements.
**Class equivalents** Some queries belong to a class of equivalent queries: if we do well on one, we are likely to do well on all. For example, most queries for a specific employee name are equivalent. (Famous or important employees or managers are a reasonable exception that you might want to track specifically; similarly, you might want to specifically track user names that for some reason expose bugs and weaknesses in the retrieval system.)
Another example of this principle is the class of queries that seek to locate a colleague by office address or internal telephone number. A good handling of one set of these queries likely helps all of them. Because evaluation can be expensive, there is little return to be had from tracking more than a representative set.
**Acronyms** A big part of corporate vocabulary is expressed in acronyms. You should track these as a class to determine relative strengths or improvements of retrieval approaches.
Annotate Queries with User Intents
Short queries consist of a single term or, less frequently, two terms. Naturally, the shorter the query, the more ambiguous and vague it is. When users type terms such as "Windows" or "Test," what are they really looking for? The result that would help them depends on the context, on the information need that hides behind the short query. Users might expect to get many different results, more or less relevant to their need, and then choose the one or two that serve best. Or they might fully expect to engage in a short cycle of query refinement, submitting longer or different queries based on the result sets. Indeed, users might have few expectations of the search experience. But no matter how casual users are with queries, the chance to increase satisfaction always resides in giving them the best result for the original intent. Merely providing a set of results that covers all possible intents equally is a low-risk strategy; it can be improved by removing results that are not relevant at all (for any possible intent, sense, or meaning of a query), but it can never maximize user satisfaction.
A search engine can maximize relevance only if it can satisfy the intent hidden behind the user's imperfect text query. The development team for SharePoint Search evaluated performance against specific user intents. To achieve this, each query was annotated with a short description of the original user intent. Evaluators then judged documents against strict criteria of relevance to the specifically described user scenario associated with each query.
This effort was time-consuming and demanding, but using intents as the key in the evaluation data allowed us to build a metrics database that tracks our performance against the ultimate goal: to maximize relevance and provide "results that satisfy the true user goal," not simply to provide "results that could satisfy some users or some possible meaning for this query."
The following excerpt describes the query annotation philosophy that guided the development team for SharePoint Search as it built its collection of queries and annotated them with intents. It is an excerpt from an internal FAQ that is used to train evaluation contributors.
Create good intent descriptions in the relevance evaluation system.
An intent description should explain what a single specific user wants when he or she submits a query in a single instance. It must be the following:
Credible (a single user has a real reason to type X to find out Z)
Probable (for the query "Longhorn," a Microsoft code name, it is not valuable to track our success in finding information about sheep and cattle)
Specific (users don't ask about both Longhorn sheep and software together, ever)
Do not try to track multiple intents at one time.
Descriptions that cover more than a single intent skew the metrics, because they cause the system to track the union of successes across all possible intents rather than the rate of success for each. Following is an example of a badly phrased query description that tracks multiple intents:
Query: X media device
Description: When will the X connector ship? What will it cost? What does it consist of? Is there a beta on the intranet? What features does it have? Can I play variable-rate MP3 using the X device?
The query intent description should help the evaluator identify good and bad results.
Evaluators need help identifying relevant and irrelevant results. They need a good definition of what the user is looking for. Even if the information need is very basic, for example, "just tell me about X," you must explain and define what X is, so that multiple evaluators can assess relevance of results with equal success. Following is an example of an unhelpful intent description for the query "X": "Tell me about X".
Following is an example of a good intent description, because it educates the evaluator and gives a standard by which to judge results: "X is a <short description>. Tell me about X." The intent description does not assume X is unambiguous or that the evaluator is an expert on X.
Intent description should always describe something specific that a specific user intends to find.
Generate "Candidate" Results for Evaluations
With a collection of queries prepared and properly annotated with short descriptions of intent, we are one step closer to gathering user evaluations. However, before we can begin gathering, we must decide which documents (results) users are to judge or rank as relevant or not against each query in our collection.
In a user-facing search interface, results are commonly a list of 10 or 20 items per page. Evaluation is not a simple matter of ensuring that each of these documents is judged and then generating the metrics. Simply generating metrics for the results returned by the search engine is limited, because the metrics do not indicate how to improve a given implementation. Using the metrics to compare two different implementations is also questionable unless the collection of queries and indexed documents is identical, and it is useful only if both implementations are monolithic and will not be modified.
In real-world situations, certainly during product development but also during deployment, you can modify or improve an implementation, and the metrics can help determine the direction to follow. To provide metrics that can do this, the evaluations must address a range of possible results that is broader than the narrow set returned by a specific implementation.
For example, suppose (as is the case) that matching on a term within the document title contributes more to relevance than matching elsewhere in the result, and should therefore be given more weight. A weight of 1.0 means that an occurrence of the term in the title counts the same as a single occurrence in the body; a weight of 20.0 means that an occurrence in the title counts as much as 20 occurrences in the body of the result.
After we determine that titles are important, how do we determine the level of their importance? Is a term 1.2, 20, or 50 times as important in the title of the result as in the body?
Instincts and estimations are a good starting point for the investigation, but they should not be used to determine the best value. To do so without the risk of doing harm, you must use an evaluation system. The range of possible results must be extended to include documents that match on an array of possible values. For variation on a single parameter, as in the Title property weight example, generating the candidates for evaluation could follow the algorithm roughly described in the following pseudocode.
For each possible Title property weight, from min to max
    Configure the system with that property weight for Title
    Run a query and get a set of results back
    For each result
        If the result is not in the current set of candidates for evaluation,
        then add it to the set
    End For
End For
Note
This is just an example, and not the process you must always use when optimizing the weight setting for a property. Many parameters of the ranking calculation are interdependent, and to find the optimal combination that includes the new parameter, you would need to retune all of the parameters. This is not possible without the proper tools; however, if you see the metrics improve, at least you know that you are moving in the right direction for improved relevance.
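As a rough C# illustration of this candidate-generation loop, the sketch below sweeps the Title weight and accumulates the distinct result URLs it observes. Setting prop.Weight uses the administration objects shown later in this article; the RunQueryAndGetUrls delegate is hypothetical and stands in for whatever query mechanism (for example, the Enterprise Search query object model or web service) you use to retrieve results, and the sweep range and step are placeholders.

using System;
using System.Collections.Generic;
using Microsoft.SharePoint;
using Microsoft.Office.Server.Search.Administration;

public class CandidateGenerator
{
    // Hypothetical helper: runs the query against the live index and returns
    // the URLs of the top results. Implement it with the query mechanism you use.
    public delegate IList<string> RunQueryAndGetUrls(string query);

    // Collects the union of top results observed while sweeping the Title weight.
    public static ICollection<string> CollectCandidates(
        string siteUrl, string query, float minWeight, float maxWeight, float step,
        RunQueryAndGetUrls runQuery)
    {
        Dictionary<string, bool> candidates = new Dictionary<string, bool>();
        Schema sspSchema = new Schema(SearchContext.GetContext(new SPSite(siteUrl)));

        foreach (ManagedProperty prop in sspSchema.AllManagedProperties)
        {
            if (prop.Name != "Title") continue;

            for (float weight = minWeight; weight <= maxWeight; weight += step)
            {
                // Configure the system with this candidate weight for Title.
                // (Persisting the change may involve additional steps; see the SDK
                // topic on changing the weight setting for a managed property.)
                prop.Weight = weight;

                // Run the query and add any previously unseen results to the pool.
                foreach (string url in runQuery(query))
                {
                    candidates[url] = true;
                }
            }
        }
        return candidates.Keys;
    }
}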
Tuning property weights and adding a new, custom property definition with specialized weights to the ranking formula are two of the most common user requests when customizing SharePoint Search ranking. However, more variables are available for customization. The Microsoft Office SharePoint Server 2007 SDK includes topics that address programmatic access to the property weight and property length normalization constants.
In addition, the ranking formula offers a list of other constants that you can vary programmatically. Varying the following constants can, for example, optimize retrieval quality for specific file types, such as XML or Microsoft Office Word documents.
Changing the constants for custom environments should follow a rigorous evaluation as described in this article. Table 1 describes the parameters you can customize.
Table 1. Types of parameters to customize
Parameter | Description |
---|---|
k1 | Saturation constant for term frequency. |
kqir | Saturation constant for click distance. |
wqir | Weight of click distance for calculating relevance. |
kud | Saturation constant for URL depth. |
wud | Weight of URL depth for calculating relevance. |
languageprior | Weight for ranking applied to content in a language that does not match the language of the user. |
filetypepriorhtml | Weight of HTML content type for calculating relevance. |
filetypepriordoc | Weight of Microsoft Office Word content type for calculating relevance. |
filetypepriorppt | Weight of Microsoft Office PowerPoint content type for calculating relevance. |
filetypepriorxls | Weight of Microsoft Office Excel content type for calculating relevance. |
filetypepriorxml | Weight of XML content type for calculating relevance. |
filetypepriortxt | Weight of plain text content type for calculating relevance. |
filetypepriorlistitems | Weight of list item content type for calculating relevance. |
filetypepriormessage | Weight of Microsoft Office Outlook e-mail message content type for calculating relevance. |
Harvest Evaluator Judgments
The SharePoint Search development team had to adopt an effective means of harvesting relevance judgments given the following constraints:
The need for corporate evaluators.
Unlike users of Internet searches, the typical user of Enterprise Search is the Knowledge Worker: one who is familiar with corporate documents that are, generally, restricted to corporate readership. As a result, you cannot pay general users to evaluate and judge query results; you must rely on the pool of corporate users. Also, corporate users are not interchangeable; documents are proprietary and cannot be shared and read across different intranet environments.
The need for balanced evaluation collection.
Passively collecting judgments from volunteers is attractive, but limited in reach. We posted a link on the corporate search results page offering the opportunity to give judgments, and promoted the opportunity on the intranet sites. This generated volunteer judgments, but the contributions tended to cluster around the queries that were easiest to judge: unambiguous queries with familiar original intents, matched with easy-to-evaluate documents. Unfortunately, many queries are more difficult to judge because they apply to less familiar subject matter, or the result set requires more effort to read and evaluate.
We adopted an event approach: collecting evaluations was concentrated into a two-day or week-long period. Each event encouraged participation by raffling individual prizes to good contributors and by promoting friendly competition among teams.
Tuning Relevance
You can tune relevance by modifying the weight and length normalization settings for managed properties, or by modifying the global ranking parameters. Performing a full optimization is equivalent to a search in a more than 20-dimensional space, which is impossible without proper machine-learning tools. However, you can make changes in a small (for example, one or two) number of dimensions informally, if you can properly evaluate the accuracy of the search engine before and after the changes. Even if you cannot find an optimal combination of settings, you can still determine whether or not search relevance improves.
Usually, custom relevance tuning makes sense if the search administrator is familiar with the specific aspects of the content that is being crawled that are not addressed by any of the built-in relevance features and settings.
Changing the Property Weight
In most cases, modifying the property weight has a greater impact than modifying the length normalization setting. The range of possible values for this setting is 0 to infinity; however, in most cases you would want to configure this setting to a value between 1 (the weight setting for the body of a document) and 70 (the weight setting for the Title managed property). When the value is set to 0, the property is essentially removed from being used in the ranking algorithm.
The following code example shows how to write out the weight setting for all the managed properties in an SSP's Enterprise Search metadata schema.
using System;
using Microsoft.SharePoint;
using Microsoft.Office.Server.Search.Administration;

// Replace <SiteName> with the name of a site using the SSP.
string strURL = "http://<SiteName>";
// Get the Enterprise Search metadata schema for the SSP that serves the site.
Schema sspSchema = new Schema(SearchContext.GetContext(new SPSite(strURL)));
ManagedPropertyCollection properties = sspSchema.AllManagedProperties;
// Write out the name and weight setting of each managed property.
foreach (ManagedProperty prop in properties)
{
    Console.WriteLine("Name: " + prop.Name + " Value: " + prop.Weight.ToString());
}
When you change the weight setting for a managed property, the change takes effect immediately following a restart of the search service.
For information about how to change the weight setting for a specific managed property, see How to: Change the Weight Setting for a Managed Property.
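As a minimal sketch of the change itself, assuming the namespaces and schema objects shown in the previous example, the fragment below locates a managed property by name and assigns a new weight. The property name ("Description") and weight value (15.0f) are examples only, and any additional steps required to persist the change are covered in the linked how-to topic.

// Locate a managed property by name and assign a new weight (example values).
string strURL = "http://<SiteName>";
Schema sspSchema = new Schema(SearchContext.GetContext(new SPSite(strURL)));
foreach (ManagedProperty prop in sspSchema.AllManagedProperties)
{
    if (prop.Name == "Description")
    {
        prop.Weight = 15.0f;
        break;
    }
}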
Property Length Normalization
Modifying the length normalization setting applies only to properties that contain text. For example, consider relevance being calculated for two content items that both contain the query term in the body: a book-length document and a document containing only a short paragraph. The book is likely to contain more instances of the query term, so it might receive a higher rank value, even if the shorter document is just as relevant to the user's search. The length normalization setting addresses this issue, providing consistency in ranking calculations for text properties, regardless of the amount of text within the property.
The range of possible values for this setting is 0 to 1. If this setting is 0 for a managed property, length normalization is turned off for that property; length normalization has the greatest influence for properties that have a setting of 1.
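As a rough sketch of why this setting is constrained to the 0-to-1 range, length normalization in BM25-style ranking functions (consistent with the k1 saturation constant listed in Table 1, although the exact Office SharePoint Server 2007 formula is not reproduced here) typically scales the term frequency for a property as follows, where b is the length normalization setting, DL is the length of the property text, and AVDL is the average length of that property across the index:

TF' = \frac{TF}{(1 - b) + b \cdot \frac{DL}{AVDL}}

With b = 0, the property's length has no effect on the term frequency; with b = 1, the term frequency is fully normalized by the ratio of the property's length to the average length.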
For managed properties that contain long text, you usually want to set this to a value near 0.7, which is the approximate setting for the body property. For managed properties that contain a small amount of text but that are important to relevance, use a value close to the Title managed property's setting, which is approximately 0.5.
You must use the Enterprise Search Administration object model to view the length normalization settings for different properties. The following code example demonstrates how to do this.
using System;
using Microsoft.SharePoint;
using Microsoft.Office.Server.Search.Administration;

// Replace <SiteName> with the name of a site using the SSP.
string strURL = "http://<SiteName>";
// Get the Enterprise Search metadata schema for the SSP that serves the site.
Schema sspSchema = new Schema(SearchContext.GetContext(new SPSite(strURL)));
ManagedPropertyCollection properties = sspSchema.AllManagedProperties;
// Write out the name and length normalization setting of each managed property.
foreach (ManagedProperty prop in properties)
{
    Console.WriteLine("Name: " + prop.Name + " Value: " + prop.LengthNormalization.ToString());
}
Conclusion
You can tune relevance in Enterprise Search by adjusting the weight and length normalization settings for managed properties. However, you should ensure that you are properly testing the effects of your changes on the overall relevance, preferably by using an evaluation system with relevance judgments assigned to each document-query pair.
Having a set of relevance judgments at the individual document level enables you to make subsequent changes to the ranking parameters without having to reexamine every query and every document. This is because it is likely that most of the documents returned with the new settings will be labeled already, and only the missing judgments will have to be filled in.
This article discusses considerations for implementing this type of framework, and describes some approaches the development team for SharePoint Search took during the relevance evaluation process for Enterprise Search.