Plan the index schema (FAST Search Server 2010 for SharePoint)

Article
03/09/2011

Updated: February 10, 2011

This article contains planning considerations for the index schema in Microsoft FAST Search Server 2010 for SharePoint. The index schema specifies which managed properties can be searched in the search index, and which indexing and query features are associated with these properties.

In this article:

Index Schema overview
Crawled and Managed Properties
Relevance features
Query refinement features

Index Schema overview

You use the index schema to configure the following features:

Which properties to include in the index. You define the mapping from crawled properties to managed properties, and the associated index features.
Full-text indexes. This defines how to apply full-text queries against a given set of managed properties.
Rank profiles. This defines how to achieve a result set that is sorted by rank.
Query Refinement. This describes how statistical information about managed properties can be returned in query results and used for query refinement.

You should consider the index schema strategy before you deploy a full-scale FAST Search Server 2010 for SharePoint farm. Plan the overall index schema strategy before you start indexing large amounts of content; otherwise, you may need to re-index all the content for the changes to take full effect. It is possible to make incremental changes to the mapping without any service interruption or search downtime, but it is very inconvenient to apply major changes after you have indexed a large amount of content.

If your deployment will index many millions of documents, we recommend that you tune the index schema and the associated end-user search features on a smaller test installation with a relevant subset of the content that you want to index.

The Index Schema plan must take into consideration two main aspects:

The main goal for the index schema plan is to define the desired feature set for your application.
Certain index schema features have a significant effect on the dimensioning of the FAST Search Server 2010 for SharePoint farm. Enabling these features may have a significant impact on resource usage in the farm, and may therefore affect the sizing of your farm.

This article discusses key aspects of the index schema that you should take into consideration in the planning phase. The following articles provide additional details on various aspects of the Index Schema:

Optimize search relevance (FAST Search Server 2010 for SharePoint). This topic provides detailed recommendations on search relevance tuning, including relevance aspects of the index schema.
Manage index schema (FAST Search Server 2010 for SharePoint). This topic provides examples of how to manage the index schema by using Windows PowerShell cmdlets.
Plan for performance and capacity (FAST Search Server 2010 for SharePoint). This topic provides additional details on the performance impact of certain index profile related features, which may affect the sizing of your search system.

Crawled and Managed Properties

Indexed items consist of several properties, reflecting the actual content and the metadata for the items.

Crawled properties

Crawled properties are metadata extracted from content sources in order to make the data available for searching. Crawled properties are typically reported by the indexing connectors, but may also be created during item processing by an IFilter or a property extractor.

A Crawled Property is uniquely defined by Name, Propset and VariantType.

Each crawled property belongs to a crawled property category, which is a high-level grouping of crawled properties based on the IFilter and protocol handler (determined by the indexing connector and the data source) that are used to extract the metadata from the content.

Examples of categories:

Business Data – metadata that is associated with content in the Business Data Catalog.
Mail – this metadata is associated with Microsoft Exchange Server.
Office – metadata contained in Microsoft Office documents such as Word, Excel, PowerPoint, etc.
People – metadata that is associated with the people profiles in SharePoint. The majority of these are also mapped to various managed properties from Active Directory and SharePoint information.
Web – HTML metadata associated with web pages.

A subset of all crawled properties is automatically mapped to the default full-text index. This means that a simple keyword query will match the content of all these properties. Not all crawled properties are mapped, because a number of them contain metadata that is irrelevant or that may have a negative effect on search relevance. The following conditions decide whether a crawled property is automatically mapped:

Only crawled properties with variant types that map to a string or a list of strings are mapped.
Crawled properties that are known to provide undesired content in the search index are excluded by setting their IsMappedToContents property to “false”.
Since every crawled property belongs to a category (determined by its propset), the category has a Boolean property (MapToContents) that sets the default value of the IsMappedToContents property of new crawled properties.
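
The conditions above can be read as a small decision rule. The following Python sketch is illustrative only and does not use the product's API: the classes, the STRING_VARIANT_TYPES labels, and the helper function are hypothetical. Only the IsMappedToContents and MapToContents property names come from this article.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for variant types that map to a string or a list of strings.
STRING_VARIANT_TYPES = {"string", "string_list"}


@dataclass
class Category:
    name: str
    map_to_contents: bool  # models the category's MapToContents default


@dataclass
class CrawledProperty:
    name: str
    propset: str
    variant_type: str
    category: Category
    is_mapped_to_contents: Optional[bool] = None  # None means "not set explicitly"

    def __post_init__(self):
        # A new crawled property inherits the default value from its category.
        if self.is_mapped_to_contents is None:
            self.is_mapped_to_contents = self.category.map_to_contents


def in_default_full_text_index(prop: CrawledProperty) -> bool:
    """Apply both conditions: string-like variant type and IsMappedToContents."""
    return prop.variant_type in STRING_VARIANT_TYPES and bool(prop.is_mapped_to_contents)


office = Category("Office", map_to_contents=True)
title = CrawledProperty("Title", "example-propset", "string", office)
noise = CrawledProperty("Noise", "example-propset", "string", office,
                        is_mapped_to_contents=False)
print(in_default_full_text_index(title), in_default_full_text_index(noise))
```

In this model, a property is excluded either because its variant type is not string-like or because IsMappedToContents was explicitly set to false, regardless of the category default.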

For more details on crawled property mapping, see Manage crawled and managed properties (FAST Search Server 2010 for SharePoint).

Managed properties

Managed properties are metadata that may be searched or used in other ways, such as being displayed in search results.

The crawled properties will contain a large number of different metadata properties. A key phase of your deployment planning is to determine the mapping of these crawled properties to managed properties. In its simplest form, a search index can contain the searchable representation of the body and title of a document. But you will quickly experience the power of mapping and indexing the various metadata of your content sources. By using the FAST Search Server 2010 for SharePoint schema administration services, you can explore the actual crawled properties of the content sources and decide on a mapping to managed properties. You can then assign features to the managed properties that add value for end users when they run their queries.

The default index schema provides default mappings that are adapted to common content formats. As you optimize the system for relevance, look at the quality of the content in managed properties, determine whether there are other crawled properties that have better quality for your content, and update the mappings.

You should perform an initial tuning of the crawled property mapping on a test installation with a limited amount of content. This makes it much easier to test your changes.

You can enable query refinement for a managed property by using a refiner configuration.

You can associate a managed property with one or more full-text indexes.

Relevance features

You can enable and change a set of features that affect the relevance sorting of query results. This article mainly focuses on the performance effect of these features, because you may need to understand that effect before you size your FAST Search Server 2010 for SharePoint farm. For more details on how you can optimize the relevance of your FAST Search Server 2010 for SharePoint farm installation, see Tune relevance factors (FAST Search Server 2010 for SharePoint).

Full-text indexes

Multiple managed properties may be grouped into a full-text index. This allows a query to be executed across several managed properties at the same time. Full-text indexes enable dynamic ranking of queries (results sorted by relevancy). When you type a set of words in the search box of your query front-end, this typically leads to a query against the default full-text index named content. It is also possible to query individual managed properties separately, but such query matches do not contribute to the query result ranking.

A full-text index will typically contain a set of managed properties that represents the content of the item that you are querying. This includes the body of the item, the title, the URL, and so on.

In certain cases it may be desired to define multiple full-text indexes for different kinds of queries or different applications. Although this gives a large amount of flexibility, it will have a certain performance cost for disk space and use of system resources such as file descriptors. It is therefore not recommended to define more than 10 full-text indexes inside an index schema.

Rank profiles

Customizing the rank profiles and creating new rank profiles has little effect on static system resources such as disk and memory. Rank profile features are generally query-time parameters that do not affect the indexing of items or the associated disk space usage. Rank profile changes mainly affect query performance, as outlined in the following list.

Stop-word threshold. This is an important parameter that prevents queries for very common words from taking too many resources to evaluate. To still provide a fair relevancy ranking for items that match such a term, you should use the importance level feature within the index schema.
Managed property boost. This is an efficient way to achieve a targeted relevance boost for documents whose managed properties have certain values. Each managed property boost setting adds to the evaluation time for all queries. Therefore, be careful not to define too many such boosts within the same rank profile. It is better to define multiple rank profiles, each with targeted managed property boost settings.

For further details on rank profile features, see About the rank profile (FAST Search Server 2010 for SharePoint).

Full-text sorting

Full-text result sorting based on managed properties enables you to obtain an alphabetical sorting of the result set instead of the default sorting based on relevancy (ranking). Providing efficient sorting across the result set requires additional data structures in the index, and this feature is therefore configurable per managed property.

Defining many managed properties that have sorting enabled will have a significant effect on memory usage in the query matching component.

You can control this feature via the managed property SortableType parameter in the index schema.

Consider using the configuration value LatentSortable if you want to prepare the index data structures for result sorting, but do not yet want to enable the feature for query evaluation. When you use this option, the data structures required for result sorting are not loaded into main memory, and the option therefore has no performance effect. You can later change the setting from latent to active in order to enable the feature. In that case the change takes effect immediately (there is no requirement to re-index items).

Hit highlighted summary

FAST Search Server 2010 for SharePoint includes a configurable automatic summary generator that can generate hit highlighted summaries for selected properties in query results based on the input query. You can control this feature via the managed property SummaryType parameter in the index schema. By default the hit highlighted summary is configured for the body and title properties.

Configuring hit highlighted summary creation for other managed properties will have some performance effect on query result creation, in particular if the managed property on average contains a lot of text.

A key performance parameter that affects hit highlighted summary creation is the managed property MaxResultSize parameter in the index schema. This parameter determines how much textual content from the managed property is stored with the index. For managed properties that are not configured for hit highlighted summary, this parameter determines how much content is returned in the query results, which has a direct effect on query performance, in particular on disk accesses and network I/O. For managed properties that are configured for hit highlighted summary, this parameter affects the processing load of creating the hit highlighted summary for each hit in the query hit list.

Asian language relevancy optimization

Chinese, Japanese and Korean languages require different character/word normalization than most other languages. These languages do not use spaces consistently to mark token boundaries; texts in these languages must be tokenized by a language-specific tokenization component. We refer to these languages as CJK languages.

FAST Search Server 2010 for SharePoint performs the language specific tokenization based on automatic language detection for the indexed items and the end-user’s locale setting, but also includes an alternative normalization approach named substring search.

Substring search, often known as N-gram search, is typically applied to managed properties that are considered difficult to tokenize automatically. These texts often contain many rare words or new words, such as product names or words rarely found in the tokenizer’s system dictionary.

The feature can also be considered when recall (the overall number of documents retrieved) is considered much more important than precision (high relevancy of the results). Without substring search enabled, a CJK query may, in certain cases, be tokenized incorrectly and therefore return a meager or empty result list. This will never occur if substring search is used, as all N-gram substrings of each token will be indexed, and also N-grams spanning token boundaries. By using this feature, you will improve the recall (more matching items found), but may also reduce the precision and return more items than desired.
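
The recall and precision trade-off described above can be illustrated with a minimal Python sketch of N-gram matching. This is not the product's implementation: the function names are hypothetical, the N-gram size of 2 is an assumption (this article does not state the size that is used), and the matching rule is deliberately simplified.

```python
from typing import List


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Return all character n-grams of text, in order.

    The text is treated as one character sequence, so n-grams also
    span what a tokenizer would consider token boundaries.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def substring_match(query: str, document: str, n: int = 2) -> bool:
    """Simplified match: every n-gram of the query occurs in the document.

    A real engine would also check that the n-grams are adjacent and in
    order; this sketch only checks membership.
    """
    doc_grams = set(char_ngrams(document, n))
    query_grams = char_ngrams(query, n)
    return bool(query_grams) and all(g in doc_grams for g in query_grams)


document = "東京都庁舎"  # "Tokyo Metropolitan Government Building"
print(char_ngrams(document))             # ['東京', '京都', '都庁', '庁舎']
print(substring_match("京都", document))  # True
```

The query 京都 (Kyoto) matches the document even though the document is about Tokyo, because the bigram 京都 spans the boundary between 東京 and 都. The item is found regardless of how the text is tokenized (higher recall), but it may not be what the user wanted (lower precision).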

You can control this feature via the managed property SubstringEnabled parameter in the index schema.

Note that substring search will have a significant effect on the size of the index for these managed properties. We therefore do not recommend that you use the feature on free text, but you may consider it for metadata that contains domain-specific product names, codes, and so on.

Query refinement features

Query refinement features provide end users with relevant refinement options for their queries. Refinement enables drilling down into a query result by using aggregated statistical data computed for the query result. This is typically used for metadata associated with the indexed items, such as creation date, author, and person names appearing in the item. By using the refinement options, you can refine your query to present only items created during a certain time period, or to display only items that reference a given person.

FAST Search Server 2010 for SharePoint supports two kinds of query refiners, deep refiners and shallow refiners.

Deep refiners

The query refinement is based on the aggregation of managed property statistics for all of the results of a search query. The indexer creates aggregation data that is used in the query matching process. The advantage of this type is that the refinement options reflect all the items that match a query. This is usually the recommended mode, but defining many deep refiners may have a significant effect on memory usage in the query matching component.

Consider using the configuration parameter LatentRefinement if you want to prepare the index data structures for deep refinement, but do not yet want to enable the feature for query evaluation. When you use this option, the data structures required for deep refinement are not loaded into main memory, and the option therefore has no performance effect. You can later change the setting from latent to active in order to enable the feature. In that case the change takes effect immediately (there is no requirement to re-index items).

Important:
Deep string navigators having many unique values will have significant performance impact on internal I/O communication between the query matching node and the query processing node (if on different servers). If your installation has many index columns, this interface may become a bottleneck. In this case, consider using the configuration parameter CutoffMaxBuckets to limit the number of refinement bins to be evaluated on each index column.
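
The following Python sketch illustrates why limiting the number of bins per index column reduces the data sent between nodes. It is a conceptual model only. It assumes that each index column keeps its most frequent bins and that the query processing node sums the bins it receives. The actual CutoffMaxBuckets selection rule is not described in this article, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def column_buckets(values: Iterable[str],
                   max_buckets: Optional[int] = None) -> Dict[str, int]:
    """Count refiner values in one index column.

    If max_buckets is set, keep only the most frequent bins
    (assumed cutoff behavior, for illustration only).
    """
    return dict(Counter(values).most_common(max_buckets))


def merge_buckets(per_column: List[Dict[str, int]]) -> Counter:
    """Sum the bins received from all index columns."""
    merged: Counter = Counter()
    for buckets in per_column:
        merged.update(buckets)
    return merged


columns = [
    ["a", "a", "a", "b", "b", "c"],  # matching values in column 1
    ["b", "b", "b", "a", "a", "d"],  # matching values in column 2
]
full = [column_buckets(col) for col in columns]
cut = [column_buckets(col, max_buckets=2) for col in columns]
print(sum(len(b) for b in full), sum(len(b) for b in cut))  # bins transferred
print(merge_buckets(cut))
```

With the cutoff, fewer bins cross the network, but rare values (here c and d) no longer appear among the refinement options.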


Shallow refiners

The query refinement is based on the aggregation of managed property statistics for the top 100 results of a search query. The refinement result data is created during result processing. Because the refinement is limited to the top matching results, you may be unable to find results hidden deeper in the query results. On the other hand, this refinement option does not affect the indexing process and therefore takes effect immediately after it is enabled.

Shallow refiners will have significant performance effect on the query processing node and will reduce the query performance. Consider using deep refiners instead.
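
The difference in coverage between the two refiner types can be illustrated with a short Python sketch. It models only which results contribute to the refinement counts, not how or where the product computes them, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def deep_refiner(results: List[Dict[str, str]], prop: str) -> Counter:
    """Aggregate a managed property over every item that matches the query."""
    return Counter(r[prop] for r in results if prop in r)


def shallow_refiner(results: List[Dict[str, str]], prop: str,
                    top_n: int = 100) -> Counter:
    """Aggregate only over the top-ranked results (list is sorted by rank)."""
    return deep_refiner(results[:top_n], prop)


# 150 matching items: the 100 best-ranked are by Alice, the rest by Bob.
results = [{"author": "Alice"}] * 100 + [{"author": "Bob"}] * 50
print(deep_refiner(results, "author"))     # Counter({'Alice': 100, 'Bob': 50})
print(shallow_refiner(results, "author"))  # Counter({'Alice': 100})
```

In this example the shallow refiner never offers Bob as a refinement option, because all of Bob's items rank below the top 100 results.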

Change history

Date	Description	Reason
February 10, 2011	2011/02/07	Content update
May 12, 2010	Initial publication
