Web Crawler XML configuration reference
Applies to: FAST Search Server 2010
Topic last modified: 2016-11-29
The FAST Search Web crawler automatically retrieves information from Web sites and passes this information to the Microsoft FAST Search Server 2010 for SharePoint index. The FAST Search Web crawler is configured by creating an XML configuration file formatted as specified in this article and submitting it to the Web crawler by using the crawleradmin.exe command-line tool.
The format specified in this document is also used by the crawlercollectiondefaults.xml file, which contains the default values for all options for new crawl collections. When you modify that file, you change the defaults for all new collections. The default values are used for any option that is not specified in the XML configuration created for a specific crawl collection.
These configuration files must be formatted in compliance with the XML schema. This document includes a Simple configuration and a Typical configuration example of a configuration file. For an overview of the elements and sections in the configuration file, refer to the table in Web Crawler XML configuration quick reference.
Key terminology
Web site refers not to a SharePoint site, but to the content on a Web site such as www.contoso.com.
Host name refers to either "contoso" in http://contoso/ or "download.contoso.com" in http://download.contoso.com/. It can be either fully qualified or not. In this document, the difference between a Web site and a host name is that a Web site describes the actual site and its content, whereas the host name is the network name that is used to reach a given Web server. A single site might have multiple host names.
Creating a new crawl configuration
Note
To modify a configuration file, verify that you meet the following minimum requirements: You are a member of the FASTSearchAdministrators local group on the computer where FAST Search Server 2010 for SharePoint is installed.
Follow these steps to create a new crawl configuration using this XML configuration format:
Copy one of the three supplied crawl configuration templates found in <FASTSearchFolder>\etc (where <FASTSearchFolder> is the path of the folder where you have installed FAST Search Server 2010 for SharePoint, for example C:\FASTSearch) to a new file such as MyCollection.xml, or create a new file. Edit the file in a text editor to include the elements and settings that you must have.
Note
Use a text editor (for example, Notepad) to edit the configuration file. Do not use a general-purpose XML editor.
Run crawleradmin.exe -f MyCollection.xml to add the crawl configuration to the crawler. Replace MyCollection.xml with the name that you gave the file in step 1.
See crawleradmin.exe reference for more information.
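For reference, a minimal crawl collection configuration might look like the following sketch. The collection name ("MyCollection"), the start URI, and the include_domains filter are placeholder values only; the available elements, attributes, and sections are described in the rest of this article.
<?xml version="1.0"?>
<CrawlerConfig>
<DomainSpecification name="MyCollection">
<attrib name="start_uris" type="list-string">
<member> https://www.contoso.com/ </member>
</attrib>
<section name="include_domains">
<attrib name="exact" type="list-string">
<member> www.contoso.com </member>
</attrib>
</section>
</DomainSpecification>
</CrawlerConfig>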
Customizing crawlercollectiondefaults.xml
Warning
Any changes that you make to this file will be overwritten and lost if you:
- Run the Set-FASTSearchConfiguration Windows PowerShell cmdlet.
- Install a FAST Search Server 2010 for SharePoint update or service pack.
Remember to reapply your changes after you run the Set-FASTSearchConfiguration Windows PowerShell cmdlet or install a FAST Search Server 2010 for SharePoint update or service pack.
Note
To modify a configuration file, verify that you meet the following minimum requirements: You are a member of the FASTSearchAdministrators local group on the computer where FAST Search Server 2010 for SharePoint is installed.
To edit this file:
Edit crawlercollectiondefaults.xml in a text editor to include the elements and settings that you must have. Use the existing file in <FASTSearchFolder>\etc\ as a starting point.
Note
Use a text editor (for example Notepad) to change crawlercollectiondefaults.xml. Do not use a general-purpose XML editor.
Run nctrl.exe restart crawler to restart the FAST Search Web crawler process with the options that you set in step 1.
Web Crawler XML configuration quick reference
This table lists the elements in the Web Crawler XML configuration format. The elements can appear in any order with the following exceptions. CrawlerConfig holds the DomainSpecification element. The primary elements of SubDomain, Login, and Node occur inside the DomainSpecification element. The section and attrib sub-elements can occur in any of the primary elements, in any order. The member sub-elements must appear inside an attrib element only.
<CrawlerConfig>
<DomainSpecification>
<SubDomain/>
<Login/>
<Node/>
<attrib>
<member/>
</attrib>
<section/>
</DomainSpecification>
</CrawlerConfig>
Typically, you will include both attrib and section sub-elements in SubDomain, Login, and section elements. The Node element may contain all these elements and sub-elements.
Element | Description |
---|---|
CrawlerConfig | This top-level element specifies that the XML following it is a Web crawler configuration object. |
DomainSpecification | This element specifies a crawl collection. |
SubDomain | This element specifies the configuration of crawl sub collections. |
Login | This element is used for HTML forms-based authentication. |
Node | This element overrides configuration parameters in a crawl collection or a crawl sub collection for a particular node scheduler. |
attrib | This sub-element specifies a configuration setting, either by its value or by a set of member elements. |
member | This sub-element specifies a configuration setting in a list. |
section | This sub-element specifies a section that contains multiple settings grouped by type. A table listing all possible sections follows. |
This table defines the section options in the Web Crawler XML configuration format. Sections cannot occur inside the CrawlerConfig element.
Section name | Description |
---|---|
include_domains | Defines a set of host name filters that specify which URIs to include in a crawl collection |
exclude_domains | Defines a set of host name filters that specify which URIs to exclude from a crawl collection |
include_uris | Defines a set of URI rules that specify which URIs to include in a crawl collection |
exclude_uris | Defines a set of URI rules that specify which URIs to exclude from a crawl collection |
log | Specifies logging behavior for the Web crawler process |
storage | Specifies how the Web crawler stores content and metadata |
pp | Specifies the post processing behavior for a node scheduler |
ppdup | Specifies duplicate server settings |
feeding | Consists of at least one section element that specifies how to send a representation of the crawl collection to the indexing engine |
cachesize | Configures the cache sizes for the Web crawler process |
http_errors | Specifies how to handle HTTP/HTTPS error response codes and conditions |
ftp_errors | Specifies how to handle response codes and error conditions for FTP URIs |
workqueue_priority | Specifies the priority levels for the crawl queues, and specifies the rules and modes used to insert URIs into and extract URIs from the queues |
link_extraction | Specifies which kind of hyperlinks to follow |
limits | Specifies fail-safe limits for a crawl collection |
focused | Configures focused scheduling |
passwd | Configures credentials for Web sites that require authentication |
ftp_acct | Specifies FTP accounts for crawling FTP URIs |
exclude_headers | Specifies items to exclude from the crawl, based on the contents of the HTTP header fields |
variable_delay | Specifies time slots that use a different request delay |
adaptive | Specifies the adaptive crawling options |
weights | Each URI is given a score in the adaptive crawling process. The weights section must occur inside an adaptive section. |
sitemap_weights | <URL> entries in a sitemap can contain a changefreq element, which specifies how frequently a URI is likely to change. The string values are converted into a numeric weight for adaptive crawling. The sitemap_weights section must occur inside an adaptive section. |
site_clusters | Specifies configuration parameters that override the crawler’s usual behavior of routing host names in a node scheduler |
crawlmode | Limits the span of a crawl collection |
post_payload | Specifies content to submit in HTTP POST requests |
rss | Initializes and configures RSS feed support in a crawl collection |
logins | This is a special case of a Login element; multiple Login elements are merged into a logins section. Either a logins section or one or more Login elements are required when you define HTML forms-based authentication. You must use logins to remove a login because of the way partial configurations work. Note that exporting a configuration from the crawler with crawleradmin returns the Login element. |
parameters | Sets the authentication credentials that are used in an HTML form. Must occur in a Login element or a logins section. |
subdomains | Specifies the configuration of crawl sub collections. This is a special case of a SubDomain element; multiple SubDomain elements are merged into a subdomains section. You must use subdomains to remove a subdomain because of the way partial configurations work. Note that exporting a configuration from the crawler with crawleradmin returns the SubDomain element. |
Web Crawler XML configuration file format
XML elements in the configuration file are enclosed in angle brackets (< and >); empty elements are closed with />.
The basic element format is as follows:
<attrib name=" value " type=" value "> value </attrib>
For example:
<attrib name="accept_compression" type="boolean"> yes </attrib>
Elements, section names, attributes, and attribute values are case-sensitive. Attribute names and types must be enclosed in quotation marks (" "). An element definition can span multiple lines. Spaces, carriage returns, line feeds, and tab characters are ignored in an element definition.
For example:
<attrib
name=" accept_compression "
type="boolean"
> yes </attrib
>
Tip
For long parameter definitions, position values on separate lines and use indentation to make the file easier to read.
The <CrawlerConfig> element is a special case and is required. All other elements are contained within the <CrawlerConfig> element, and the element is closed with </CrawlerConfig>.
The basic structure of the XML file is in the following example:
<?xml version="1.0"?>
<CrawlerConfig>
<DomainSpecification>
...
</DomainSpecification>
</CrawlerConfig>
You can add comments anywhere, delimited by <!-- and -->.
CrawlerConfig
This top-level element specifies that the XML following it is a Web crawler configuration object. A Web crawler configuration file can contain only one CrawlerConfig XML element.
DomainSpecification
This element specifies a crawl collection.
Example
<CrawlerConfig>
<DomainSpecification name="sp">
...
</DomainSpecification>
</CrawlerConfig>
Replace "sp"
with the crawl collection name.
attrib
This element specifies a configuration option, either a single value or a list using the member element.
Attributes
Name | Type | Value | Meaning |
---|---|---|---|
info |
string |
A text description of the crawl collection. |
|
fetch_timeout |
integer |
<seconds> |
Specifies the maximum downloading time, in seconds, for a Web item. Increase this value if you expect to download large Web items from slow Web servers. Default: 300 |
allowed_types |
list-string |
Specifies valid Web item MIME types. The Web crawler process discards other MIME types. This configuration parameter supports wildcard expansion of a whole field. Wildcards are represented by an asterisk character. For example: "text/*" or "*/*" but not "*/html" or "application/ms*". Default:
|
|
force_mimetype_detection |
boolean |
yes|no |
Specifies that the Web crawler process uses its own MIME type detection on items. In most cases, Web servers return the MIME type of Web items when they are downloaded, as part of the HTTP header. If this option is enabled, Web items will get tagged with the MIME type that looks most accurate: either the one received from the Web server or the result of the crawler’s detection. Default: no |
allowed_schemes |
list-string |
HTTP HTTPS FTP |
Specifies the URI schemes that the Web crawler should process. Default: HTTP |
ftp_passive |
boolean |
yes|no |
Specifies that the Web crawler uses passive FTP mode. Default: yes |
domain_clustering |
boolean |
yes|no |
Specifies whether to route host names from the same domain to the same site manager process. Useful when you are dealing with host names that must share information such as cookies, because this information is not exchanged between site manager processes. If enabled in a multiple node configuration, host names on the same domain (for example, www.contoso.com and forums.contoso.com) will also be routed to the same node scheduler. Default for single node: no Default for multiple node: yes |
max_inter_docs |
integer |
positive integer, or no value |
Specifies the maximum number of items to crawl before interleaving Web sites. By default, the crawler will crawl a Web site to exhaustion, or until the maximum number of Web items per Web site is reached. However, the crawler can be configured to crawl "batches" of Web items from Web sites at a time, interleaving between Web sites. This attribute specifies how many Web items to consecutively crawl from a server before the crawler interleaves and starts crawling other servers. The crawler will return to crawling the former server when resources are freed up. Default: empty (disabled) |
max_redirects |
integer |
<value> |
Specifies the maximum number of HTTP redirects to follow from a URI. Default: 10 |
diffcheck |
boolean |
yes|no |
Specifies that the Web crawler performs duplicate detection. Duplicate detection is performed by checking whether two or more Web items have the same content. Default: yes |
near_duplicate_detection |
boolean |
yes|no |
Specifies that the Web crawler must use a less strict duplicate detection algorithm. In this case duplicate items are detected by identifying a unique pattern of words. Default: no |
max_uri_recursion |
integer |
<value> |
Use this attribute to check for repeating patterns in URIs. The option specifies the maximum number of times a pattern can be repeated before the resulting URI is discarded. A value of 0 disables the test. For example, https://www.contoso.com/widget linking to https://www.contoso.com/widget/widget is a repetition of 1 element. Default: 5 |
ftp_searchlinks |
boolean |
yes|no |
Specifies that the Web crawler should search for hyperlinks in items downloaded from FTP servers. Default: yes |
use_javascript |
boolean |
yes|no |
Specifies whether JavaScript support should be enabled in the Web crawler. If enabled, the Web crawler will download, parse/execute, and extract links from any external JavaScript. Note: JavaScript processing is resource intensive and should not be enabled for large crawls. Note: Processing JavaScript uses the Browser Engine component. For more information, see beconfig.xml reference. Default: no |
javascript_keep_html |
boolean |
yes|no |
Specifies what to submit to the indexing engine. If this parameter is set to yes, the HTML that results from the JavaScript processing is used. Otherwise, the original HTML item is used. Do not use this option if the use_javascript configuration parameter is not set to yes. |
javascript_delay |
real |
<seconds> An empty value means that the Web crawler uses the same value as the delay configuration parameter |
Specifies the delay, in seconds, to use when retrieving dependencies associated with an HTML item that contains JavaScript. Default: 0 (no delay) |
exclude_exts |
list-string |
<comma delimited list of file_extensions> |
Specifies file name extensions that should be excluded from the crawl. Default list: empty |
use_http_1_1 |
boolean |
yes|no |
Specifies that the Web crawler should use HTTP/1.1. When set to no, HTTP/1.0 is used. Default: yes |
accept_compression |
boolean |
yes|no |
Specifies that the Web crawler should accept compressed Web items from the Web server. This parameter has no effect if the use_http_1_1 configuration parameter is not enabled. Default: yes |
dbswitch |
integer |
<value> |
Specifies the number of crawl cycles that a Web item can remain in the crawl store and index without being found by the Web crawler, before it is deleted. The dbswitch_delete parameter determines the action that should be taken for Web items that are not seen for this number of crawl cycles. Note: Setting this value very low (to 1 or 2) may accidentally delete Web items. Default: 5 |
dbswitch_delete |
boolean |
yes|no |
The Web crawler tries to detect Web items that were removed from the Web servers. This parameter determines what to do with those Web items. They can be deleted immediately or put in the work queue for retrieval to make sure that they are no longer available. When set to yes, Web items that are too old are deleted. When set to no, Web items are scheduled for re-retrieval and only deleted if they no longer exist on the Web server. This check is performed independently for each Web site, at the start of each refresh cycle. Note: You should keep this option at the default value. Default: no |
html_redir_is_redir |
boolean |
yes|no |
Use this parameter with html_redir_thresh to treat META Refresh tags inside HTML Web items as if they were HTTP redirects. When enabled, the Web item that contains the META refresh will not be indexed. When disabled, they are treated as regular Web items and are indexed. Default: yes |
html_redir_thresh |
integer |
<value> |
Specifies the maximum number of seconds that a META Refresh tag inside an HTML Web item can be treated as an HTTP redirect. This parameter is ignored if html_redir_is_redir is not set. Consider the following example:
If the number that is specified in the Default: 3 |
robots_ttl |
integer |
<seconds> |
Specifies how frequently the Web crawler should retrieve the robots.txt file from a Web site. The frequency must be specified in seconds. Default: 86400 |
use_sitemaps |
boolean |
yes|no |
Enables the Web crawler to discover and parse sitemaps. The Web crawler uses the lastmod attribute in a sitemap to determine whether a Web item was modified since the last time that the sitemap was retrieved. Web items that were not modified will not be re-crawled. An exception is if the collection uses adaptive refresh mode. In adaptive refresh mode, the crawler uses a sitemap’s priority and changefreq attributes to determine how often a Web item should be crawled. Other tags found in sitemaps are stored in the crawler’s meta database and are submitted to indexing as crawled properties. Note: Most sitemaps are specified in robots.txt. Thus, the robots attribute should be enabled for the best results. Default: no |
max_pending |
integer |
<value> |
Specifies the maximum number of concurrent HTTP requests to a single Web site at any time. Default: 2 |
robots_auth_ignore |
boolean |
yes|no |
Specifies whether the Web crawler should ignore robots.txt if an HTTP 40x authentication error is returned by the Web server. When set to no, the Web crawler will not crawl the Web site upon encountering the error. The robots.txt standard lists this behavior as a hint for Web crawlers to ignore the Web site completely. However, incorrect configuration of a Web server can incorrectly exclude a site from the crawl. Enable this option to make sure that the Web site is crawled. Default: yes |
robots_tout_ignore |
boolean |
yes|no |
Specifies whether the Web crawler should ignore the robots.txt rules if the request for robots.txt times out. Before crawling a Web site, the Web crawler requests the robots.txt file from the Web server. By the robots.txt standard, if the request for this file times out, the Web site will not be crawled. Setting this parameter to yes ignores the robots.txt rules in this case, and the Web site is crawled. Note: You should keep this option set to no if you do not own the Web site being crawled. Default: no |
rewrite_rules |
list-string |
Specifies a set of rules that are used to rewrite URIs. A rewrite rule has two components: an expression to match ( The format of the rewrite rule is as follows: |
|
extract_links_from_dupes |
boolean |
yes|no |
Specifies that the Web crawler should extract hyperlinks from duplicate Web items. Even when two Web items have duplicate content, they may have different hyperlinks, which could lead to more content being found by the Web crawler. Default: no |
use_meta_csum |
boolean |
yes|no |
Specifies that the Web crawler includes META tags in the generated duplicate detection fingerprint. Default: no |
csum_cut_off |
integer |
<value> |
Specifies the maximum number of bytes to use to generate the duplicate detection fingerprint. If this parameter is set to 0, the feature is disabled (i.e., unlimited/all bytes will be used). Default: 0 |
if_modified_since |
boolean |
yes|no |
Specifies whether the Web crawler should send HTTP headers that contain a value of Default: yes |
use_cookies |
boolean |
yes|no |
Specifies whether the Web crawler should send and store cookies. This feature is automatically enabled for Web sites that use a login, but can also be turned on for all Web sites. Default: no |
uri_search_mime |
list-string |
<values> |
Specifies the MIME types from which the Web crawler extracts hyperlinks. This configuration parameter supports wildcard expansion only at the whole field level. A wildcard is represented by the asterisk character; for example, Default:
|
max_backoff_counter |
integer |
<value> |
Together with max_backoff_delay, this option controls the algorithm by which a Web site experiencing connection failures is contacted less frequently. For each consecutive network error, the request delay for that Web site is increased by the original delay setting, up to a maximum of max_backoff_delay seconds. This delay is maintained until a request is successfully completed, but for no more than max_backoff_counter requests. If the maximum count is reached, crawling of the Web site is temporarily stopped. Otherwise, when network issues affecting the Web site are resolved, the internal backoff counter starts decreasing, and the request delay is decreased by half on each successful Web item download until the original delay setting is reached. Default: 50 |
max_backoff_delay |
integer |
<seconds> |
See max_backoff_counter. Default: 600 |
delay |
real |
<seconds> |
Specifies how frequently (in seconds) the Web crawler can retrieve a Web item from a Web site. Default: 60.0 |
refresh |
real |
<minutes> |
Specifies how frequently (in minutes) the Web crawler should start a new crawl refresh cycle. The action that is performed at the time of refresh is determined by the refresh_mode setting. Default: 1500.0 |
robots |
boolean |
yes|no |
Specifies that the Web crawler should obey the rules found in robots.txt files. Default: yes |
start_uris |
list-string |
Specifies start URIs for the Web crawler. The Web crawler needs either start_uris or start_uri_files to start crawling. Note: If the crawl includes any IDNA host names, enter them using UTF-8 characters, not in the DNS encoded format. |
|
start_uri_files |
list-string |
Specifies a list of files that contain start URIs. These files are stored in plain text file format, with one start URI per line. Note: In a multiple node deployment, these files need only be available on the server that runs the multi-node scheduler. |
|
max_sites |
integer |
<value> |
Specifies the maximum number of Web sites that can be crawled at the same time. In a multi-node Web crawler deployment, this value applies per node scheduler, not to the whole Web crawler. For example, if max_sites is set to 5 and you have 10 sites to crawl, 5 sites must finish crawling before the crawler can crawl the other 5. Note: A high max_sites value can adversely affect system resource usage. Default: 128 |
mirror_site_files |
list-string |
Specifies a list of files that contain mirror sites for a specified host name. A mirror site is a replica of an already existing Web site. This file uses the following format: a plain text file that has a space-separated list of host names, with the preferred name listed first. Note: In a multiple node Web crawler deployment, this file must be available on all servers where a node scheduler is deployed. |
|
proxy |
list-string |
Specifies a set of HTTP proxies that the Web crawler uses to fetch Web items. Each proxy is specified by using the following format:
The password can be encrypted as specified in passwd. |
|
proxy_max_pending |
integer |
<value> |
Specifies a limit on the number of outstanding open connections per HTTP proxy. Default: maximum value of INT32 |
headers |
list-string |
<header> |
Specifies additional HTTP headers to add to the request sent to the Web servers. The current default is as follows: |
cut_off |
integer |
Specifies the maximum number of bytes in an item. A Web item larger than this size limit is discarded or truncated depending on the value of the truncate configuration parameter. If no cut_off configuration parameter is specified, this option is disabled. Default: no cut-off |
|
truncate |
boolean |
yes|no |
Specifies whether a Web item should be truncated when a Web item exceeds the specified cut_off threshold. Default: yes |
check_meta_robots |
boolean |
yes|no |
Specifies that the Web crawler should follow the For example, a typical META tag might be:
or
The special value Default: yes |
obey_robots_delay |
boolean |
yes|no |
Specifies that the Web crawler should follow the crawl-delay directive (if present) in robots.txt files. Otherwise, the delay setting is used. Default: no |
key_file |
string |
Specifies the path of an SSL client certificate key file that is used for HTTPS connections. This feature is used for Web sites that require the Web crawler to authenticate itself using a client certificate. This option must be used with cert_file. Note: In a multi-node Web crawler deployment, the file must be available on all node schedulers. |
|
cert_file |
string |
Specifies the path of an X509 client certificate file that is used for HTTPS connections. This option must be used with key_file. |
|
max_doc |
integer |
<value> |
Specifies the maximum number of Web items to download from a Web site. Default: 100000 |
enforce_delay_per_ip |
boolean |
yes|no |
Specifies that the Web crawler limits requests to Web servers whose names map to a shared IPv4 or IPv6 address. This parameter depends on the delay configuration parameter. Default: yes |
wqfilter |
boolean |
yes|no |
Specifies whether the Web crawler should use a bloom filter that removes duplicate URIs from the crawl queues. Default: yes |
smfilter |
integer |
<value> |
Specifies the maximum number of bits in the bloom filter that removes duplicate URIs from the queue associated with the node scheduler. A bloom filter is a space-efficient probabilistic data structure (a bit array) which is used to test whether an element is a member of a given set. The test may yield a false positive but never a false negative. Default: 0 |
mufilter |
integer |
<value> |
Specifies the maximum number of bits used in the bloom filter that removes duplicate URIs sent from a node scheduler to the multi-node scheduler. We recommend that you turn on this filter for large crawls, with a value of 500000000 (500 megabits). Default: 0 |
umlogs |
boolean |
yes|no |
Specifies whether all logging is sent to the multi-node scheduler for storage. If this parameter is not enabled, logs reside only on the node schedulers. Default: yes |
sort_query_params |
boolean |
yes|no |
Specifies whether the Web crawler should sort the parameters in the query component of a URI. Typically, query components are key-value pairs that are separated by semicolons or ampersands. When this configuration parameter is set, the query is sorted alphabetically by the key name. Default: no |
robots_timeout |
integer |
<seconds> |
Specifies the maximum number of seconds that the Web crawler can use to download a robots.txt file. Default: 300 |
login_timeout |
integer |
<seconds> |
Specifies the maximum number of seconds that the Web crawler can use for a login request. Default: 300 |
send_links_to |
string |
Specifies a crawl collection name to which all extracted hyperlinks are sent. |
|
cookie_timeout |
integer |
<seconds> |
Specifies the maximum number of seconds a session cookie is stored. A session cookie is a cookie that has no expiration date. Default: 300 |
refresh_when_idle |
boolean |
yes|no |
Specifies whether the Web crawler should trigger a new crawl refresh cycle when it becomes idle. This option should not be used in a multi-node installation. Default: no |
refresh_mode |
string |
append|prepend|scratch|soft|adaptive |
Specifies the refresh mode of a crawl collection. Valid values are as follows:
Default: scratch |
Examples
<attrib name="delay" type="real"> 60.0 </attrib>
<attrib name="max_doc" type="integer"> 10000 </attrib>
<attrib name="use_javascript" type="boolean"> no </attrib>
<attrib name="info" type="string">
My Web crawl collection crawling my intranet.
</attrib>
<attrib name="allowed_schemes" type="list-string">
<member> http </member>
<member> https </member>
</attrib>
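As a further illustration, the following sketch combines the list-string start_uris parameter with the refresh and refresh_mode parameters; the start URIs and values shown are placeholders only:
<attrib name="start_uris" type="list-string">
<member> https://www.contoso.com/ </member>
<member> https://intranet.contoso.com/ </member>
</attrib>
<attrib name="refresh" type="real"> 1440.0 </attrib>
<attrib name="refresh_mode" type="string"> scratch </attrib>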
member
This specifies an element in a list of option values.
The member element can only be used inside an attrib element.
Example
<attrib name="allowed_schemes" type="list-string">
<member> http </member>
<member> https </member>
</attrib>
section
This element groups a set of related options. A section element contains attrib elements.
Attributes
Attribute | Value | Description |
---|---|---|
name | <name> | Specifies the name of the section. Supported sections are described in this article. |
Example
<section name="crawlmode">
<attrib name="fwdlinks" type="boolean"> no </attrib>
<attrib name="fwdredirects" type="boolean"> no </attrib>
<attrib name="mode" type="string"> FULL </attrib>
<attrib name="reset_level" type="boolean"> no </attrib>
</section>
include_domains
This section is a set of host name filters that specify which URIs to include in a crawl collection. An empty section matches any host name.
Attributes
The following table specifies attrib elements for this section.
Name | Type | Value | Meaning |
---|---|---|---|
exact |
list-string |
Specifies a list of host names. If the host name of a URI exactly matches one of these host names, the URI is included by this rule. |
|
prefix |
list-string |
Specifies a list of host names. If the host name of a URI begins with one of these host names, the URI is included by this rule. |
|
suffix |
list-string |
Specifies a list of host names. If the host name of a URI ends with one of these host names, the URI is included by this rule. |
|
regexp |
list-string |
Specifies a list of regular expressions. If the host name of a URI matches one of these regular expressions, the URI is included by this rule. |
|
ipmask |
list-string |
Specifies a list of IPv4 address masks. If the IPv4 address of a retrieved URI matches one of these IPv4 address masks, the URI is included by this rule. An IPv4 address mask must follow one of the following formats:
|
|
ip6mask |
list-string |
Specifies a list of IPv6 address masks. If the IPv6 address of a retrieved URI matches one of these IPv6 address masks, the URI is included by this rule. An IPv6 address mask must follow one of the following formats:
|
Example
<section name="include_domains">
<attrib name="exact" type="list-string">
<member> www.contoso.com </member>
<member> www2.contoso.com </member>
</attrib>
<attrib name="prefix" type="list-string">
<member> www </member>
</attrib>
<attrib name="suffix" type="list-string">
<member> .contoso.com</member>
<member> .contoso2.com</member>
</attrib>
<attrib name="regexp" type="list-string">
<member> .*\.contoso\.com </member>
</attrib>
<attrib name="file" type="list-string">
<member> c:\myinclude_domains.txt </member>
</attrib>
</section>
exclude_domains
This section is a set of host name filters that specify which URIs to exclude from a crawl collection. An empty section will not match any host name.
Attributes
See the table in include_domains for the attrib elements for this section.
Example
<section name="exclude_domains">
<attrib name="exact" type="list-string">
<member> www.contoso.com </member>
<member> www2.contoso.com </member>
</attrib>
<attrib name="prefix" type="list-string">
<member> www </member>
</attrib>
<attrib name="suffix" type="list-string">
<member> .contoso.com</member>
<member> .contoso2.com</member>
</attrib>
<attrib name="regexp" type="list-string">
<member> .*\.contoso\.com </member>
</attrib>
<attrib name="file" type="list-string">
<member> c:\myexclude_domains.txt </member>
</attrib>
</section>
include_uris
This section is a set of URI-based rules that specify which URIs to include in a crawl collection. An empty section will match all URIs.
Attributes
The following table specifies attrib elements for this section.
Name | Type | Value | Meaning |
---|---|---|---|
exact |
list-string |
Specifies a list of URIs. If a URI exactly matches one of these URIs, the URI is included by this rule. |
|
prefix |
list-string |
Specifies a list of strings. If a URI begins with one of these strings, the URI is included by this rule. |
|
suffix |
list-string |
Specifies a list of strings. If a URI ends with one of these strings, the URI is included by this rule. |
|
regexp |
list-string |
Specifies a list of regular expressions. If a URI matches one of these regular expressions, the URI is included by this rule. |
Example
<section name="include_uris">
<attrib name="exact" type="list-string">
<member> https://www.contoso.com/documents/doc2.html </member>
</attrib>
<attrib name="prefix" type="list-string">
<member> https://www.contoso.com/documents/ </member>
</attrib>
<attrib name="suffix" type="list-string">
<member> /doc2.html </member>
</attrib>
<attrib name="regexp" type="list-string">
<member> http://.*\.contoso\.com/documents.*</member>
</attrib>
<attrib name="file" type="list-string">
<member> c:\myinclude_uris.txt </member>
</attrib>
</section>
exclude_uris
This section is a set of URI-based rules that specify which URIs to exclude from a crawl collection. An empty section will not match any URIs.
Attributes
See the table in include_uris for the attrib elements for this section.
Example
<section name="exclude_uris">
<attrib name="exact" type="list-string">
<member> https://www.contoso.com/documents/doc2.html </member>
</attrib>
<attrib name="prefix" type="list-string">
<member> https://www.contoso.com/documents/ </member>
</attrib>
<attrib name="suffix" type="list-string">
<member> /doc2.html </member>
</attrib>
<attrib name="regexp" type="list-string">
<member> http://.*\.contoso\.com/documents.*</member>
</attrib>
<attrib name="file" type="list-string">
<member> c:\myexclude_uris.txt </member>
</attrib>
</section>
log
This section specifies logging behavior for the Web crawler process.
Attributes
The following table specifies attrib elements for this section.
Name | Type | Value | Meaning |
---|---|---|---|
fetch |
string |
text|none |
Enable/disable logging of downloaded Web items. Valid values are as follows:
Default: text |
postprocess |
string |
text|xml|none |
Enable/disable logging of node scheduler item post processing. Valid values are as follows:
Default: text |
header |
string |
text|none |
Enable/disable logging of HTTP headers. Valid values are as follows:
|
screened |
string |
text|none |
Enable/disable logging of all screened URIs. Valid values are as follows:
|
scheduler |
string |
text|none |
Enable/disable logging of adaptive crawling. Valid values are as follows:
|
dsfeed |
string |
text|none |
Enable/disable the logging of content submission to the indexing engine. Valid values are as follows:
|
site |
string |
text|none |
Enable/disable logging per crawl site. Valid values are as follows:
|
Example
<section name="log">
<attrib name="dsfeed" type="string"> text </attrib>
<attrib name="fetch" type="string"> text </attrib>
<attrib name="postprocess" type="string"> text </attrib>
<attrib name="screened" type="string"> none </attrib>
<attrib name="site" type="string"> text </attrib>
</section>
storage
This section specifies how the Web crawler stores data and metadata.
Attributes
The following table specifies attrib elements for this section.
Name | Type | Value | Meaning |
---|---|---|---|
datastore |
string |
flatfile|bstore |
Specifies the format for Web item content storage. Valid values are as follows:
Default: bstore |
store_http_header |
boolean |
yes|no |
Specifies that the Web crawler should store the received HTTP header. Default: yes |
store_dupes |
boolean |
yes|no |
Specifies that the Web crawler should store duplicate Web items. Default: no |
compress |
boolean |
yes|no |
Specifies that downloaded items should be compressed before storing them. Default: yes |
compress_exclude_mime |
list-string |
Specifies a set of MIME types of Web items that should not be compressed when stored. Use for Web items that are already compressed, e.g. multimedia formats. If the compress configuration parameter is not set, this parameter is not applicable. |
|
remove_docs |
boolean |
yes|no |
Specifies that the Web crawler should delete Web items from the Web crawler store as soon as they are submitted to the indexing engine. This will reduce disk space requirements for the Web crawler, but will make it impossible to refeed. Default: no |
clusters |
integer |
<value> |
Specifies the number of clusters to use for storage in a crawl collection. Web items are distributed among these storage clusters. Default: 8 |
defrag_threshold |
integer |
<percentage> |
A non-zero value that specifies the threshold value (of used capacity) before defragmenting a data storage file. When the used space is less than the defrag_threshold, the file is eligible for defragmentation to reclaim fragmented space caused by stored Web items. Database files are compacted regardless of fragmentation level. The default of 85% means there must be 15% reclaimable space in the data storage file to trigger defragmentation. A value of 0 disables defragmentation. This setting is only applicable to the Default: 85 |
uri_dir |
string |
<path> |
Specifies a path for storing file lists of all hyperlinks that are extracted from Web items. Each site manager process uses a separate file. The name of a URI file is constructed by concatenating the process PID with |
Example
<section name="storage">
<attrib name="store_dupes" type="boolean"> no </attrib>
<attrib name="datastore" type="string"> bstore </attrib>
<attrib name="compress" type="boolean"> yes </attrib>
</section>
pp
This section specifies the post processing behavior for a node scheduler. Post processing consists of two primary tasks: feeding Web items to the index, and performing duplicate detection.
Attributes
The following table specifies attrib elements for this section.
Name | Type | Value | Meaning |
---|---|---|---|
use_dupservers |
boolean |
yes|no |
Specifies that the Web crawler should use one or more duplicate servers. This option is applicable only in a multi-node installation. Default: no |
max_dupes |
integer |
<value> |
Specifies the maximum number of duplicates to record per Web item. Default: 10 |
stripe |
integer |
<value> |
Specifies the number of data files to distribute the checksum data into. Increasing this value may improve the performance of post processing. Default: 1 |
ds_meta_info |
list-string |
duplicates|redirects|mirrors|metadata |
Specifies the kind of metadata a node scheduler should report to the indexing engine. Valid values are as follows: duplicates (reports URIs that are duplicates of this item), redirects (reports URIs that are redirected to this item), metadata (reports metadata of this item), and mirrors (reports all mirror URIs of this Web item). |
ds_max_ecl |
integer |
<value> |
Specifies the maximum number of duplicates or redirects to report to the indexing engine, as specified by the ds_meta_info configuration parameter. Default: 10 |
ecl_override |
string |
Specifies a regular expression that identifies redirect and duplicate URIs that should be stored and possibly submitted to the indexing engine, even though max_dupes is reached. For example: |
|
ds_send_links |
boolean |
yes|no |
Specifies whether all extracted hyperlinks from a Web item should be sent to the indexing engine. |
ds_paused |
boolean |
yes|no |
Specifies whether a node scheduler should suspend the submission of content to the indexing engine. |
Example
<section name="pp">
<attrib name="max_dupes" type="integer"> 10 </attrib>
<attrib name="use_dupservers" type="boolean"> yes </attrib>
<attrib name="ds_paused" type="boolean"> no </attrib>
</section>
ppdup
This section specifies duplicate server settings.
Attributes
The following table specifies attrib elements for this section.
Name | Type | Value | Meaning |
---|---|---|---|
format |
string |
gigabase|hashlog|diskhashlog |
Specifies the duplicate server database format. Valid values are as follows:
|
cachesize |
integer |
<megabytes> |
Specifies the duplicate server database cache size in megabytes. If the format configuration parameter is set to hashlog or diskhashlog this parameter specifies the initial size of the hash table. |
stripes |
integer |
<value> |
Specifies the number of data files to spread content to. By using multiple files, you can improve the performance of the duplicate server database. |
compact |
boolean |
yes|no |
Specifies whether the duplicate server database should perform compaction. For the hashlog and diskhashlog formats, compaction must be performed either manually with the crawlerdbtool or automatically by enabling this option. Otherwise, disk usage will increase for every record written or updated. Default: yes |
Example
<section name="ppdup">
<attrib name="format" type="string"> hashlog </attrib>
<attrib name="stripes" type="integer"> 1 </attrib>
<!-- 1 GB memory hash -->
<attrib name="cachesize" type="integer"> 1024 </attrib>
<attrib name="compact" type="boolean"> yes </attrib>
</section>
feeding
The feeding section consists of at least one section XML element that specifies how to send a representation of the crawl collection to the indexing engine. Such a section defines a content destination. The name attribute specifies a unique name for the content destination.
Attributes
The following table specifies attrib elements for a content destination section.
Name | Type | Value | Meaning |
---|---|---|---|
collection |
string |
<name> |
Specifies the name of the content collection for submitting Web items. This configuration parameter must be specified in a feeding section. |
destination |
string |
default |
Reserved. This configuration parameter must contain the value default. |
paused |
boolean |
yes|no |
Specifies whether the Web crawler should suspend the submission of content to the indexing engine. Default: no |
primary |
boolean |
yes|no |
Specifies whether this content destination is a primary or secondary content destination. A primary content destination can act on callback information during content submission to the indexing engine. If only one content destination is specified, it will be a primary destination. |
Example
<section name="feeding">
<section name="Global_News">
<attrib name="collection" type="string"> collection_A </attrib>
<attrib name="destination" type="string"> default </attrib>
<attrib name="primary" type="boolean"> yes </attrib>
<attrib name="paused" type="boolean"> no </attrib>
</section>
<section name="Local_News">
<attrib name="collection" type="string"> collection_B </attrib>
<attrib name="destination" type="string"> default </attrib>
<attrib name="primary" type="boolean"> no </attrib>
<attrib name="paused" type="boolean"> no </attrib>
</section>
</section>
cachesize
This section configures the cache sizes for the Web crawler process.
Attributes
The following table specifies attrib elements for this section.
Note
If no default value is specified for an attribute in the table, the Web crawler automatically determines the cache size at run time.
Name | Type | Value | Meaning |
---|---|---|---|
duplicates |
integer |
<value that represents a number of items> |
Specifies the size of the duplicate checksum cache, per site manager process. This cache is used as a first level of duplicate detection at run time. |
screened |
integer |
<value that represents a number of items> |
Specifies the size of the screened URI cache, as the number of hyperlinks. The screened cache filters out duplicate hyperlinks that recently resulted in retrieval failures. |
smcomm |
integer |
<value that represents a number of items> |
Specifies the size of the bloom filter that is used by the cache filtering out duplicate hyperlinks flowing between the node scheduler and site managers. |
mucomm |
integer |
<value that represents a number of items> |
Specifies the size of the bloom filter that is used by the cache filtering out duplicate hyperlinks flowing between the multi-node scheduler and node scheduler. |
wqcache |
integer |
<value that represents a number of items> |
Specifies the size of the cache filtering out duplicate hyperlinks from the Web site crawl queues. |
crosslinks |
integer |
<value that represents a number of items> |
Specifies the size of the crosslink cache. The crosslink cache contains retrieved hyperlinks and referring hyperlinks. It filters out duplicate hyperlinks in the node scheduler if mufilter is not enabled. |
routetab |
integer |
<value> |
Specifies the crawl routing database cache size, in bytes. Default: 1048576 |
pp |
integer |
<value> |
Specifies the post process database cache size, in bytes. Default: 1048576 |
pp_pending |
integer |
<value> |
Specifies the post process pending cache size, in bytes. The pending cache contains entries that were not sent to the duplicate servers. Default: 131072 |
aliases |
integer |
<value> |
Specifies the aliases data mapping database cache size, in bytes. A crawl site can be associated with one or more aliases (alternative host names). Default: 1048576 |
Example
<section name="cachesize">
<!-- Specific cache size values (in number of items) for the following: -->
<attrib name="duplicates" type="integer"> 128 </attrib>
<attrib name="screened" type="integer"> 128 </attrib>
<attrib name="smcomm" type="integer"> 128 </attrib>
<attrib name="mucomm" type="integer"> 128 </attrib>
<attrib name="wqcache" type="integer"> 4096 </attrib>
<!-- Automatic cache size for crosslinks -->
<attrib name="crosslinks" type="integer"> </attrib>
<!-- Cache sizes in bytes for the following -->
<attrib name="routetab" type="integer"> 1048576 </attrib>
<attrib name="pp" type="integer"> 1048576 </attrib>
<attrib name="pp_pending" type="integer"> 1048576 </attrib>
<attrib name="aliases" type="integer"> 1048576 </attrib>
</section>
http_errors
This section specifies how to handle HTTP/HTTPS error response codes and conditions.
Attributes
The following table specifies attrib elements for this section. Because there are multiple values for the name attribute, a description of each purpose is included in the name column.
Name | Type | Value | Meaning |
---|---|---|---|
The name attribute specifies the HTTP/HTTPS/FTP response code number to handle. The character "X" can be used as a wildcard, for example: 4XX. Other valid values are as follows:
|
string |
<value> |
Specifies how the Web crawler handles HTTP/HTTPS/FTP and network errors. Valid options for handling individual response codes are as follows:
If RETRY[:X] is specified for either of these options, the Web crawler will re-download the Web item no more than X times in the same crawl refresh cycle period before failing the attempt. Otherwise, the crawler will not try to download the URI until the next crawl refresh cycle. Default: See Default values for the http_errors section and Default values for the ftp_errors section. |
Default values for the http_errors section
The following table specifies the default values for the http_errors section.
Name | Value | Meaning |
---|---|---|
4xx | DELETE:0 | Delete immediately. |
5xx | DELETE:10 | Delete the tenth time this error is encountered for this URI, usually after 10 crawl cycles. The counter is reset if the URI is successfully retrieved. |
int | KEEP:0 | Do not delete. |
net | DELETE:3, RETRY:1 | Delete the third time. One retry is specified. This means that the URI will be deleted on the next refresh cycle if it still cannot be retrieved. |
ttl | DELETE:3 | Delete the third time. |
Example
<section name="http_errors">
<attrib name="408" type="string"> KEEP </attrib>
<attrib name="4xx" type="string"> DELETE </attrib>
<attrib name="5xx" type="string"> DELETE:10, RETRY:3 </attrib>
<attrib name="ttl" type="string"> DELETE:3 </attrib>
<attrib name="net" type="string"> DELETE:3 </attrib>
<attrib name="int" type="string"> KEEP </attrib>
</section>
ftp_errors
This section specifies how to handle response codes and error conditions for FTP URIs.
Attributes
See the table in http_errors for the attrib elements for this section.
Default values for the ftp_errors section
The following table specifies the default values for the ftp_errors section.
Name | Value | Meaning |
---|---|---|
4xx | DELETE:3 | Delete the third time that this error is encountered for this URI, usually after 3 crawl cycles. The counter is reset if the URI is successfully retrieved. |
550 | DELETE:0 | Delete immediately. |
5xx | DELETE:3 | Delete the third time, same as for 4xx. |
int | KEEP:0 | Do not delete. |
net | DELETE:3, RETRY:1 | Delete the third time. One retry is specified. This means that the URI will be deleted on the next refresh cycle if it still cannot be retrieved. |
Example
<section name="ftp_errors">
<attrib name="4xx" type="string"> DELETE:3 </attrib>
<attrib name="550" type="string"> DELETE:0 </attrib>
<attrib name="5xx" type="string"> DELETE:3 </attrib>
<attrib name="int" type="string"> KEEP:0 </attrib>
<attrib name="net" type="string"> DELETE:3, RETRY:1 </attrib>
<attrib name="ttl" type="string"> DELETE:3 </attrib>
</section>
workqueue_priority
This section specifies the priority levels for the crawl queues, and specifies the rules and modes used to insert URIs into and extract URIs from the queues.
Attributes
The following table specifies attrib elements for this section.
Name | Type | Value | Meaning |
---|---|---|---|
levels |
integer |
<value> |
Specifies the number of priority levels used for the crawl queues. Default: 1 |
default |
integer |
<value> |
Specifies a default priority level that is assigned to URIs in a crawl queue. Default: 1 |
start_uri_pri |
integer |
<value> |
Specifies the priority level for start URIs. See the start_uris and the start_uri_files configuration parameters. Default: 1 |
pop_scheme |
string |
default|rr|wrr|pri |
Specifies the mode used by the Web crawler to extract URIs from the crawl queue. Valid values are as follows:
Default: default |
put_scheme |
string |
default|include |
Specifies which Web crawler mode to use when you insert URIs into the crawl queue. Valid values are as follows:
Default: default |
Priority level section
Within the workqueue_priority section, you can specify a set of sections that define the priority levels and weights of the crawl queues. These sections are only used if the pop_scheme parameter is set to wrr or pri. The name attribute of each such section must be the priority level that it defines, and the priority levels must begin at 1. (See <section name="1"> in the following example.)
The include_domains or include_uris section can be used within each priority level section, as specified in include_domains and include_uris. URIs that match these rules will be queued using the matching priority level. In addition, the following table specifies attrib elements for these sections.
Name | Type | Value | Meaning |
---|---|---|---|
share |
integer |
Specifies a weight to use for each crawl queue. This weight will only be used if the pop_scheme configuration parameter is set to wrr. |
Example
<section name="workqueue_priority">
<attrib name="levels" type="integer"> 2 </attrib>
<attrib name="default" type="integer"> 2 </attrib>
<attrib name="start_uri_pri" type="integer"> 1 </attrib>
<attrib name="pop_scheme" type="string"> wrr </attrib>
<attrib name="put_scheme" type="string"> include </attrib>
<section name="1">
<attrib name="share" type="integer"> 10 </attrib>
<section name="include_domains">
<attrib name="suffix" type="list-string">
<member> web005.contoso.com </member>
</attrib>
</section>
</section>
<section name="2">
<attrib name="share" type="integer"> 5 </attrib>
<section name="include_domains">
<attrib name="suffix" type="list-string">
<member> web002.contoso.com </member>
</attrib>
</section>
</section>
</section>
link_extraction
This section specifies which kind of hyperlinks to follow.
Attributes
The following table specifies attrib elements for this section.
Name | Type | Value | Meaning |
---|---|---|---|
a |
boolean |
yes|no |
Extracts hyperlinks from <a> (anchor) HTML tags. Default: yes |
action |
boolean |
yes|no |
Extracts hyperlinks from action attributes in HTML tags. Default: yes |
area |
boolean |
yes|no |
Extracts hyperlinks from <area> HTML tags. Default: yes |
card |
boolean |
yes|no |
Extracts hyperlinks from the <card> element. Default: yes |
comment |
boolean |
yes|no |
Extracts hyperlinks from comments in a Web item. Default: yes |
embed |
boolean |
yes|no |
Extracts hyperlinks from <embed> HTML tags. Default: yes |
frame |
boolean |
yes|no |
Extracts hyperlinks from <frame> HTML tags. Default: yes |
go |
boolean |
yes|no |
Extracts hyperlinks from <go> tags. Default: yes |
img |
boolean |
yes|no |
Extracts hyperlinks from <img> HTML tags. Default: no |
layer |
boolean |
yes|no |
Extracts hyperlinks from <layer> HTML tags. Default: yes |
link |
boolean |
yes|no |
Extracts hyperlinks from <link> HTML tags. Default: yes |
meta |
boolean |
yes|no |
Extracts hyperlinks from <meta> HTML tags. Default: yes |
meta_refresh |
boolean |
yes|no |
Extracts hyperlinks from META Refresh HTML tags. Default: yes |
object |
boolean |
yes|no |
Extracts hyperlinks from <object> HTML tags. Default: yes |
script |
boolean |
yes|no |
Extracts hyperlinks from <script> HTML tags. Default: yes |
script_java |
boolean |
yes|no |
Extracts hyperlinks from Default: yes |
style |
boolean |
yes|no |
Extracts hyperlinks from <style> HTML tags. Default: yes |
Example
<section name="link_extraction">
<attrib name="action" type="boolean"> yes </attrib>
<attrib name="img" type="boolean"> no </attrib>
<attrib name="link" type="boolean"> yes </attrib>
<attrib name="meta" type="boolean"> yes </attrib>
<attrib name="meta_refresh" type="boolean"> yes </attrib>
<attrib name="object" type="boolean"> yes </attrib>
<attrib name="script_java" type="boolean"> yes </attrib>
</section>
limits
The limits section specifies fail-safe limits for a crawl collection. When the collection exceeds the limit, it enters a "refresh only" crawl mode. This means that only previously-crawled URIs are crawled again.
Attributes
The following table specifies attrib elements for this section.
Name | Type | Value | Meaning |
---|---|---|---|
disk_free |
integer |
<percentage> |
Specifies the percentage of free disk space that must be available for the Web crawler to operate in normal crawl mode (as specified in the crawlmode section). If the free disk space percentage falls below this limit, the Web crawler enters the "refresh only" crawl mode. If the parameter is set to 0, this feature is disabled. Default: 0 |
disk_free_slack |
integer |
<percentage> |
Specifies slack for the disk_free threshold, as a percentage. This option creates a buffer zone around the disk_free threshold. When the free disk space is within this buffer, the Web crawler will not change the crawl mode back to normal. This prevents the Web crawler from switching back and forth between crawl modes when the percentage of free disk space is close to the value specified by the disk_free parameter. When the free disk space percentage exceeds disk_free + disk_free_slack, normal crawling resumes. Default: 3 |
max_doc |
integer |
<value> |
Specifies the number of stored Web items that will cause the crawler to enter "refresh only" crawl mode. Note: The threshold is not an exact limit, because statistical reporting is somewhat delayed compared to crawling. When set to 0, this feature is disabled. Default: 0 |
max_doc_slack |
integer |
<value> |
To prevent the crawler from constantly entering and exiting "refresh only" crawl mode, you can specify a threshold range in addition to the absolute threshold value. The range is defined as (threshold minus slack) to (threshold); within this range, the crawl mode remains unchanged. The max_doc_slack attribute specifies the number of items in this slack, below the max_doc threshold. Default: 1000 |
Example
<section name="limits">
<attrib name="disk_free" type="integer"> 0 </attrib>
<attrib name="disk_free_slack" type="integer"> 3 </attrib>
<attrib name="max_doc" type="integer"> 0 </attrib>
<attrib name="max_doc_slack" type="integer"> 1000 </attrib>
</section>
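For illustration, a hypothetical (non-default) configuration such as the following would switch the collection to "refresh only" crawl mode when free disk space drops below 10 percent, resume normal crawling only above 13 percent, and also enter "refresh only" mode at roughly 500,000 stored items:
<section name="limits">
<!-- Enter "refresh only" mode below 10% free disk; resume above 10% + 3% -->
<attrib name="disk_free" type="integer"> 10 </attrib>
<attrib name="disk_free_slack" type="integer"> 3 </attrib>
<!-- Enter "refresh only" mode at roughly 500,000 stored items -->
<attrib name="max_doc" type="integer"> 500000 </attrib>
<attrib name="max_doc_slack" type="integer"> 1000 </attrib>
</section>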
focused
This section configures focused scheduling. An exclude_domains section can be used within the focused section to exclude host names from this focused scheduling. If no exclude_domains section is defined, all host names are included in the focused scheduling.
Attributes
The following table specifies attrib elements for this section.
| Name | Type | Value | Meaning |
|---|---|---|---|
| languages | list-string | | Lists the languages for items that can be stored by the Web crawler, as specified in ISO-639-1. |
| depth | integer | <value> | Specifies the number of page hops to follow for Web items that do not match the specified languages, as set by the languages configuration parameter. |
Example
In the following example, the crawler will store all items with Norwegian, English, or unknown language content. For all non-specified languages, the crawler will only follow links to 2 levels. In addition, all content under contoso.com is excluded from the language checks and is automatically stored.
<section name="focused">
<!-- Crawl Norwegian, English and content of unknown language -->
<attrib name="languages" type="list-string">
<member> norwegian </member>
<member> unknown </member>
<member> en </member>
</attrib>
<!--Follow hyperlinks containing other languages for 2 levels -->
<attrib name="depth" type="integer"> 2 </attrib>
<!-- Exclude anything under .contoso.com from language checks, -->
<section name="exclude_domains">
<attrib name="suffix" type="list-string">
<member> .contoso.com </member>
</attrib>
</section>
</section>
passwd
This section configures credentials for Web sites that require authentication. The Web crawler supports basic authentication, digest authentication, and NTLM authentication.
Attributes
The following table specifies attrib elements for this section.
| Name | Type | Value | Meaning |
|---|---|---|---|
| name | string | The name attribute must contain a URI or realm. A valid URI behaves as a prefix value, because all hyperlinks extracted at its level or deeper use these authentication settings. | Specifies the credential string, as shown in the following example. The password component of the credential string can be encrypted; if it is not encrypted, it is given in plaintext. An encrypted password is created by using the crawleradmin tool. |
Example
<section name="passwd">
<attrib name="https://www.contoso.com/confidential1/" type="string">
user:password:contoso:auto
</attrib>
</section>
ftp_acct
This section specifies FTP accounts for crawling FTP URIs.
Attributes
The following table specifies attrib elements for this section.
| Name | Type | Value | Meaning |
|---|---|---|---|
| name | string | The value of the name XML attribute is the host name for which this FTP account is valid. | Specifies the user name and password for this FTP account, in the form username:password (see the following example). |
Example
<section name="ftp_acct">
<attrib name="ftp.contoso.com" type="string"> user:pass </attrib>
</section>
exclude_headers
This section is used to exclude Web items from the crawl, based on the contents of the HTTP header fields.
Attributes
The following table specifies attrib elements for this section.
| Name | Type | Value | Meaning |
|---|---|---|---|
| name (the name of the HTTP header to test) | list-string | | Specifies a list of regular expressions. If the value of the specified HTTP header matches one of these regular expressions, the Web item is excluded from the crawl. |
Example
<section name="exclude_headers">
<attrib name="Header Name" type="list-string">
<member> .*excluded.*value </member>
</attrib>
</section>
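For example, to skip Web items that the server labels with a particular Content-Type (the header name is standard HTTP; the pattern is illustrative), a configuration might look like the following:
<section name="exclude_headers">
<!-- Exclude items whose Content-Type header matches this pattern -->
<attrib name="Content-Type" type="list-string">
<member> .*application/x-director.* </member>
</attrib>
</section>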
variable_delay
This section specifies time slots during which a different request rate is used. Outside the specified time slots, the crawler uses the delay configuration parameter as specified in attrib.
Attributes
The following table specifies attrib elements for this section.
| Name | Type | Value | Meaning |
|---|---|---|---|
| name (a time slot in the format DDD:HH.MM-DDD:HH.MM) | string | <value in seconds> or suspend | Specifies the request delay for this time slot, in seconds. A value of suspend specifies that crawling of this crawl collection is suspended during the time slot. |
Example
The following example shows how the Web crawler uses different delay intervals during the week. On Wednesday between 9:00 a.m. and 7:00 p.m. the Web crawler uses a delay of 20 seconds; on Monday between 9:00 a.m. and 5:00 p.m. it suspends crawling; at any other time of the week it uses a delay of 60 seconds.
<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
<DomainSpecification name="variable_example">
<section name="variable_delay">
<attrib name="Wed:09-Wed:19" type="string">20 </attrib>
<attrib name="Mon:09-Mon:17" type="string">suspend</attrib>
</section>
</DomainSpecification>
</CrawlerConfig>
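Assuming that a time slot may span several days, as the DDD:HH.MM-DDD:HH.MM format suggests, a weekend suspension could be sketched as follows:
<section name="variable_delay">
<!-- Suspend crawling from Friday 17:00 until Monday 08:00 -->
<attrib name="Fri:17.00-Mon:08.00" type="string"> suspend </attrib>
</section>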
adaptive
This section specifies the adaptive crawling options. The refresh_mode configuration parameter, specified in attrib, must be set to adaptive for this section to be used by the Web crawler.
The adaptive crawling behavior can be controlled with the weights and the sitemap_weights sections.
Attributes
The following table specifies attrib elements for this section.
| Name | Type | Value | Meaning |
|---|---|---|---|
| refresh_count | integer | <value> | Specifies the number of minor refresh cycles. A refresh cycle can be divided into several fixed-size time intervals called minor refresh cycles. Default: 4 |
| refresh_quota | integer | <percentage> | Specifies the ratio of existing re-crawled URIs to new unseen URIs, expressed as a percentage. Setting the percentage low gives preference to new URIs. Default: 90 |
| coverage_min | integer | <value> | Specifies a minimum number of URIs to crawl per Web site in a minor refresh cycle. Used to guarantee some coverage for small Web sites. Default: 25 |
| coverage_max_pct | integer | <value> | Specifies the maximum percentage of a Web site to re-crawl in a minor refresh cycle. Ensures that small Web sites are not fully crawled in every minor cycle, which would take time away from larger Web sites. Default: 10 |
Example
<section name="adaptive">
<attrib name="refresh_count" type="integer"> 4 </attrib>
<attrib name="refresh_quota" type="integer"> 98 </attrib>
<attrib name="coverage_max_pct" type="integer"> 25 </attrib>
<attrib name="coverage_min" type="integer"> 10 </attrib>
<!-- Ranking weights. Each scoring criteria adds a score between -->
<!-- 0.0 and 1.0 which is then multiplied with the associated -->
<!-- weight below. Use a weight of 0 to disable a scorer -->
<section name="weights">
<attrib name="inverse_length" type="real"> 1.0 </attrib>
<attrib name="inverse_depth" type="real"> 1.0 </attrib>
<attrib name="is_landing_page" type="real"> 1.0 </attrib>
<attrib name="is_mime_markup" type="real"> 1.0 </attrib>
<attrib name="change_history" type="real"> 10.0 </attrib>
</section>
</section>
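Because this section only takes effect when the collection's refresh_mode attribute is set to adaptive, that attribute is typically set alongside it; a minimal sketch:
<attrib name="refresh_mode" type="string"> adaptive </attrib>
<section name="adaptive">
<attrib name="refresh_count" type="integer"> 4 </attrib>
<attrib name="refresh_quota" type="integer"> 90 </attrib>
</section>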
weights
In the adaptive crawling process, each URI is given a score that determines its crawl priority. The score is based on a set of rules; each rule is assigned a weight, set in this section, that determines the rule's contribution to the total score.
Attributes
The following table specifies attrib elements for this section.
| Name | Type | Value | Meaning |
|---|---|---|---|
| inverse_length | real | <value> | Specifies the weight for the inverse length rule. The inverse length rule gives URIs with few path segments (defined by the number of forward slashes) a higher score. URIs with 10 or more slashes receive a score of 0. Default: 1.0 |
| inverse_depth | real | <value> | Specifies the weight for the inverse depth rule. The number of page hops from a start URI is computed; a high score is assigned to URIs that are fewer than 10 page hops away. The rule gives a score of zero for URIs with 10 or more page hops. Default: 1.0 |
| is_landing_page | real | <value> | Specifies the weight for the is_landing_page rule. This rule gives a higher score to a URI that is considered a landing page. A landing page is a URI ending in one of /, /index.html, index.htm, index.php, index.jsp, index.asp, default.html, or default.htm. The rule gives no score to URIs that have query components. Default: 1.0 |
| is_mime_markup | real | <value> | Specifies the weight for the is_mime_markup rule. This rule gives an additional score to pages whose MIME type is specified in the uri_search_mime configuration parameter in attrib. Default: 1.0 |
| change_history | real | <value> | Specifies the weight for the change history rule. This rule scores based on the HTTP "last-modified" header value over time. Web items that change frequently receive a higher score than items that change less frequently. Default: 10.0 |
| sitemap | real | <value> | Specifies the weight for the sitemap rule. The score for the sitemap rule is specified in sitemap_weights. Default: 10.0 |
Example
<!-- Ranking weights. Each scoring criteria adds a score between -->
<!-- 0.0 and 1.0 which is then multiplied with the associated -->
<!-- weight below. Use a weight of 0 to disable a scorer -->
<section name="weights">
<!-- Score based on the number of /'es (segments) in the -->
<!-- URI. Max score with one, no score with 10 or more -->
<attrib name="inverse_length" type="real"> 1.0 </attrib>
<!-- Score based on the number of link "levels" down to -->
<!-- this URI. Max score with none, no score with >= 10 -->
<attrib name="inverse_depth" type="real"> 1.0 </attrib>
<!-- Score added if URI is determined as a "landing page", -->
<!-- defined as e.g. ending in "/" or "index.html". URIs -->
<!-- with query parameters are not given score -->
<attrib name="is_landing_page" type="real"> 1.0 </attrib>
<!-- Score added if URI points to a markup document as -->
<!-- defined by the "uri_search_mime" option. Assumption -->
<!-- being that such content changes more often than e.g. -->
<!-- "static" Word or PDF documents. -->
<attrib name="is_mime_markup" type="real"> 1.0 </attrib>
<!-- Score based on change history tracked over time by -->
<!-- using an estimator based on last modified date given -->
<!-- by the web server. If no modified date returned then -->
<!-- one is estimated (based on whether the document has -->
<!-- changed or not). -->
<attrib name="change_history" type="real"> 10.0 </attrib>
</section>
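To include sitemap-based scoring as well, the sitemap weight from the table above can be added to the same section (shown here with its default value):
<attrib name="sitemap" type="real"> 10.0 </attrib>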
sitemap_weights
The URL entries in a sitemap can contain a changefreq element, which specifies how frequently the URI is expected to change.
Valid string values for this element are as follows: always, hourly, daily, weekly, monthly, yearly, and never. The string values are converted into numeric weights for adaptive crawling. The sitemap_weights section specifies the mapping of string values to numeric weights; these weights are used to calculate the sitemap score in the weights section.
The sitemap contribution to the adaptive crawling score of a URI is calculated by multiplying the numeric weight by the sitemap configuration parameter weight.
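For example, with the default values, a URI whose sitemap entry specifies changefreq daily contributes 0.32 × 10.0 = 3.2 to its adaptive score, whereas an hourly entry contributes 0.64 × 10.0 = 6.4.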
Attributes
The following table specifies attrib elements for this section.
Important
The value of each of these elements must be in the range 0.0 to 1.0.
| Name | Type | Value | Meaning |
|---|---|---|---|
| always | real | <value> | Specifies the weight of the changefreq value always as a numeric value. Default: 1.0 |
| hourly | real | <value> | Specifies the weight of the changefreq value hourly as a numeric value. Default: 0.64 |
| daily | real | <value> | Specifies the weight of the changefreq value daily as a numeric value. Default: 0.32 |
| weekly | real | <value> | Specifies the weight of the changefreq value weekly as a numeric value. Default: 0.16 |
| monthly | real | <value> | Specifies the weight of the changefreq value monthly as a numeric value. Default: 0.08 |
| yearly | real | <value> | Specifies the weight of the changefreq value yearly as a numeric value. Default: 0.04 |
| never | real | <value> | Specifies the weight of the changefreq value never as a numeric value. Default: 0.0 |
| default | real | <value> | Specifies the weight for all URIs that are not associated with a changefreq value. Default: 0.16 |
Example
<section name="sitemap_weights">
<attrib name="always" type="real"> 1.0 </attrib>
<attrib name="hourly" type="real"> 0.64 </attrib>
<attrib name="daily" type="real"> 0.32 </attrib>
<attrib name="weekly" type="real"> 0.16 </attrib>
<attrib name="monthly" type="real"> 0.08 </attrib>
<attrib name="yearly" type="real"> 0.04 </attrib>
<attrib name="never" type="real"> 0.0 </attrib>
<attrib name="default" type="real"> 0.16 </attrib>
</section>
site_clusters
This section overrides the crawler's default routing of host names to node schedulers, ensuring that a group of host names is routed to the same node scheduler and site manager. This is useful when the use_cookies setting is enabled, because cookies are global only within a site manager process. Also, if you know that certain Web sites are closely interlinked, you can reduce internal communication by clustering their host names.
Attributes
The following table specifies attrib elements for this section.
| Name | Type | Value | Meaning |
|---|---|---|---|
| name (the name of the cluster) | list-string | | Specifies a list of host names that should be routed to the same node scheduler. |
Example
<section name="site_clusters">
<attrib name="mycluster" type="list-string">
<member> host1.contoso.com </member>
<member> host2.contoso.com </member>
<member> host3.contoso.com </member>
</attrib>
</section>
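Because cookies are shared only within a single site manager process, clustering is often combined with cookie support. The following sketch uses the use_cookies attribute documented in attrib; the host names are illustrative:
<attrib name="use_cookies" type="boolean"> yes </attrib>
<section name="site_clusters">
<!-- Keep these interlinked hosts on the same node scheduler and site manager -->
<attrib name="ssocluster" type="list-string">
<member> login.contoso.com </member>
<member> portal.contoso.com </member>
</attrib>
</section>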
crawlmode
This section limits the span of a crawl collection.
Attributes
The following table specifies attrib elements for this section.
| Name | Type | Value | Meaning |
|---|---|---|---|
| mode | string | FULL or DEPTH:# | Specifies the depth of the crawl. Valid values are FULL or DEPTH:#, where # is the number of page hops from a start URI. Default: FULL |
| fwdlinks | boolean | yes\|no | Specifies whether to follow hyperlinks that point to a different host name. Default: yes |
| fwdredirects | boolean | yes\|no | Specifies whether to follow external HTTP redirects received from servers. External redirects are HTTP redirects that point from one host name to another host name. Default: no |
| reset_level | boolean | yes\|no | Specifies whether to reset the page hop counter used by mode when a hyperlink to another host name is followed. Default: yes |
Example
<section name="crawlmode">
<attrib name="mode" type="string"> DEPTH:1 </attrib>
<attrib name="fwdlinks" type="boolean"> yes </attrib>
<attrib name="fwdredirects" type="boolean"> yes </attrib>
<attrib name="reset_level" type="boolean"> no </attrib>
</section>
post_payload
This section is used to submit content in HTTP POST requests. The content is submitted to URIs that exactly match a URI or that match a URI prefix.
Attributes
The following table specifies attrib elements for this section.
| Name | Type | Value | Meaning |
|---|---|---|---|
| name (a URI or URI prefix) | string | | Specifies the payload content string. This string is posted to URIs that match the URI or prefix set by the name XML attribute. If the name attribute specifies a URI, an exact match is required. To specify a URI prefix instead, prepend the label prefix: to the URI, as shown in the following example. |
Example
<section name="post_payload">
<attrib name="prefix:https://www.contoso.com/secure" type="string"> variable1=value1&variableB=valueB </attrib>
</section>
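An entry without the prefix: label applies only to the exact URI given in the name attribute; a sketch with a hypothetical URI and payload:
<section name="post_payload">
<!-- Exact match: the payload is posted only to this URI -->
<attrib name="https://www.contoso.com/login/submit" type="string"> token=abc123 </attrib>
</section>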
rss
This section initializes and configures RSS feed support in a crawl collection.
Attributes
The following table specifies attrib elements for this section.
| Name | Type | Value | Meaning |
|---|---|---|---|
| start_uris | list-string | | Specifies a list of start URIs that point to RSS feed items. |
| start_uri_files | list-string | | Specifies a list of paths to files that contain URIs pointing to RSS feed items. These files must be plain text, with one URI per line. |
| auto_discover | boolean | yes\|no | Specifies whether the Web crawler should discover new RSS feeds. If this option is not set, only the feeds specified in start_uris and start_uri_files are treated as RSS feeds. Default: no |
| follow_links | boolean | yes\|no | Specifies whether the Web crawler should follow hyperlinks from Web items found in the RSS feed, which is the usual Web crawler behavior. Disable this option to crawl only feeds and the Web items referenced by feeds; crawling then happens only one hop away from a feed. Default: yes |
| ignore_rules | boolean | yes\|no | Specifies whether the Web crawler should crawl all Web items referenced by the RSS feed, regardless of whether they match the include/exclude rules, as specified in include_domains, exclude_domains, include_uris, and exclude_uris. Default: no |
| index_feed | boolean | yes\|no | Specifies whether the Web crawler should send the RSS feeds themselves to the indexing engine, or only the Web items hyperlinked within the feeds. Default: no |
| del_expired_links | boolean | yes\|no | Specifies whether the Web crawler should delete items from the RSS feed when they expire, as defined by max_link_age and max_link_count. Default: no |
| max_link_age | integer | <value> | Specifies the maximum age, in minutes, for a Web item found in an RSS feed. Only applies if the del_expired_links configuration parameter is set to yes. Default: 0 |
| max_link_count | integer | <value> | Specifies the maximum number of hyperlinks that the Web crawler saves for an RSS feed. If the Web crawler encounters more hyperlinks, they expire in first-in-first-out order. Only applies if the del_expired_links configuration parameter is set to yes. Default: 128 |
Example
<section name="rss">
<!-- Attempt to discover new rss feeds, yes/no -->
<attrib name="auto_discover" type="boolean"> yes </attrib>
<attrib name="del_expired_links" type="boolean"> yes </attrib>
<attrib name="follow_links" type="boolean"> yes </attrib>
<attrib name="ignore_rules" type="boolean"> no </attrib>
<attrib name="index_feed" type="boolean"> no </attrib>
<attrib name="max_link_age" type="integer"> 0 </attrib>
<attrib name="max_link_count" type="integer"> 128 </attrib>
<attrib name="start_uris" type="list-string">
<member> http://www.startsiden.no/rss.rss </member>
</attrib>
<!-- Start uri files (optional) -->
<attrib name="start_uri_files" type="list-string">
<member> /usr/fast/etc/rss_seedlist.txt </member>
</attrib>
</section>
logins
This section contains one or more subsections that configure HTML forms-based authentication. Each subsection specifies a login for a specific Web site and must have a unique login name set in its name attribute.
Attributes
The following table specifies attrib elements for this section.
| Name | Type | Value | Meaning |
|---|---|---|---|
| preload | string | <value> | Specifies the full URI of a page to retrieve before processing the login form. |
| scheme | string | http\|https | Specifies the URI scheme of the login Web site. Valid values: http or https |
| site | string | <value> | Specifies the host name of the login form page. |
| form | string | <value> | Specifies the path of the login form. |
| action | string | GET\|POST | Specifies whether the form uses HTTP GET or HTTP POST. Valid values: GET or POST |
| sites | list-string | <value> | Specifies a list of Web sites or host names that the Web crawler should log on to before it begins the crawl process. |
| ttl | integer | <seconds> | Specifies the time, in seconds, that can elapse before another login is required to continue the crawl. |
| html_form | string | <value> | Specifies the URI of the HTML page that contains the login form. |
| autofill | boolean | yes\|no | Specifies whether the Web crawler should try to fill out the HTML login form automatically. The html_form configuration parameter must be specified if this attribute is set to yes. |
| relogin_if_failed | boolean | yes\|no | Specifies whether the Web crawler can attempt to log on to the Web site again after ttl seconds if the login failed. |
Remarks
You can use Login elements as an alternative to the logins section.
Example
<section name="logins">
<section name="mytestlogin">
<!-- Instructs the crawler to "preload" potential cookies by -->
<!-- fetching this page and register any cookies before -->
<!-- proceeding with login -->
<attrib name="preload" type="string">http://preload.contoso.com/</attrib>
<attrib name="scheme" type="string"> https </attrib>
<attrib name="site" type="string"> login.contoso.com </attrib>
<attrib name="form" type="string"> /path/to/some/form.cgi </attrib>
<attrib name="action" type="string">POST</attrib>
<section name="parameters">
<attrib name="user" type="string"> username </attrib>
<attrib name="password" type="string"> password </attrib>
<attrib name="target" type="string"> sometarget </attrib>
</section>
<!-- Host names of sites requiring this login to crawl -->
<attrib name="sites" type="list-string">
<member> site1.contoso.com </member>
<member> site2.contoso.com </member>
</attrib>
<!-- Time to live for login cookie. Will re-log in when expires -->
<attrib name="ttl" type="integer"> 7200 </attrib>
</section>
</section>
parameters
This section sets the authentication credentials that are used in an HTML form. It must be specified in a logins section, or in a Login element. The credential parameters are typically different for each HTML form.
If the autofill configuration parameter is enabled, specify only the variables that are visible in the browser, for example, username and password (or equivalent). In this case, the Web crawler retrieves the HTML page and reads any "hidden" variables that are required to submit the form. A variable value that is specified in this section overrides any value stored in the form.
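A minimal sketch, assuming autofill is enabled in the enclosing login configuration and that the visible form fields are named user and password (hypothetical names); any hidden fields are then read from the page referenced by html_form:
<section name="parameters">
<!-- Only the fields a user would type in the browser -->
<attrib name="user" type="string"> crawleruser </attrib>
<attrib name="password" type="string"> secret </attrib>
</section>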
Attributes
The following table specifies attrib elements for this section.
| Name | Type | Value | Meaning |
|---|---|---|---|
| name (the HTML form variable to set) | string | | Specifies the value of the HTML form variable. |
Example
<section name="parameters">
<attrib name="user" type="string"> username </attrib>
<attrib name="password" type="string"> password </attrib>
<attrib name="target" type="string"> sometarget </attrib>
</section>
subdomains
This section specifies the configuration of crawl sub collections. The subdomains section must contain at least one section XML element, each of which specifies a crawl sub collection. Each crawl sub collection section must have a unique name, set in the name XML attribute.
Remarks
Instead of a subdomains section, you can use a SubDomain element.
You must specify include/exclude rules to limit the scope of a crawl sub collection. These include/exclude rules are as follows: include_domains, exclude_domains, include_uris and exclude_uris.
Only a subset of the configuration parameters specified in attrib can be used in a crawl sub collection. These configuration parameters are as follows:
- accept_compression
- allowed_schemes
- crawlmode
- cut_off
- delay
- ftp_passive
- headers
- max_doc
- proxy
- refresh
- refresh_mode
- start_uris
- start_uri_files
- use_http_1_1
- use_javascript
- use_sitemaps
The refresh configuration parameter of a crawl sub collection must be set lower than the refresh rate of the main crawl collection. The use_javascript, use_sitemaps, and max_doc configuration parameters cannot be used if the include_uris or exclude_uris settings are used to specify the crawl sub collection.
In addition, you can use the rss and the variable_delay sections in a crawl sub collection.
Example
<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
<DomainSpecification name="subcollection_example">
<section name="subdomains">
<section name="subdomain_1">
<section name="include_uris">
<attrib name="prefix" type="list-string">
<member> https://www.contoso.com/index </member>
</attrib>
</section>
<attrib name="refresh" type="real"> 60.0 </attrib>
<attrib name="delay" type="real"> 10.0 </attrib>
<attrib name="start_uris" type="list-string">
<member> https://www.contoso.com/ </member>
</attrib>
</section>
</section>
</DomainSpecification>
</CrawlerConfig>
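Because the rss and variable_delay sections are also allowed inside a crawl sub collection, a feed-oriented sub collection could be sketched as follows (host name and feed URI are illustrative):
<section name="subdomains">
<section name="news_feeds">
<section name="include_domains">
<attrib name="suffix" type="list-string">
<member> news.contoso.com </member>
</attrib>
</section>
<section name="rss">
<attrib name="start_uris" type="list-string">
<member> http://news.contoso.com/feed.rss </member>
</attrib>
</section>
<attrib name="refresh" type="real"> 30.0 </attrib>
</section>
</section>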
SubDomain
This element specifies the configuration of a crawl sub collection: a named subset of a crawl collection, scoped by its own include/exclude rules, for which selected configuration parameters can be overridden. A crawl collection can contain multiple SubDomain elements.
The configuration parameters for a SubDomain element are specified in subdomains.
A SubDomain element contains attrib elements and section elements.
Attributes
| Attribute | Value | Meaning |
|---|---|---|
| name | <name> | A string specifying the name of the crawl sub collection. |
Example
<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
<DomainSpecification name="subcollection_example">
<SubDomain name="subdomain_1">
<section name="include_uris">
<attrib name="prefix" type="list-string">
<member> https://www.contoso.com/index </member>
</attrib>
</section>
<attrib name="refresh" type="real"> 60.0 </attrib>
<attrib name="delay" type="real"> 10.0 </attrib>
<attrib name="start_uris" type="list-string">
<member> https://www.contoso.com/ </member>
</attrib>
</SubDomain>
</DomainSpecification>
</CrawlerConfig>
Login
This element is used for HTML forms-based authentication. The configuration parameters for a Login element are specified in logins. A crawl collection can contain multiple Login elements. A Login element contains attrib elements and section elements.
Attributes
| Attribute | Value | Meaning |
|---|---|---|
| name | <value> | A string specifying the name of the login specification. |
Example
<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
<DomainSpecification name="login_example">
<Login name="mytestlogin">
<attrib name="preload" type="string">http://preload.contoso.com/
</attrib>
<attrib name="scheme" type="string"> https </attrib>
<attrib name="site" type="string"> login.contoso.com </attrib>
<attrib name="form" type="string"> /path/to/some/form.cgi </attrib>
<attrib name="action" type="string">POST</attrib>
<section name="parameters">
<attrib name="user" type="string"> username </attrib>
<attrib name="password" type="string"> password </attrib>
</section>
<attrib name="sites" type="list-string">
<member> site1.contoso.com </member>
<member> site2.contoso.com </member>
</attrib>
<attrib name="ttl" type="integer"> 7200 </attrib>
<attrib name="html_form" type="string">
http://login.contoso.com/login.html
</attrib>
<attrib name="autofill" type="boolean"> yes </attrib>
<attrib name="relogin_if_failed" type="boolean"> yes </attrib>
</Login>
</DomainSpecification>
</CrawlerConfig>
Node
This element is used to override configuration parameters in a crawl collection or a crawl sub collection for a particular node scheduler. The configuration parameters for a Node element are specified in SubDomain, Login, attrib and section.
A Node element contains attrib elements and section elements.
Attributes
| Attribute | Value | Meaning |
|---|---|---|
| name | <value> | A string specifying the node scheduler for these configuration parameters. |
Example
The following example assumes a multi-node installation in which one of the node schedulers is named "crawler_node1". It configures "crawler_node1" with a different delay configuration parameter than the other nodes.
<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
<DomainSpecification name="node_example ">
<attrib name="delay" type="real"> 60.0 </attrib>
<Node name="crawler_node1">
<attrib name="delay" type="real"> 90.0 </attrib>
</Node>
</DomainSpecification>
</CrawlerConfig>
XML schema
A Web crawler configuration file must be formatted according to the following XML schema:
<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="CrawlerConfig" type="CT_CrawlerConfig"/>
<xs:complexType name="CT_CrawlerConfig >
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element name="DomainSpecification" type="CT_DomainSpecification"/>
</xs:choice>
</xs:complexType>
<xs:complexType name="CT_DomainSpecification">
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element name="attrib" type="CT_attrib" maxOccurs="unbounded"/>
<xs:element name="section" type="CT_section"/>
<xs:element name="SubDomain" type="CT_SubDomain"/>
<xs:element name="Login" type="CT_Login"/>
<xs:element name="Node" type="CT_Node"/>
</xs:choice>
<xs:attribute name="name" type="xs:string" use="required"/>
</xs:complexType>
<xs:complexType name="CT_attrib" mixed="true">
<xs:sequence minOccurs="0" maxOccurs="unbounded">
<xs:element name="member" type="ST_member"/>
</xs:sequence>
<xs:attribute name="name" type="xs:string" use="required"/>
<xs:attribute name="type" type="ST_type" use="required"/>
</xs:complexType>
<xs:complexType name="CT_section">
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element name="attrib" type="CT_attrib"/>
<xs:element name="section" type="CT_section"/>
</xs:choice>
<xs:attribute name="name" type="xs:string" use="required"/>
</xs:complexType>
<xs:complexType name="CT_SubDomain">
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element name="attrib" type="CT_attrib"/>
<xs:element name="section" type="CT_section"/>
</xs:choice>
<xs:attribute name="name" type="xs:string" use="required"/>
</xs:complexType>
<xs:complexType name="CT_Login">
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element name="attrib" type="CT_attrib"/>
<xs:element name="section" type="CT_section"/>
</xs:choice>
<xs:attribute name="name" type="xs:string" use="required"/>
</xs:complexType>
<xs:complexType name="CT_Node">
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element name="attrib" type="CT_attrib"/>
<xs:element name="section" type="CT_section"/>
</xs:choice>
<xs:attribute name="name" type="xs:string" use="required"/>
</xs:complexType>
<xs:simpleType name="ST_type">
<xs:restriction base="xs:string">
<xs:enumeration value="boolean"/>
<xs:enumeration value="string"/>
<xs:enumeration value="integer"/>
<xs:enumeration value="list-string"/>
<xs:enumeration value="real"/>
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="ST_member">
<xs:restriction base="xs:string"></xs:restriction>
</xs:simpleType>
</xs:schema>
Simple configuration
The following example shows a simple Web crawler configuration that crawls only the contoso.com Web site.
<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
<DomainSpecification name="default_example">
<section name="crawlmode">
<attrib name="fwdlinks" type="boolean"> no </attrib>
<attrib name="fwdredirects" type="boolean"> no </attrib>
<attrib name="mode" type="string"> FULL </attrib>
<attrib name="reset_level" type="boolean"> no </attrib>
</section>
<attrib name="start_uris" type="list-string">
<member> https://www.contoso.com </member>
</attrib>
</DomainSpecification>
</CrawlerConfig>
Typical configuration
The following example crawler configuration contains some common configuration parameters.
<?xml version="1.0" encoding="utf-8"?>
<CrawlerConfig>
<DomainSpecification name="default_example">
<attrib name="accept_compression" type="boolean"> yes </attrib>
<attrib name="allowed_schemes" type="list-string">
<member> http </member>
<member> https </member>
</attrib>
<attrib name="allowed_types" type="list-string">
<member> text/html </member>
<member> text/plain </member>
</attrib>
<section name="cachesize">
<attrib name="aliases" type="integer"> 1048576 </attrib>
<attrib name="pp" type="integer"> 1048576 </attrib>
<attrib name="pp_pending" type="integer"> 131072 </attrib>
<attrib name="routetab" type="integer"> 1048576 </attrib>
</section>
<attrib name="check_meta_robots" type="boolean"> yes </attrib>
<attrib name="cookie_timeout" type="integer"> 900 </attrib>
<section name="crawlmode">
<attrib name="fwdlinks" type="boolean"> yes </attrib>
<attrib name="fwdredirects" type="boolean"> yes </attrib>
<attrib name="mode" type="string"> FULL </attrib>
<attrib name="reset_level" type="boolean"> no </attrib>
</section>
<attrib name="csum_cut_off" type="integer"> 0 </attrib>
<attrib name="cut_off" type="integer"> 5000000 </attrib>
<attrib name="dbswitch" type="integer"> 5 </attrib>
<attrib name="dbswitch_delete" type="boolean"> no </attrib>
<attrib name="delay" type="real"> 60.0 </attrib>
<attrib name="domain_clustering" type="boolean"> no </attrib>
<attrib name="enforce_delay_per_ip" type="boolean"> yes </attrib>
<attrib name="exclude_exts" type="list-string">
<member> .jpg </member>
<member> .jpeg </member>
<member> .ico </member>
<member> .tif </member>
<member> .png </member>
<member> .bmp </member>
<member> .gif </member>
<member> .wmf </member>
<member> .avi </member>
<member> .mpg </member>
<member> .wmv </member>
<member> .wma </member>
<member> .ram </member>
<member> .asx </member>
<member> .asf </member>
<member> .mp3 </member>
<member> .wav </member>
<member> .ogg </member>
<member> .ra </member>
<member> .aac </member>
<member> .m4a </member>
<member> .zip </member>
<member> .gz </member>
<member> .vmarc </member>
<member> .z </member>
<member> .tar </member>
<member> .iso </member>
<member> .img </member>
<member> .rpm </member>
<member> .cab </member>
<member> .rar </member>
<member> .ace </member>
<member> .hqx </member>
<member> .swf </member>
<member> .exe </member>
<member> .java </member>
<member> .jar </member>
<member> .prz </member>
<member> .wrl </member>
<member> .midr </member>
<member> .css </member>
<member> .ps </member>
<member> .ttf </member>
<member> .mso </member>
<member> .dvi </member>
</attrib>
<attrib name="extract_links_from_dupes" type="boolean"> no </attrib>
<attrib name="fetch_timeout" type="integer"> 300 </attrib>
<attrib name="force_mimetype_detection" type="boolean"> no </attrib>
<section name="ftp_errors">
<attrib name="4xx" type="string"> DELETE:3 </attrib>
<attrib name="550" type="string"> DELETE:0 </attrib>
<attrib name="5xx" type="string"> DELETE:3 </attrib>
<attrib name="int" type="string"> KEEP:0 </attrib>
<attrib name="net" type="string"> DELETE:3, RETRY:1 </attrib>
<attrib name="ttl" type="string"> DELETE:3 </attrib>
</section>
<attrib name="headers" type="list-string">
<member> User-Agent: FAST Enterprise Crawler 6 </member>
</attrib>
<attrib name="html_redir_is_redir" type="boolean"> yes </attrib>
<attrib name="html_redir_thresh" type="integer"> 3 </attrib>
<section name="http_errors">
<attrib name="4xx" type="string"> DELETE:0 </attrib>
<attrib name="5xx" type="string"> DELETE:10 </attrib>
<attrib name="int" type="string"> KEEP:0 </attrib>
<attrib name="net" type="string"> DELETE:3, RETRY:1 </attrib>
<attrib name="ttl" type="string"> DELETE:3 </attrib>
</section>
<attrib name="if_modified_since" type="boolean"> yes </attrib>
<attrib name="javascript_keep_html" type="boolean"> no </attrib>
<section name="limits">
<attrib name="disk_free" type="integer"> 0 </attrib>
<attrib name="disk_free_slack" type="integer"> 3 </attrib>
<attrib name="max_doc" type="integer"> 0 </attrib>
<attrib name="max_doc_slack" type="integer"> 1000 </attrib>
</section>
<section name="link_extraction">
<attrib name="a" type="boolean"> yes </attrib>
<attrib name="action" type="boolean"> yes </attrib>
<attrib name="area" type="boolean"> yes </attrib>
<attrib name="card" type="boolean"> yes </attrib>
<attrib name="comment" type="boolean"> no </attrib>
<attrib name="embed" type="boolean"> no </attrib>
<attrib name="frame" type="boolean"> yes </attrib>
<attrib name="go" type="boolean"> yes </attrib>
<attrib name="img" type="boolean"> no </attrib>
<attrib name="layer" type="boolean"> yes </attrib>
<attrib name="link" type="boolean"> yes </attrib>
<attrib name="meta" type="boolean"> yes </attrib>
<attrib name="meta_refresh" type="boolean"> yes </attrib>
</section>
<section name="log">
<attrib name="dsfeed" type="string"> text </attrib>
<attrib name="fetch" type="string"> text </attrib>
<attrib name="postprocess" type="string"> text </attrib>
<attrib name="site" type="string"> text </attrib>
</section>
<attrib name="login_failed_ignore" type="boolean"> no </attrib>
<attrib name="login_timeout" type="integer"> 300 </attrib>
<attrib name="max_backoff_counter" type="integer"> 50 </attrib>
<attrib name="max_backoff_delay" type="integer"> 600 </attrib>
<attrib name="max_doc" type="integer"> 1000000 </attrib>
<attrib name="max_pending" type="integer"> 2 </attrib>
<attrib name="max_redirects" type="integer"> 10 </attrib>
<attrib name="max_reflinks" type="integer"> 0 </attrib>
<attrib name="max_sites" type="integer"> 128 </attrib>
<attrib name="max_uri_recursion" type="integer"> 5 </attrib>
<attrib name="mufilter" type="integer"> 0 </attrib>
<attrib name="near_duplicate_detection" type="boolean"> no </attrib>
<attrib name="obey_robots_delay" type="boolean"> no </attrib>
<section name="pp">
<attrib name="ds_max_ecl" type="integer"> 10 </attrib>
<attrib name="ds_meta_info" type="list-string">
<member> duplicates </member>
<member> redirects </member>
<member> mirrors </member>
<member> metadata </member>
</attrib>
<attrib name="ds_paused" type="boolean"> no </attrib>
<attrib name="ds_send_links" type="boolean"> no </attrib>
<attrib name="max_dupes" type="integer"> 10 </attrib>
<attrib name="stripe" type="integer"> 1 </attrib>
</section>
<section name="ppdup">
<attrib name="compact" type="boolean"> yes </attrib>
</section>
<attrib name="proxy_max_pending" type="integer"> 2147483647 </attrib>
<attrib name="refresh" type="real"> 1440.0 </attrib>
<attrib name="refresh_mode" type="string"> scratch </attrib>
<attrib name="refresh_when_idle" type="boolean"> no </attrib>
<attrib name="robots" type="boolean"> yes </attrib>
<attrib name="robots_auth_ignore" type="boolean"> yes </attrib>
<attrib name="robots_timeout" type="integer"> 300 </attrib>
<attrib name="robots_tout_ignore" type="boolean"> no </attrib>
<attrib name="robots_ttl" type="integer"> 86400 </attrib>
<section name="rss">
<attrib name="auto_discover" type="boolean"> no </attrib>
<attrib name="del_expired_links" type="boolean"> no </attrib>
<attrib name="follow_links" type="boolean"> no </attrib>
<attrib name="ignore_rules" type="boolean"> no </attrib>
<attrib name="index_feed" type="boolean"> no </attrib>
<attrib name="max_link_age" type="integer"> 0 </attrib>
<attrib name="max_link_count" type="integer"> 128 </attrib>
</section>
<attrib name="smfilter" type="integer"> 0 </attrib>
<attrib name="sort_query_params" type="boolean"> no </attrib>
<attrib name="start_uris" type="list-string">
<member> https://www.contoso.com </member>
</attrib>
<section name="storage">
<attrib name="clusters" type="integer"> 8 </attrib>
<attrib name="compress" type="boolean"> yes </attrib>
<attrib name="compress_exclude_mime" type="list-string">
<member> application/x-shockwave-flash </member>
</attrib>
<attrib name="datastore" type="string"> bstore </attrib>
<attrib name="defrag_threshold" type="integer"> 85 </attrib>
<attrib name="remove_docs" type="boolean"> no </attrib>
<attrib name="store_dupes" type="boolean"> no </attrib>
<attrib name="store_http_header" type="boolean"> yes </attrib>
</section>
<attrib name="truncate" type="boolean"> no </attrib>
<attrib name="umlogs" type="boolean"> yes </attrib>
<attrib name="uri_search_mime" type="list-string">
<member> text/html </member>
<member> text/vnd.wap.wml </member>
<member> text/wml </member>
<member> text/x-wap.wml </member>
<member> x-application/wml </member>
<member> text/x-hdml </member>
</attrib>
<attrib name="use_cookies" type="boolean"> no </attrib>
<attrib name="use_http_1_1" type="boolean"> yes </attrib>
<attrib name="use_javascript" type="boolean"> no </attrib>
<attrib name="use_meta_csum" type="boolean"> no </attrib>
<attrib name="use_sitemaps" type="boolean"> no </attrib>
<section name="workqueue_priority">
<attrib name="default" type="integer"> 1 </attrib>
<attrib name="levels" type="integer"> 1 </attrib>
<attrib name="pop_scheme" type="string"> default </attrib>
<attrib name="start_uri_pri" type="integer"> 1 </attrib>
</section>
</DomainSpecification>
</CrawlerConfig>