Integrating an External Item Processing Component

Applies to: SharePoint Server 2010

In this article
Pipeline Extensibility Overview
Command Environment
Pipeline Extensibility vs. IFilter
Handling Crawled Properties
Customizing pipelineextensibility.xml
File Format for pipelineextensibility.xml
Input/Output File Format
Creating a Custom Command

Some applications require extensions to item processing. For example, you can generate additional searchable metadata from the content, or forward a piece of information from the item to a third-party application for statistical or monitoring purposes.

This topic describes how to create an external item processing component for FAST Search Server 2010 for SharePoint.

Pipeline Extensibility Overview

The item processing pipeline in FAST Search Server 2010 for SharePoint performs various tasks, such as text extraction, language detection, and tokenization to prepare an item from a content source for indexing and searching.

The pipeline extensibility interface enables you to run custom processing commands for each item that is fed through the pipeline. A command takes a set of crawled properties as input, processes them, and outputs another set of crawled properties. The communication between the item processing pipeline and the custom command occurs by using temporary XML files. The pipeline extensibility processing takes place before crawled properties are mapped to managed properties.

The input crawled properties may be metadata retrieved from the content source, or you can use custom XML item processing (see Custom XML Item Processing) on retrieved XML items to create crawled properties from the XML data.

You specify the commands to run in pipelineextensibility.xml. The format for this configuration file is specified in the Pipeline Extensibility Configuration Schema. For each command, list the crawled properties that the command will receive as input, and list the crawled properties the command will output.

Important

Any changes that you make to this file will be overwritten and lost if you install a FAST Search Server 2010 for SharePoint update or service pack.

This configuration file is not backed up by the standard FAST Search Server 2010 for SharePoint backup procedure. To avoid losing your changes, ensure that you back up this file after you modify it.

Be sure to reapply your changes to the configuration file after you install a FAST Search Server 2010 for SharePoint update or service pack.

Command Environment

Before a command is run, the input crawled properties are collected in a temporary XML file. The command then reads that file and outputs the resulting crawled properties to another temporary XML file. The paths to the input and output files are available as special string sequences when specifying the command to run. The file format for the input and output file is identical and is specified in the Pipeline Extensibility Interface Schema.

The commands run in a sandboxed environment that limits processing time, memory, and the number of processes. Apart from the input and output files, a command has read/write access only to files that are publicly accessible.

Full network access is provided. If you do not need network access, create a firewall rule to block your application in case it is compromised.

If you have problems running the command in a sandboxed environment, consider changing the code to forward the input to a separate process or service that uses sockets, and then let it perform the main work.
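The forwarding approach can be sketched as a thin command that sends the input file to a separate service over TCP and saves the reply as the output file. This is only an illustration: the class name, the port number 9000, and the wire protocol (send the whole file, half-close the connection, read the reply until the service closes it) are assumptions, and the service itself is not shown.

```csharp
using System;
using System.IO;
using System.Net.Sockets;

namespace Sample
{
    // Hypothetical thin command: forwards the input file to a local service
    // and stores the service's reply as the output file.
    public static class ForwardingCommand
    {
        // Assumed protocol: send the whole input file, half-close the
        // connection, then read the reply until the service closes it.
        public static void Forward(string host, int port, string inputPath, string outputPath)
        {
            using (var client = new TcpClient(host, port))
            using (NetworkStream stream = client.GetStream())
            {
                byte[] request = File.ReadAllBytes(inputPath);
                stream.Write(request, 0, request.Length);

                // Signal end-of-request while keeping the receive side open.
                client.Client.Shutdown(SocketShutdown.Send);

                using (FileStream output = File.Create(outputPath))
                {
                    stream.CopyTo(output);
                }
            }
        }

        public static int Main(string[] args)
        {
            try
            {
                // Assumption: the service listens on localhost port 9000.
                Forward("localhost", 9000, args[0], args[1]);
                return 0;
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine("Forwarding failed: " + ex.Message);
                return 1;
            }
        }
    }
}
```

With this split, the sandboxed command does little more than file and socket I/O, while the service runs outside the sandbox and performs the main work.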

Pipeline Extensibility vs. IFilter

To implement text extraction for a new document format, implement an IFilter and register it with FAST Search Server 2010 for SharePoint. For more information, see Configure FAST Search Server for SharePoint to use a Third-Party IFilter. Alternatively, you can implement text extraction support with the pipeline extensibility feature.

Handling Crawled Properties

Crawled properties are metadata that is extracted from content sources to make the data available for searching. Crawled properties are typically reported by the Content SSA or other FAST Search Server 2010 for SharePoint connectors, but can also be created during item processing by an IFilter or a custom processing stage.

A crawled property is uniquely defined by Name, Propset, and VariantType. You can create a custom crawled property by using the Windows PowerShell cmdlet New-FASTSearchMetadataCrawledProperty. You can choose to use an existing Propset, or create a Propset for the custom crawled properties.

You can map each crawled property to a managed property, or you can define a category of crawled properties and map the entire category to the default full-text index. For more information, see Manage Crawled Properties by Using Windows PowerShell (FAST Search Server 2010 for SharePoint) and Index Schema Reference.

Only crawled property types that can be mapped to a managed property are supported.

A special property set contains crawled properties that are created inside the item processing pipeline. A subset of these read-only properties may be passed to the pipeline extensibility command. This includes the url property, data property, and body property. The data property contains the binary content of the source document encoded in base64. For more information, see CrawledProperty Element [Pipeline Extensibility Configuration Schema].
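As a minimal sketch of how a command might consume the data property, the following C# helper finds a crawled property by name in the input document and decodes its base64 value. The class and method names are hypothetical, and matching on propertyName alone, without checking the property set GUID, is an assumption made to keep the sketch short.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

namespace Sample
{
    // Hypothetical helper; not part of the product API.
    public static class DataPropertyReader
    {
        // Returns the decoded bytes of the "data" crawled property,
        // or null if the input document does not contain it.
        public static byte[] ReadData(XDocument inputDoc)
        {
            // Assumption: the property is identified by propertyName="data".
            XElement data = inputDoc
                .Descendants("CrawledProperty")
                .FirstOrDefault(cp => (string)cp.Attribute("propertyName") == "data");

            if (data == null)
            {
                return null;
            }

            // Convert.FromBase64String ignores white space inside the encoded text.
            return Convert.FromBase64String(data.Value);
        }
    }
}
```

A command could call DataPropertyReader.ReadData(XDocument.Load(args[0])) and pass the resulting bytes to its own parser for the document format.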

Customizing pipelineextensibility.xml

Use a text editor or XML editor to customize this file.

Note

To modify a configuration file, you must be a member of the FASTSearchAdministrators local group on the computer where FAST Search Server 2010 for SharePoint is installed.

To edit pipelineextensibility.xml

  1. On each server in the FAST Search Server 2010 for SharePoint farm deployment, update %FASTSEARCH%\etc\pipelineextensibility.xml with your changes.

  2. On the FAST Search Server 2010 for SharePoint administration node, run the command psctrl reset to reset all currently running item processors in the system.

  3. Save a backup of the file, as this file is not part of the configuration backup/restore process in FAST Search Server 2010 for SharePoint.

  4. After you apply a software update, restore this file from your backup. For more information, see the note in Pipeline Extensibility Overview.

File Format for pipelineextensibility.xml

The following is the basic structure of the pipelineextensibility.xml file.

<PipelineExtensibility>
    <Run command='CommandName'>
        <Input>
            <CrawledProperty propertySet='GUID' propertyName='PropertyName' propertyId='PropertyId' varType='PropertyType'/>
        </Input>
        <Output>
            <CrawledProperty propertySet='GUID' propertyName='PropertyName' propertyId='PropertyId' varType='PropertyType' defaultValue='DefaultValue'/>
        </Output>
    </Run>
</PipelineExtensibility>

For more information about the XML syntax, see Pipeline Extensibility Configuration Schema.

You may specify multiple Run elements, which will be processed in the order they occur in the configuration file. You may also specify multiple CrawledProperty elements in the Input and Output elements.

The following example shows the configuration of a command that reads the Author property (identified by property ID 4) from the summary information property set, and outputs another property named MyProperty. The paths to the input and output files are available as the special string sequences "%(input)s" and "%(output)s", as follows.

<PipelineExtensibility>
  <Run command="sample.exe %(input)s %(output)s">
    <Input>
      <CrawledProperty propertySet="f29f85e0-4ff9-1068-ab91-08002b27b3d9" varType="31" propertyId="4"/>
    </Input>
    <Output>
      <CrawledProperty propertySet="d5cdd505-2e9c-101b-9397-08002b2cf9ae" varType="31" propertyName="MyProperty"/>
    </Output>
  </Run>
</PipelineExtensibility>

You may retrieve the propertySet and varType for a crawled property by using the Windows PowerShell cmdlet Get-FASTSearchMetadataCrawledProperty. The following example shows how to do this for the crawled property named fileextension.

PS C:\FASTSearch\etc> Get-FASTSearchMetadataCrawledProperty -name fileextension

CategoryName       : Basic
IsMappedToContents : False
IsNameEnum         : False
IsMultiValued      : False
Name               : FileExtension
Propset            : 0b63e343-9ccc-11d0-bcdb-00805fccce04
VariantType        : 31

Input/Output File Format

The intermediate input and output files that contain the crawled properties use the same XML format.

The following is the basic structure of the intermediate files.

<Document>
 <CrawledProperty propertySet='GUID' propertyName='PropertyName' propertyId='PropertyId' varType='PropertyType'>propertyValue</CrawledProperty>
</Document>

For information about the XML syntax, see Pipeline Extensibility Interface Schema.

The following example shows the input file for the example in File Format for pipelineextensibility.xml. The example assumes that the author is John Doe.

<Document>
  <CrawledProperty propertySet="f29f85e0-4ff9-1068-ab91-08002b27b3d9" 
                   varType="31" propertyId="4">John Doe</CrawledProperty>
</Document>

The input file will always be encoded in UTF-8 by the FAST Search Server 2010 for SharePoint item processing pipeline. The output file constructed by the command must be encoded in either UTF-8 or UTF-16 to ensure that all characters are interpreted correctly.
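One way to make the output encoding explicit, rather than relying on defaults, is to write the file through an XmlWriter configured for UTF-8. The following is a sketch under that assumption; the class and method names are hypothetical.

```csharp
using System.Text;
using System.Xml;
using System.Xml.Linq;

namespace Sample
{
    // Hypothetical helper for writing the output file.
    public static class OutputWriter
    {
        public static void WriteUtf8(XElement document, string outputPath)
        {
            var settings = new XmlWriterSettings
            {
                // UTF-8 without a byte order mark; the XML declaration
                // written by XmlWriter names the same encoding.
                Encoding = new UTF8Encoding(false)
            };

            using (XmlWriter writer = XmlWriter.Create(outputPath, settings))
            {
                document.Save(writer);
            }
        }
    }
}
```

Setting the encoding in one place keeps the bytes on disk and the encoding named in the XML declaration consistent, including for property values that contain non-ASCII characters.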

Creating a Custom Command

The special string sequence "%(input)s" in the command attribute will be replaced by the actual path of the input file. If the Input element has any CrawledProperty elements, this sequence is required.

The special string sequence "%(output)s" in command will be replaced by the actual path of the output file. If the Output element has any CrawledProperty elements, this sequence is required and a valid output file must be generated. The output file must at least contain the top-level Document element, as follows.

<Document></Document>

The command is permitted to output a subset of the listed CrawledProperty elements.

The FAST Search Server 2010 for SharePoint item processing pipeline uses the exit code of the command to determine whether it ran successfully. An exit code of zero indicates that the command succeeded and was able to run from start to finish. Use a non-zero exit code only when an abnormal situation prevents the command from performing its task (for example, errors in a configuration file, a missing DLL, or unhandled exceptions). In such cases, the command should write a short message that explains the problem to standard output or standard error. This message is forwarded to the crawl log.
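The exit code convention described above can be sketched as follows. The class name, the message text, and the exit code value 1 are illustrative assumptions; the processing step is a placeholder that writes an empty but valid output item.

```csharp
using System;
using System.IO;
using System.Xml.Linq;

namespace Sample
{
    public static class ExitCodeSample
    {
        // Runs the command logic and returns the process exit code.
        public static int Run(string[] args, TextWriter error)
        {
            try
            {
                if (args.Length != 2)
                {
                    throw new ArgumentException("Expected <input file> <output file>.");
                }

                // Loading the input fails if the file is missing or not well-formed XML.
                XDocument.Load(args[0]);

                // Placeholder for real processing: emit an empty but valid output item.
                new XElement("Document").Save(args[1]);
                return 0;
            }
            catch (Exception ex)
            {
                // Keep the message short; it is meant for whoever reads the log.
                error.WriteLine("sample.exe failed: " + ex.Message);
                return 1;
            }
        }

        public static int Main(string[] args)
        {
            return Run(args, Console.Error);
        }
    }
}
```

Catching exceptions at the top level means that an unexpected failure produces a readable message and a non-zero exit code instead of an unhandled exception.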

Before the configuration is applied to the installation, verify that the command actually works as expected on each node by creating sample input files and manually invoking the command together with the input files.
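For manual verification, a small generator can produce sample input files in the expected format. The following sketch builds an input document with a single Author property that matches the configuration example earlier in this topic. The class name and the argument order are hypothetical.

```csharp
using System.Xml.Linq;

namespace Sample
{
    // Hypothetical test helper; not part of the product.
    public static class SampleInput
    {
        // Builds an input document with a single Author property
        // (summary information property set, property ID 4, varType 31).
        public static XDocument Create(string author)
        {
            return new XDocument(
                new XElement("Document",
                    new XElement("CrawledProperty",
                        new XAttribute("propertySet", "f29f85e0-4ff9-1068-ab91-08002b27b3d9"),
                        new XAttribute("varType", "31"),
                        new XAttribute("propertyId", "4"),
                        author)));
        }

        public static void Main(string[] args)
        {
            // args[0]: path of the sample input file to create.
            // args[1]: author name to store in it.
            Create(args[1]).Save(args[0]);
        }
    }
}
```

After generating a file such as input.xml, run the command by hand (for example, sample.exe input.xml output.xml), then inspect output.xml and the exit code on each node.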

The following is a simple C# implementation of sample.exe, the command referenced in the File Format for pipelineextensibility.xml section. It prefixes the value of the input property with the fixed string "Mr./Mrs." and writes the result to a new property.

using System;
using System.Linq;
using System.Xml.Linq;

namespace Sample
{
    class Sample
    {
        public static readonly Guid FMTID_SummaryInformation = new Guid("F29F85E0-4FF9-1068-AB91-08002B27B3D9");
        public static readonly Guid FMTID_UserDefinedProperties = new Guid("D5CDD505-2E9C-101B-9397-08002B2CF9AE");

        static void Main(string[] args)
        {
            // Fetch the author property from the input item
            XDocument inputDoc = XDocument.Load(args[0]);
            var res = from cp in inputDoc.Descendants("CrawledProperty")
                      where new Guid(cp.Attribute("propertySet").Value).Equals(FMTID_SummaryInformation) &&
                            cp.Attribute("propertyId").Value == "4" &&
                            cp.Attribute("varType").Value == "31"
                      select cp.Value;

            // Create the output item
            XElement outputElement = new XElement("Document");

            // Add a crawled property where the author is prefixed with "Mr./Mrs."
            if (res.Count() > 0 && res.First().Length > 0)
            {
                outputElement.Add(
                    new XElement("CrawledProperty",
                        new XAttribute("propertySet", FMTID_UserDefinedProperties),
                        new XAttribute("propertyName", "MyProperty"),
                        new XAttribute("varType", 31),
                        "Mr./Mrs. " + res.First()));
            }

            // Write the output file; it always contains the Document element
            outputElement.Save(args[1]);
        }
    }
}

See Also

Concepts

Pipeline Extensibility Configuration Schema

Pipeline Extensibility Interface Schema

Custom XML Item Processing