SharePoint 2013 and CEWS: Searchable Extended File System Properties

For more information about CEWS or the CEWS toolkit, visit this resource:

https://social.technet.microsoft.com/wiki/contents/articles/19376.sharepoint-cews-pipeline-toolkit.aspx

 

Out of the box, SharePoint 2013 takes a "best effort" approach to extracting searchable properties from items at crawl time. The results vary quite a bit depending on the type of crawl (SharePoint, File System, Custom, etc.). A recent request I worked on was to enhance the experience around a file system crawl. If you check the properties of a file in Windows Explorer, you will see a set of extended properties that depends on the item type. Images have common properties such as Height, Width, and Bit Depth; video files might have a Frame Rate or a Length. As an example, these are the properties I wanted:

 

 

After some research, there appear to be two main options for picking up these properties:

  1. Use an appropriate iFilter for the types of content you are crawling
  2. Use CEWS and a little custom code

The iFilter approach is a fairly robust one. iFilters run at crawl time to extract all kinds of properties, and their logic is shared across the system (not just SharePoint). The CEWS approach is less intrusive in this case because it only affects your SharePoint crawl. In this article, we will take a look at the CEWS approach.

To begin, I needed to figure out how to use C# to access the properties that are visible in Windows Explorer. It turns out that there are some COM-based hooks for getting at this information, and the code looks a little like this:

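  // Requires a COM reference to "Microsoft Shell Controls And Automation" (Shell32).
  using System;
  using System.IO;
  using System.Reflection;
  using Shell32;

  // Late-bind the shell object, then open the folder that contains the file.
  Type shelltype = Type.GetTypeFromProgID("Shell.Application");
  Object shell = Activator.CreateInstance(shelltype);

  FileInfo fi = new FileInfo(@"f:\fs_crawl\videos\drop.avi");
  Folder folder = (Folder)shelltype.InvokeMember("NameSpace", BindingFlags.InvokeMethod, null, shell, new object[] { fi.DirectoryName });
  FolderItem item = folder.ParseName(fi.Name);

  // Passing null returns the column header; passing the item returns its value.
  for (int i = 0; i < 512; i++)
  {
      var dtlDesc = folder.GetDetailsOf(null, i);
      var dtlVal = folder.GetDetailsOf(item, i);

      // Skip columns that have no value for this file.
      if (String.IsNullOrEmpty(dtlVal))
          continue;

      Console.WriteLine("{0}:{1}", dtlDesc, dtlVal);
  }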

There's a lot going on in this little snippet, so let's take it apart. The code creates a hook into the shell and gives it the path to a file (drop.avi). It then enumerates the details (read: properties) of the item, from index 0 up to 511. Under the hood, each property sits at a particular column index, and which indexes are populated depends on the file type. This could be made more efficient by requesting only the indexes that matter for known file types rather than enumerating all of them, but since I did not know those indexes initially, enumerating made more sense for the sake of example.
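One way to avoid probing every index for every file is to read the column headers once and keep a name-to-index map. The following is a minimal sketch of that idea, not the code from my stage: `DetailColumns` and `BuildIndex` are hypothetical names, and the header lookup is passed in as a delegate so the helper itself has no COM dependency.

```csharp
using System;
using System.Collections.Generic;

public static class DetailColumns
{
    // Builds a header-name -> column-index map by probing indexes 0..maxColumns-1.
    // getHeader is assumed to wrap folder.GetDetailsOf(null, i).
    public static Dictionary<string, int> BuildIndex(Func<int, string> getHeader, int maxColumns = 512)
    {
        var map = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i < maxColumns; i++)
        {
            string name = getHeader(i);

            // Skip unused columns and repeated names; the first index wins.
            if (string.IsNullOrEmpty(name) || map.ContainsKey(name))
                continue;

            map[name] = i;
        }
        return map;
    }
}
```

With the map in hand, a single property can be requested directly, for example `folder.GetDetailsOf(item, columns["Width"])` where `columns = DetailColumns.BuildIndex(i => folder.GetDetailsOf(null, i))`. Keep in mind that the header names come from the shell and are localized, so "Width" assumes an English system.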

Now that I had the ability to pull properties from file system items, I cleaned up the code and created a new custom processing stage using the CEWS Toolkit. With CEWS, one consideration is that the file may NOT live on the same system that the CEWS Service runs on. That is not a problem here because the code snippet also works with a UNC path. In my lab, I run the CEWS Service under the same user account that performs the crawl, so my CEWS stage has the same rights to access files as the crawler does.
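Because the stage receives a crawled path rather than a local one, the value has to be turned into something `FileInfo` understands. Here is a small sketch of that step, assuming the Path value arrives as a file:// URL (which is what I would expect from a file share crawl); `CrawlPaths` and `ToFileSystemPath` are hypothetical names.

```csharp
using System;

public static class CrawlPaths
{
    // Returns a file system path for a file:// crawl URL,
    // or null when the value is not an absolute file URL.
    public static string ToFileSystemPath(string crawlPath)
    {
        Uri uri;
        if (!Uri.TryCreate(crawlPath, UriKind.Absolute, out uri) || !uri.IsFile)
            return null;

        // For file://server/share/... URLs on Windows this is the UNC form.
        return uri.LocalPath;
    }
}
```

On Windows, a value such as file://server/share/videos/drop.avi should come back as \\server\share\videos\drop.avi, which can be handed to `new FileInfo(...)` as in the earlier snippet. Items from other content sources (http://, for example) return null and can simply be skipped.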

Bringing this all together, here are the managed properties I created to support this:

 

As input, I primarily needed the Path and IsDocument properties. I included a check to avoid trying to get any properties from folders. As an enhancement, a trigger could be created that only allows items into CEWS when IsDocument is true.
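The folder check can be sketched roughly as follows. This is an illustration only: a plain dictionary stands in for whatever property collection the stage actually receives, and `StageInput` and `ShouldReadShellProperties` are hypothetical names rather than part of the CEWS Toolkit.

```csharp
using System;
using System.Collections.Generic;

public static class StageInput
{
    // Returns true only for items that are documents and carry a usable Path.
    // props is a stand-in for the input properties handed to the stage.
    public static bool ShouldReadShellProperties(IDictionary<string, object> props)
    {
        object isDoc;
        if (!props.TryGetValue("IsDocument", out isDoc) || !(isDoc is bool) || !(bool)isDoc)
            return false; // folders and items without the flag are skipped

        object path;
        if (!props.TryGetValue("Path", out path))
            return false;

        return !string.IsNullOrWhiteSpace(path as string);
    }
}
```

Doing the check inside the stage keeps it safe even when no trigger is configured; with a trigger on IsDocument in place, the check becomes a second line of defense rather than the main filter.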

I created a number of new Managed Properties to hold data for image and video sizes and lengths. I made a number of these into refiners as well. Once it was all wired up and crawled, here are the populated managed properties coming back in search: