Sample: Parsing Content in C# Using IFilter

I’m working on a three- (or four-) part tutorial right now that requires parsing PDF files. The code got big enough that I decided to pull it out into its own post that I can reference in the series (stay tuned).

There are several solutions for reading through various file formats. One of them is the IFilter interface, which was defined so that Windows search indexing can extract text from files. There are lots of filter providers for various formats, including several from Microsoft. If you want to parse PDF files, you’ll need to have a PDF provider installed as well. The FoxIt IFilter download page has a provider that, according to their website, is free for client use (my case).

While looking around for sample code, I found a few examples that came close to what I wanted, but I didn’t have much luck finding a C# example. I’ve pulled together various pieces of code to create a basic implementation for my (simple) needs. You can find some interesting links here:

The sample contains a class library for parsing files and a console application that can be used to exercise the library against files. The code was built with a current internal build of VS2010 (stay tuned here for the beta notice), but the key code (FilterCode.cs) should work fine on previous versions of VS and the .NET Framework.

I’ve uploaded the solution to the MSDN code gallery here:

To use the sample, include FilterCode.cs in your project, create a new instance of FilterLibrary.FilterCode, and call the GetTextFromDocument method against the file you want to parse.  If you have a filter installed for that document type, you will get back a StringBuilder with the text contents of the file.
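
As a rough illustration of those steps, here is a minimal console sketch. It is not runnable on its own: it assumes FilterCode.cs from the sample is included in the project and that an IFilter provider is installed for the file type. The exact signature of GetTextFromDocument is an assumption (a single string path argument); the description above only confirms that it returns a StringBuilder. The Program class, the usage message, and the null check are purely illustrative.

```csharp
using System;
using System.Text;
using FilterLibrary; // namespace of the sample's FilterCode class

class Program
{
    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("Usage: ParseFile <path-to-document>");
            return;
        }

        // Assumption: GetTextFromDocument accepts the file path as a string.
        // Check FilterCode.cs for the actual parameter list.
        FilterCode filter = new FilterCode();
        StringBuilder text = filter.GetTextFromDocument(args[0]);

        // Defensive check; how the sample reports a missing filter is not
        // stated in the post, so this is only a guess at one possibility.
        if (text == null)
        {
            Console.Error.WriteLine("No text was returned for this file.");
            return;
        }

        Console.WriteLine(text.ToString());
    }
}
```

If the method in FilterCode.cs takes different arguments, adjust the call accordingly.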

Enjoy!

Comments

  • Anonymous
    August 31, 2009
    Sorry Doug, I inadvertently used a new API from Beta 2.  I just posted a new version that works against Beta 1.  The fix is simple:  construct the temporary StringBuilder inline rather than reuse the original.

  • Anonymous
    August 31, 2009
    Look at iTextSharp, a .NET PDF read/write library. It's a .NET port of the iText (Java) library. It supports almost all of PDF except for JBIG2.

  • Anonymous
    September 02, 2009
    Greg - thanks for the pointer to iText. In my case the IFilter solution will work against any underlying file format, so it still works if the document type is changed to DOC*, XPS, etc.

  • Anonymous
    September 04, 2009
    Mr. Zander, when will we be getting VS 2010 Beta 2? As a Microsoft Certified Partner, do we have any way of getting VS 2010 Beta 2 a bit earlier than the general public? We are required to build a WF, Silverlight 3.0, workflow/orchestration-based product, and instead of .NET Framework v3.5, we hope to take advantage of 2010 Beta 2 ASAP. Thanks! Jason Huang Executive VP & Co-Founder enfoTech & Consulting Inc. (609) 896-9777 x108

  • Anonymous
    September 10, 2009
    Jason - We're not ready to publish a release date for Beta 2 just yet, although we are making lots of great progress. For projects with near-term deliverables, .NET Framework 3.5 is still the right target. Stay tuned to this blog for more info...

  • Anonymous
    September 12, 2009
    Hey Jason, I used IFilter to extract text from a few different file formats in this 'basic' ASP.NET search engine: http://searcharoo.net/SearcharooV4/ (although I eventually wrote a specific handler for PDF using iTextSharp). The IFilter code is based on this C# IFilter sample: http://www.codeproject.com/KB/cs/IFilter.aspx. Not sure how it compares to yours...