Sample: Parsing Content in C# Using IFilter
I’m working on a three (or four) part tutorial right now that requires parsing PDF files. The parsing code got big enough that I decided to pull it out into its own post that I can reference from the series (stay tuned).
There are several solutions for reading through various file formats. One of them is the IFilter interface, which was defined so that Windows can extract text from files for search indexing. There are lots of filter providers for various formats, including several from Microsoft. If you want to parse PDF files, you’ll need to have a PDF provider installed as well. The FoxIt IFilter download page has a provider that, according to their website, is free for client use (my case).
In looking around for sample code, I found a few examples that came close to what I wanted, but I didn’t have much luck finding a C# example. I’ve pulled together various pieces of code to create a basic implementation for my (simple) needs. You can find some interesting links here:
- MSDN Documentation for IFilter
- Codeplex IFilter sample code in C++ (under MS-PL)
- P/Invoke definitions for IFilter and related members from www.pinvoke.net
The sample contains a class library that does the parsing and a console application that can be used to exercise the library against files. The code is built using a current internal build of VS2010 (stay tuned here for beta notice), but the key code (FilterCode.cs) should work fine on previous versions of VS and the .NET Framework.
I’ve uploaded the solution to the MSDN code gallery here:
To use the sample, include FilterCode.cs in your project, create a new instance of FilterLibrary.FilterCode, and call the GetTextFromDocument method against the file you want to parse. If you have a filter installed for that document type, you will get back a StringBuilder with the text contents of the file.
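A minimal sketch of those steps as a console program, assuming FilterCode.cs from the sample is already part of the project. The Program class, the usage message, and the exit codes are hypothetical and only here for illustration. The sketch also assumes that FilterCode has a public parameterless constructor and that GetTextFromDocument takes the file path as a string; check FilterCode.cs for the actual signatures.

```csharp
using System;
using System.Text;
using FilterLibrary; // namespace of the sample's FilterCode class

class Program
{
    static int Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("Usage: ParseFile <path-to-document>");
            return 1;
        }

        // Assumption: FilterCode has a public parameterless constructor.
        var filter = new FilterCode();

        // Assumption: the method takes the file path as a string.
        // Per the post, it returns a StringBuilder holding the text contents.
        StringBuilder text = filter.GetTextFromDocument(args[0]);

        // Defensive check: the post does not say what comes back when no
        // filter is installed for the file type, so treat null/empty as "no text".
        if (text == null || text.Length == 0)
        {
            Console.Error.WriteLine("No text extracted. Is an IFilter installed for this file type?");
            return 2;
        }

        Console.WriteLine(text.ToString());
        return 0;
    }
}
```

Because the result depends on which filters are installed on the machine, the same program may return text for a .doc file and nothing for a .pdf until a PDF filter provider is installed.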
Enjoy!
Comments
Anonymous
August 31, 2009
The comment has been removed
Anonymous
August 31, 2009
Sorry Doug, I inadvertently used a new API from Beta 2. I just posted a new version that works against Beta 1. The fix is simple: construct the temporary StringBuilder inline rather than reuse the original.
Anonymous
August 31, 2009
Look at iTextSharp, a .NET PDF read/write library. It's a .NET port of the iText (Java) library. It supports almost all of PDF except for JBIG2.
Anonymous
September 02, 2009
Greg - thanks for the pointer to iText. In my case the IFilter solution will work against any underlying file format, so the same code keeps working if the document type is changed to DOC*, XPS, etc.
Anonymous
September 04, 2009
Mr. Zander, When will we be getting VS 2010 Beta 2? As a Microsoft Certified Partner, do we have any way(s) of getting VS 2010 Beta 2 a bit earlier than the general public? We are required to build a WF, Silverlight 3.0, workflow/orchestration based product and, instead of .NET Framework v3.5, we hope to take advantage of 2010 Beta 2 ASAP. Thanks! Jason Huang Executive VP & Co-Founder enfoTech & Consulting Inc. (609) 896-9777 x108
Anonymous
September 10, 2009
Jason - We're not ready to publish a release date for Beta 2 just yet, although we are making lots of great progress. For projects with near-term deliverables, .NET Framework 3.5 is still the right target. Stay tuned to this blog for more info...
Anonymous
September 12, 2009
Hey Jason, I used IFilter to extract text from a few different file formats in this 'basic' ASP.NET search engine http://searcharoo.net/SearcharooV4/ (although I eventually wrote a specific handler for PDF using iTextSharp). The IFilter stuff is based on this C# IFilter sample http://www.codeproject.com/KB/cs/IFilter.aspx Not sure how it compares to yours...