.NET Regex Data Services Library for HTML (RDSH)
1.1 Rationale and Purpose
Every now and then we all run into situations where we need to interface with legacy web applications, and sometimes there is no solid way to do this other than the so-called “screen-scraping” option. In ideal situations it would be preferable to go directly to the data in question and bypass the existing UI. In less than ideal situations we have no choice but to find ways to convert raw HTML web pages into structured data. As we all know, doing this consistently is not trivial.
If you are going after a few pages that rarely change or that are similarly formatted, there might be a temptation to simply hardcode the parsing logic with HTML tag delimiters in one’s code. This is not recommended no matter how small the solution is, because every change to the page markup then means a change to the code.
1.2 The Regex Data Services for HTML (RDSH) Solution
This is where the RDSH solution comes in. I developed this solution and have been using it for over 8 years, refining its accuracy over time. Regex Data Services for HTML (RDSH) is a rich text-parsing library that lets you declaratively define fields within an HTML document, using tags to uniquely identify each field. Once the data fields are defined, you deal with these elements much as you would a database table row. The advantage is that the RDSH library abstracts the Regex and parsing implementation so that you stay at the higher conceptual level. The effort lies in defining the data elements in your project; once that is complete, the RDSH system is ready to go to work for you, combing the internet per your commands.
1.3 RDSH Conceptual Diagram
1.4 Structure Parser Design
Because RDSH uses regular expressions, it is a breeze to parse out specific field and data types such as emails, phone numbers, etc.
Defining a field element as a particular data type automatically tells the RDSH system to parse it accordingly, using the built-in data type parsers for the field types listed above. These types require no custom code, as the system identifies them automatically.
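To make the idea of built-in data type parsers concrete, here is a minimal sketch of how a data type could be mapped to a regular expression. It is not the RDSH implementation: the enum SketchFieldDataType, the class TypedFieldSketch, and both patterns are assumptions made for illustration, and the patterns are deliberately simple.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the FieldDataType enum; the real members
// are not shown in this article.
public enum SketchFieldDataType { Text, Email, Phone }

public static class TypedFieldSketch
{
    // Simplified patterns for illustration only; robust email and phone
    // matching needs considerably more care.
    private static readonly Dictionary<SketchFieldDataType, Regex> Patterns =
        new Dictionary<SketchFieldDataType, Regex>
        {
            { SketchFieldDataType.Email, new Regex(@"[\w.+-]+@[\w-]+(\.[\w-]+)+") },
            { SketchFieldDataType.Phone, new Regex(@"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}") },
        };

    // Returns the first typed match in the raw text. Types without a
    // registered pattern fall back to the trimmed raw text; a typed
    // pattern that finds nothing yields null.
    public static string Extract(string raw, SketchFieldDataType type)
    {
        Regex pattern;
        if (!Patterns.TryGetValue(type, out pattern))
            return raw.Trim();

        Match match = pattern.Match(raw);
        return match.Success ? match.Value : null;
    }
}
```

With this sketch, calling TypedFieldSketch.Extract on the text "Contact: jane.doe@example.com today" with SketchFieldDataType.Email returns just the address, without any field-specific code at the call site.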
1.5 IProject and IRowSource
The RDSH system relies on a project definition. A project can have multiple IRowSources defined.
The IRowSource interface is analogous to each unique type of web page or data table that you need to harvest. Most of the fields are self-descriptive. The critical fields are as follows:
· StartTag – where the data parser should start
· EndTag – where the data parser should end
· Delimiter – the character that separates each row/record
· URLs – a one-to-many list of HTML sources for the parser to use
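The four members above can be pictured with a small sketch. Only the member names StartTag, EndTag, Delimiter and URLs come from this article; the property types, the class RowSourceSketch and the helper RowSplitterSketch.SplitRows are assumptions, and the real parser does this work with regular expressions rather than plain string searches.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of a row source; types are assumed.
public class RowSourceSketch
{
    public string StartTag { get; set; } = "";
    public string EndTag { get; set; } = "";
    public string Delimiter { get; set; } = "";
    public List<string> URLs { get; set; } = new List<string>();
}

public static class RowSplitterSketch
{
    // Returns the raw text of each row found between StartTag and EndTag.
    // An empty list means one of the tags was not found in the page.
    public static List<string> SplitRows(string html, RowSourceSketch source)
    {
        var rows = new List<string>();

        int start = html.IndexOf(source.StartTag, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return rows;
        start += source.StartTag.Length;

        int end = html.IndexOf(source.EndTag, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return rows;

        // Cut out the data region, then break it into rows on the delimiter.
        string region = html.Substring(start, end - start);
        string[] parts = region.Split(new[] { source.Delimiter }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            if (part.Trim().Length > 0) rows.Add(part);
        }
        return rows;
    }
}
```

For a page containing a two-row table, a row source with the table's opening tag as StartTag, "</table>" as EndTag and "<tr>" as Delimiter would yield two row strings, each ready to be handed to the field definitions.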
1.6 Defining IDataField Fields
The RDSH parser relies heavily on the HTML tags defined as delimiters, both for the data row and for the individual fields. The three critical fields for IDataField are:
· StartTag – where the data parser should start
· EndTag – where the data parser should end
· Data Type – what type of data the system should parse for, as specified by the FieldDataType enum
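As a rough illustration of tag-delimited field extraction, the sketch below builds a regular expression from a field's StartTag and EndTag and captures the text in between. The class names DataFieldSketch and FieldExtractorSketch and the Name property are hypothetical, and the Data Type member is left out to keep the example short; the library's actual matching logic may differ.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical field definition; only StartTag and EndTag mirror the article.
public class DataFieldSketch
{
    public string Name { get; set; } = "";
    public string StartTag { get; set; } = "";
    public string EndTag { get; set; } = "";
}

public static class FieldExtractorSketch
{
    // Returns the trimmed text between the field's tags within one row,
    // or null when the tags are not present.
    public static string ExtractField(string rowHtml, DataFieldSketch field)
    {
        // Escape the tags so they are matched literally, and capture the
        // shortest run of text between them (lazy quantifier).
        string pattern = Regex.Escape(field.StartTag) + "(.*?)" + Regex.Escape(field.EndTag);
        Match match = Regex.Match(rowHtml, pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value.Trim() : null;
    }
}
```

The lazy quantifier matters here: with a greedy match, a field whose EndTag is a common tag such as "</td>" would swallow every following cell in the row.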
1.7 RDSH Data Model
The following depicts the RDSH data model:
1.8 Parser Agent
The heart of the RDSH system is the PageParser class, which handles all the heavy lifting of parsing. The RDSHService is the entry point into the system. A caller invokes the ProcessByProject method to begin an RDSH processing session. The ProcessByProject method expects a project GUID as its parameter and attempts to load the project configuration into memory. An exception is thrown if the project does not exist in the database, so make sure to add the necessary project configuration data to the database, including all the row sources and their fields.
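The load-or-throw behaviour described above can be sketched as follows. This is a stand-in, not the RDSHService itself: ProjectSketch and ServiceSketch are hypothetical names, an in-memory dictionary replaces the database, and the exception type and return value are assumptions, since the article does not specify them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical project record; the real IProject members are not shown here.
public class ProjectSketch
{
    public Guid Id { get; set; }
    public string Name { get; set; } = "";
}

// In-memory stand-in for the service and its database lookup.
public class ServiceSketch
{
    private readonly Dictionary<Guid, ProjectSketch> _projects =
        new Dictionary<Guid, ProjectSketch>();

    public void AddProject(ProjectSketch project)
    {
        _projects[project.Id] = project;
    }

    // Loads the project configuration or throws when it is missing.
    public ProjectSketch ProcessByProject(Guid projectId)
    {
        ProjectSketch project;
        if (!_projects.TryGetValue(projectId, out project))
            throw new InvalidOperationException("No project configuration found for " + projectId);

        // A real session would now pass each row source to the parser.
        return project;
    }
}
```

A caller would typically wrap the ProcessByProject call in a try/catch so that a missing or incomplete project configuration is logged rather than crashing the host process.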
1.9 Invocation & Automation
One can go a step further and fully automate the hosting of the RDSHService. It could be launched as a Windows Service, a WCF Service, or a cron-style console application. The choice depends entirely on the project requirements.
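For the cron-style console option, a minimal sketch of the entry logic might look like the following. The class ScheduledRunSketch, its Run method and the exit codes are assumptions made for illustration; the processing call is passed in as a delegate so the sketch does not depend on the real service's signature.

```csharp
using System;

public static class ScheduledRunSketch
{
    // Returns a process exit code: 0 on success, 1 on bad arguments,
    // 2 when the processing call throws.
    public static int Run(string[] args, Action<Guid> processByProject)
    {
        Guid projectId;
        if (args.Length != 1 || !Guid.TryParse(args[0], out projectId))
        {
            Console.Error.WriteLine("usage: <project-guid>");
            return 1;
        }

        try
        {
            processByProject(projectId);
            return 0;
        }
        catch (Exception ex)
        {
            // Report the failure so the scheduler's log shows why the run failed.
            Console.Error.WriteLine("Run failed: " + ex.Message);
            return 2;
        }
    }
}
```

A console application's Main method could return the result of Run, passing a lambda that calls the service's ProcessByProject method, so that the scheduler can detect failed runs from the exit code.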
1.10 Handling non-HTML (Java, Silverlight, and Flash Applets)
The RDSH system is limited to parsing pure HTML or text documents. It cannot parse pages hosted solely as Java applets, Silverlight, or Flash, because the data parser is text-based.