[Dev Tutorial] WebBrowser - Web Scraping

Have you ever wanted to build an application that scrapes data from a web site?
If so, this article is for you.

The .NET Framework is a large and powerful framework that covers almost anything you might want to do in your application.
It provides the WebBrowser class, which is a Windows Forms class, but you can use it in any .NET project type as long as the project references the System.Windows.Forms assembly and the control is created on a single-threaded apartment (STA) thread.
Since I am a C# developer, I will show the examples in C#.

WebBrowser class

The WebBrowser class is a very powerful class that lets you read and modify a page's HTML, navigate between sites, call JavaScript functions and do many other cool things.
Sure, there are other classes that can do the job for web scraping, but the WebBrowser class is the easiest one to learn.
To work with the WebBrowser class you first need to learn what it can do, so let's start with the basics.
Because it is a non-static class, you need to create an instance of it.
I will create the instance here so that you will recognize it in the following code examples.

System.Windows.Forms.WebBrowser wb = new System.Windows.Forms.WebBrowser();

You can find everything about this class on MSDN, but I will show and explain only the things we need to build a basic web scraper.
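If your project is not a Windows Forms project, the following is a minimal sketch of how the instance could be hosted, for example in a console application. It assumes the project references System.Windows.Forms; the class and method names (Program, RunBrowser) are only illustrative. WebBrowser wraps an ActiveX control, so it is created on an STA thread, and it raises its events only while a message loop is running.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

static class Program
{
    static void Main()
    {
        // WebBrowser wraps an ActiveX control, so it must live on an STA thread.
        var thread = new Thread(RunBrowser);
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
    }

    static void RunBrowser()
    {
        using (var wb = new WebBrowser())
        {
            wb.ScriptErrorsSuppressed = true; // do not show script error dialogs

            wb.DocumentCompleted += (sender, e) =>
            {
                Console.WriteLine("Loaded: " + wb.DocumentTitle);
                Application.ExitThread(); // ends the message loop started below
            };

            wb.Navigate("https://www.imdb.com");

            // The control raises its events only while a message loop is running.
            Application.Run();
        }
    }
}
```

In a Windows Forms project none of this is needed, because the form already runs on an STA thread with a message loop.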

Methods in WebBrowser class

Navigate

This is the method you must have in order to make a web scraper. When it is called, your application starts connecting to the given URL of the web site.
Navigate does not wait for the page: it returns immediately and the page loads in the background. Once loading has finished (the DocumentCompleted event is raised), the entire HTML document is available and you can easily work with it. An example of navigating to a site:

wb.Navigate("www.imdb.com");

When you call this method, your application connects to www.imdb.com, a site where you can find information about movies such as the rating, the release year, the list of actors and so on. In the code that follows we will try to scrape that information into your application.
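Because the page is not loaded yet when Navigate returns, code that reads the document usually goes into a DocumentCompleted handler. Here is a minimal sketch of that idea. The helper name NavigateAndThen is hypothetical; DocumentCompleted, ReadyState and Document are real WebBrowser members. The ReadyState check is one common way to skip the extra events raised for frames, not the only one.

```csharp
using System;
using System.Windows.Forms;

static class NavigationExample
{
    // Hypothetical helper: navigates and calls onLoaded once the page has loaded.
    public static void NavigateAndThen(WebBrowser wb, string url, Action<HtmlDocument> onLoaded)
    {
        WebBrowserDocumentCompletedEventHandler handler = null;
        handler = (sender, e) =>
        {
            // DocumentCompleted can also fire for frames; wait for the whole page.
            if (wb.ReadyState != WebBrowserReadyState.Complete) return;

            wb.DocumentCompleted -= handler; // run the callback only once
            onLoaded(wb.Document);
        };

        wb.DocumentCompleted += handler;
        wb.Navigate(url);
    }
}
```

It could be used like this: NavigationExample.NavigateAndThen(wb, "https://www.imdb.com", doc => listBox1.Items.Add(doc.Title));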

Stop

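This method cancels any pending navigation and stops dynamic page elements such as background sounds and animations, so it comes in useful if you want to silence sounds coming from a web site. Sounds are what annoy me the most when I navigate through web sites with my application. Keep in mind that calling Stop right after Navigate also cancels the page load itself, so call it only after the page has loaded, or when you really want to abort the load. As far as I can tell, Stop does not turn off the click sound that Windows plays when a navigation starts, because that is a system sound and not part of the page. This is an example of how to call the method: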

wb.Stop();
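As a sketch of where such a call could live, here is a small helper for something like a "Cancel" button. The name CancelIfLoading is hypothetical; IsBusy is a real WebBrowser property that tells whether a document is currently being loaded.

```csharp
using System.Windows.Forms;

static class StopExample
{
    // Hypothetical helper: returns true if a load was in progress and got cancelled.
    public static bool CancelIfLoading(WebBrowser wb)
    {
        if (!wb.IsBusy) return false; // nothing is loading right now

        wb.Stop(); // cancels the pending navigation and stops page sounds and animations
        return true;
    }
}
```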

Properties in WebBrowser class

Document

This property gets an HtmlDocument that represents the page currently loaded from the web site. Once you have the HtmlDocument in your application you can read or write the attribute values of its tags. This is an example:

wb.Document.GetElementById("navbar-query").SetAttribute("value", textBox1.Text);

As you can see, I used the textBox1.Text value as the "value" attribute of the HTML element. It is comparable to typing input into a console, except that the input goes into the search box of the web page.
It is wise to check in your application whether textBox1.Text is empty, because it makes no sense to put empty input into the "value" attribute in the HtmlDocument. Also note that wb.Document is null until a page has loaded, and GetElementById returns null when no element has the given id.
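The checks mentioned above could be sketched like this. The helper name TrySetSearchText is hypothetical, and the element id "navbar-query" is the one used in this article, so it may not match the current site.

```csharp
using System.Windows.Forms;

static class SearchBoxExample
{
    // Hypothetical helper: returns true only if the value was actually written.
    public static bool TrySetSearchText(WebBrowser wb, string query)
    {
        if (string.IsNullOrWhiteSpace(query)) return false; // nothing useful to search for
        if (wb.Document == null) return false;              // no page has loaded yet

        HtmlElement input = wb.Document.GetElementById("navbar-query");
        if (input == null) return false;                    // the id is not on this page

        input.SetAttribute("value", query.Trim());
        return true;
    }
}
```

A call such as SearchBoxExample.TrySetSearchText(wb, textBox1.Text) then tells you whether it makes sense to go on and click the search button.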

I mentioned that you can work with JavaScript functions. So let's see how to use the "click" function to programmatically click a button on the web site.

HtmlElement acceptButton = wb.Document.GetElementById("navbar-submit-button");

if (acceptButton != null)
{
    acceptButton.InvokeMember("click");
}

It is wise to also check on imdb.com that the button with the element id "navbar-submit-button" exists. Your program will click that button with the code I provided.
We simply use an instance of the HtmlElement class to store the "navbar-submit-button" element from the document. Once it is stored, we check whether it is null. If it is not null, the code calls the "click" function to click that specific button.

Web scraping

// All <table> elements of the loaded page
HtmlElementCollection tables = wb.Document.GetElementsByTagName("table");
try
{
    if (tables.Count <= 0) return; // no table, nothing to scrape

    // Rows of the first table
    HtmlElementCollection rows = tables[0].GetElementsByTagName("tr");
    foreach (HtmlElement row in rows)
    {
        // Cells of the current row
        HtmlElementCollection cells = row.GetElementsByTagName("td");
        foreach (HtmlElement cell in cells)
        {
            String text = cell.InnerText;
            if (!String.IsNullOrEmpty(text) && !String.IsNullOrWhiteSpace(text))
            {
                listBox1.Items.Add(text);
            }
        }
    }
}
catch (ArgumentOutOfRangeException exc)
{
    // Show the error in the list instead of a pop-up
    listBox1.Items.Add(exc.Message);
}

Now I will explain the most exciting part of this "how to". This code displays your search results in the listBox1 control, which you have to create in your Form or Window.
I created an instance of the HtmlElementCollection class that stores all table tags of the document. Because I ran into an ArgumentOutOfRangeException, I suggest using a try-catch statement: instead of a pop-up MessageBox with the exception message that the user has to close, the message is shown in the listBox1 control, which makes your application more user friendly. It is very important to check whether tables.Count is less than or equal to zero; if it is, the method simply returns. Otherwise the code creates a new HtmlElementCollection named "rows" that stores all "tr" elements of the first table.
The foreach loop goes through every row in the collection and, for each one, stores its "td" elements in the "cells" HtmlElementCollection.
A second, nested foreach loop goes through every cell in the collection so you can read its InnerText and store it in a string variable. This is the most important part.
Finally, the code checks that both String.IsNullOrEmpty and String.IsNullOrWhiteSpace return false for the text variable before adding that InnerText to the listBox1 control. (String.IsNullOrWhiteSpace alone would be enough, since it also returns true for null and empty strings.)
Now run your code and search for a movie; the search results will be displayed in the listBox1 control.
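To show how the pieces fit together, here is a minimal sketch of a complete form. It is only one possible arrangement, not the definitive one. The class name ScraperForm, the searchButton control and the step field are my own assumptions; the element ids are the ones used in this article and may not match the current site. The sketch also reads the "td" elements of the first table directly instead of going row by row.

```csharp
using System;
using System.Windows.Forms;

public class ScraperForm : Form
{
    private readonly WebBrowser wb = new WebBrowser { ScriptErrorsSuppressed = true, Dock = DockStyle.Bottom, Height = 250 };
    private readonly TextBox textBox1 = new TextBox { Dock = DockStyle.Top };
    private readonly Button searchButton = new Button { Text = "Search", Dock = DockStyle.Top };
    private readonly ListBox listBox1 = new ListBox { Dock = DockStyle.Fill };

    // 0: waiting for the start page, 1: waiting for the results page, 2: done
    private int step = 2;

    public ScraperForm()
    {
        Controls.Add(listBox1);
        Controls.Add(searchButton);
        Controls.Add(textBox1);
        Controls.Add(wb);
        searchButton.Click += (s, e) => StartSearch();
        wb.DocumentCompleted += OnDocumentCompleted;
    }

    private void StartSearch()
    {
        if (string.IsNullOrWhiteSpace(textBox1.Text)) return; // nothing to search for
        listBox1.Items.Clear();
        step = 0;
        wb.Navigate("https://www.imdb.com");
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Ignore events raised before the whole page is ready.
        if (wb.ReadyState != WebBrowserReadyState.Complete || wb.Document == null) return;

        if (step == 0)
        {
            // Start page loaded: fill in the query and click the search button.
            HtmlElement input = wb.Document.GetElementById("navbar-query");
            HtmlElement button = wb.Document.GetElementById("navbar-submit-button");
            if (input == null || button == null) return; // ids not found on this page

            input.SetAttribute("value", textBox1.Text);
            step = 1;
            button.InvokeMember("click");
        }
        else if (step == 1)
        {
            // Results page loaded: copy the cell text of the first table.
            HtmlElementCollection tables = wb.Document.GetElementsByTagName("table");
            if (tables.Count == 0) return; // no table yet, keep waiting

            foreach (HtmlElement cell in tables[0].GetElementsByTagName("td"))
            {
                if (!string.IsNullOrWhiteSpace(cell.InnerText))
                    listBox1.Items.Add(cell.InnerText.Trim());
            }
            step = 2; // do not scrape the same page twice
        }
    }

    [STAThread]
    static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new ScraperForm());
    }
}
```

The step field is a deliberately simple way to tell the start page from the results page. A real application would probably need something more robust, for example checking wb.Url before scraping.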