question

0 Votes
EduardoGomez-1870 asked · TimonYang-MSFT edited

web scraping

Hello everyone

I am working on a small web scraping app that lists all the ingredients for a recipe app I am making (I will deploy it to Azure).

For some reason, I am not getting the ingredients from all of the pages, and some letters are missing entirely, for example U, T, Z, Y and V.

I managed to fix the other error.

 static void Main(string[] args) {
         NewMethod();
     }

     private static void NewMethod() {

         List<string> list = new List<string>();

         var web = new HtmlWeb();
         for (char alphabet = 'a'; alphabet < 'z'; alphabet++) {

             var doc = web.Load($"https://www.bbc.co.uk/food/ingredients/a-z/{alphabet}");

             HtmlNodeCollection pagesNum = doc.DocumentNode.SelectNodes("//a[@class = 'pagination__link gel-pica-bold']/@href");

             if (pagesNum == null) {
                 var nodes = doc.DocumentNode.SelectNodes(
                       "//*[@class = 'gel-layout__item gel-1/2 gel-1/3@m gel-1/4@xl']");
                 foreach (var item in nodes) {
                     list.Add(item.InnerText.Trim().Replace("ingredient", string.Empty));
                 }
             } else {
                 for (int i = 1; i < pagesNum.Count; i++) {
                     Console.WriteLine($"alphabet: {alphabet} | page {i}\n");
                     var nodes = doc.DocumentNode.SelectNodes(
                         "//*[@class = 'gel-layout__item gel-1/2 gel-1/3@m gel-1/4@xl']");
                     foreach (var element in nodes) {
                         list.Add((element.InnerText.Trim().Replace("ingredient", string.Empty)));
                     }
                 }
             }
         }
     }
 }

}

By the way, this is the website I am scraping: https://www.bbc.co.uk/food/ingredients/a-z/a/1#featured-content

dotnet-csharp


Did you determine which line gives the error?

I think you should call web.Load inside the loop for each of the pages detected by your first SelectNodes. After loading a page, use your second SelectNodes to extract the data.

By the way, it probably makes no sense to remove “ingridient”, because that word does not seem to appear anywhere.



1 Answer

1 Vote
TimonYang-MSFT answered · TimonYang-MSFT edited

The problem occurs when alphabet is 'e', not 'd'.

Please pay attention to the structure of the original website. You get the number of pages from the pagination tags in the page, like this:

82371-2.png

But the letter e has only one page, so these pagination tags are omitted from the page. As a result, this call:

doc.DocumentNode.SelectNodes("//a[@class ='pagination__link gel-pica-bold']/@href")

returns null, and accessing Count on null throws a NullReferenceException, which is the current problem.

Update

When there is only one page, we can simply set pagesNum to 1.
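
A minimal sketch of that fallback, kept separate from the scraping code so it can be checked on its own. PageCount is a hypothetical helper name, not part of HtmlAgilityPack; it assumes the collection returned by SelectNodes can be passed as an ICollection<T>, and it treats both null and an empty collection as a single page:

```csharp
using System.Collections.Generic;

static class Pagination
{
    // Hypothetical helper: a missing pagination block (null or empty collection)
    // is interpreted as "this letter has exactly one page".
    public static int PageCount<T>(ICollection<T> paginationLinks)
    {
        if (paginationLinks == null || paginationLinks.Count == 0)
            return 1;

        return paginationLinks.Count;
    }
}
```

With a helper like this, the page count is always at least 1, so the loop over pages never dereferences a null collection.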

There are also a few smaller problems.

The current code calls web.Load only once per letter, so only the first page is ever loaded, no matter how many pages there are.

In addition, no ingredient starts with x, so requesting the letter "x" automatically redirects to the homepage. We can detect this by checking whether the loaded page contains the letter list.

    var web = new HtmlWeb();
    for (char alphabet = 'a'; alphabet <= 'z'; alphabet++)
    {
        // Load the first page of the letter to find out how many pages it has.
        var doc = web.Load($"https://www.bbc.co.uk/food/ingredients/a-z/{alphabet}");
        var nodes = doc.DocumentNode.SelectNodes("//a[@class = 'pagination__link gel-pica-bold']/@href");
        // SelectNodes returns null when nothing matches, i.e. when the letter has a single page.
        var pagesNum = nodes == null ? 1 : nodes.Count;

        for (int i = 1; i <= pagesNum; i++)
        {
            System.Console.WriteLine($"alphabet: {alphabet} | page {i}\n");
            doc = web.Load($"https://www.bbc.co.uk/food/ingredients/a-z/{alphabet}/{i}");

            // No ingredient starts with x, so when alphabet is 'x' the site redirects to the homepage.
            // The homepage has no letter list, so its absence means this letter can be skipped.
            if (doc.DocumentNode.SelectNodes("//*[@class = 'az-keyboard__list']") == null)
                break;

            var ingredients = doc.DocumentNode.SelectNodes("//*[@class = 'gel-layout__item gel-1/2 gel-1/3@m gel-1/4@xl']");
            if (ingredients == null)
                continue; // Nothing matched on this page.

            foreach (var item in ingredients)
            {
                System.Console.WriteLine(item.InnerText.Replace("ingredient", string.Empty));
            }
        }
    }
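
The URL pattern used in the inner loop can also be isolated into a small helper, which makes it easier to see that every page of a letter is requested. This is only a sketch: PageUrls and IngredientPages are hypothetical names, and the {letter}/{page} path layout is taken from the URLs in this thread, not from any documented API.

```csharp
using System.Collections.Generic;

static class IngredientPages
{
    private const string BaseUrl = "https://www.bbc.co.uk/food/ingredients/a-z";

    // Hypothetical helper: yields one URL per page of a letter, with pages numbered from 1.
    public static IEnumerable<string> PageUrls(char letter, int pageCount)
    {
        for (int page = 1; page <= pageCount; page++)
        {
            yield return $"{BaseUrl}/{letter}/{page}";
        }
    }
}
```

Each URL produced this way would then be passed to web.Load, exactly as the inner loop does.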




Thank you very much. What do you suggest I do?

How can I optimize this? I have one for loop nested inside another.
