web scraping

Eduardo Gomez 3,431 Reputation points
2021-03-29T01:02:20.113+00:00

Hello everyone

I am working on a small web scraping app that lists all the ingredients for a recipe app I am making (I will upload it to Azure).

For some reason, I am not getting the ingredients from all the pages, and some letters are missing entirely, for example U, T, Z, Y and V.

I managed to fix the other error.

static void Main(string[] args) {
    NewMethod();
}

private static void NewMethod() {

    List<string> list = new List<string>();

    var web = new HtmlWeb();
    for (char alphabet = 'a'; alphabet < 'z'; alphabet++) {

        var doc = web.Load($"https://www.bbc.co.uk/food/ingredients/a-z/{alphabet}");

        HtmlNodeCollection pagesNum = doc.DocumentNode.SelectNodes("//a[@class = 'pagination__link gel-pica-bold']/@href");

        if (pagesNum == null) {
            var nodes = doc.DocumentNode.SelectNodes(
                "//*[@class = 'gel-layout__item gel-1/2 gel-1/3@m gel-1/4@xl']");
            foreach (var item in nodes) {
                list.Add(item.InnerText.Trim().Replace("ingredient", string.Empty));
            }
        } else {
            for (int i = 1; i < pagesNum.Count; i++) {
                Console.WriteLine($"alphabet: {alphabet} | page {i}\n");
                var nodes = doc.DocumentNode.SelectNodes(
                    "//*[@class = 'gel-layout__item gel-1/2 gel-1/3@m gel-1/4@xl']");
                foreach (var element in nodes) {
                    list.Add(element.InnerText.Trim().Replace("ingredient", string.Empty));
                }
            }
        }
    }
}

By the way, this is the website I am scraping: https://www.bbc.co.uk/food/ingredients/a-z/a/1#featured-content

C#

Accepted answer
  1. Timon Yang-MSFT 9,591 Reputation points
    2021-03-29T08:59:48.25+00:00

    The problem occurs when alphabet is 'e', not 'd'.

    Look at the markup of the original website. You derive the number of pages from the pagination links in the page, as shown in this screenshot:

    82371-2.png

    When alphabet is 'e', there is only one page, so the pagination links are omitted from the page. As a result, this line:

    doc.DocumentNode.SelectNodes("//a[@class ='pagination__link gel-pica-bold']/@href") returns null, and accessing Count on that null result is what causes the current problem.

    Update

    When there is only one page, we can directly set pagesNum to 1.
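
As a minimal sketch of that fallback, the helper below treats a null collection as "one page". The names Pagination and PageCount are hypothetical and only for illustration. The parameter stands in for whatever SelectNodes returned; plain lists are used here so the example runs without HtmlAgilityPack or a network connection.

```csharp
using System;
using System.Collections.Generic;

static class Pagination
{
    // Hypothetical helper: a null collection means no pagination links
    // were found, so the letter is treated as having a single page.
    public static int PageCount<T>(ICollection<T> paginationLinks)
    {
        return paginationLinks == null ? 1 : paginationLinks.Count;
    }
}

static class PaginationDemo
{
    static void Main()
    {
        List<string> noLinks = null;                                  // single-page letter
        var threeLinks = new List<string> { "/a/1", "/a/2", "/a/3" }; // multi-page letter

        Console.WriteLine(Pagination.PageCount(noLinks));    // prints 1
        Console.WriteLine(Pagination.PageCount(threeLinks)); // prints 3
    }
}
```

Keeping the null check in one place means the page loop never has to touch a possibly null collection.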

    There are some other minor problems.

    The current code loads the document only once per letter, so it only ever reads the first page no matter how many pages there are. Each page has to be loaded separately.

    The loop condition alphabet < 'z' also stops before 'z', so the code below uses <= instead.

    In addition, no ingredient starts with x, so when the letter is "x" the site automatically redirects to the homepage. We can detect this by checking whether the currently loaded page contains the letter list.

        var web = new HtmlWeb();
        for (char alphabet = 'a'; alphabet <= 'z'; alphabet++)
        {
            // Load the first page for this letter to find out how many pages exist.
            var doc = web.Load($"https://www.bbc.co.uk/food/ingredients/a-z/{alphabet}");
            var nodes = doc.DocumentNode.SelectNodes("//a[@class = 'pagination__link gel-pica-bold']/@href");
            // SelectNodes returns null when nothing matches, which means there is only one page.
            var pagesNum = nodes == null ? 1 : nodes.Count;

            for (int i = 1; i <= pagesNum; i++)
            {
                System.Console.WriteLine($"alphabet: {alphabet} | page {i}\n");
                doc = web.Load($"https://www.bbc.co.uk/food/ingredients/a-z/{alphabet}/{i}");

                // No ingredient starts with x, so when alphabet is 'x' the page redirects to the homepage.
                // A missing letter list tells us the current doc is the homepage.
                if (doc.DocumentNode.SelectNodes("//*[@class = 'az-keyboard__list']") == null)
                    break;

                var ingredients = doc.DocumentNode.SelectNodes("//*[@class = 'gel-layout__item gel-1/2 gel-1/3@m gel-1/4@xl']");
                // Skip the page if no ingredient nodes were found, instead of failing on null.
                if (ingredients == null)
                    continue;

                foreach (var item in ingredients)
                {
                    System.Console.WriteLine(item.InnerText.Replace("ingredient", string.Empty));
                }
            }
        }
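
One detail worth spelling out: the question's loop uses alphabet < 'z', while the loop above uses alphabet <= 'z'. The sketch below only illustrates that off-by-one difference. Letters and Range are hypothetical names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;

static class Letters
{
    // Hypothetical helper: collects the letters visited by a loop starting at 'a'.
    // includeLast selects the inclusive bound (<= 'z') instead of the exclusive one (< 'z').
    public static List<char> Range(bool includeLast)
    {
        var letters = new List<char>();
        for (char c = 'a'; includeLast ? c <= 'z' : c < 'z'; c++)
        {
            letters.Add(c);
        }
        return letters;
    }
}

static class LettersDemo
{
    static void Main()
    {
        Console.WriteLine(Letters.Range(false).Count); // prints 25, 'z' is never visited
        Console.WriteLine(Letters.Range(true).Count);  // prints 26, 'z' is included
    }
}
```

With the exclusive bound, the ingredients under "z" are never requested, however the rest of the loop behaves.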
    
