Download an HTML page and recursively download every page it links to

nellie 126 Reputation points
2021-02-07T16:26:28.087+00:00

Hi there,
How would you do this?
Start from a main HTML page, download it, retrieve all of its links, and download those sub-pages. Then do the same for every sub-page: retrieve its links and download those pages too.
It's a recursive procedure that gets all the pages, regardless of how deep the links go.
Is there a way to do this in C#?

Thanks.

Developer technologies | ASP.NET | Other
Developer technologies | C#

Accepted answer
  1. Anonymous
    2021-02-08T06:51:19.043+00:00

    Hi @nellie ,

    Based on your description, yes, this can be implemented in C#.

    First, you can use WebClient to download the HTML pages.

    using System.Net;

    using (WebClient client = new WebClient()) // WebClient implements IDisposable
    {
        client.DownloadFile("http://yoursite.com/page.html", @"C:\localfile.html");

        // Or you can get the page content as a string without saving it
        string htmlCode = client.DownloadString("http://yoursite.com/page.html");
    }
    

    Then use Html Agility Pack to find all <a> tags in the downloaded page and filter them down to the hyperlinks you want to download. Real sites will cause problems such as dead links, timeouts, pages that link back to each other, and pages with no links at all, so you need to add exception handling.

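    using System.Collections.Generic;
    using System.Net;
    using HtmlAgilityPack;

    public static int i = 1;
    // URLs already downloaded, so pages that link to each other do not recurse forever.
    public static HashSet<string> visited = new HashSet<string>();

    public static void downloadRes(string url)
    {
        if (!visited.Add(url)) return; // this page was already downloaded

        using (WebClient client = new WebClient())
        {
            client.DownloadFile(url, "D:\\localfile" + i++ + ".html");
        }
        HtmlWeb hw = new HtmlWeb();
        HtmlDocument doc = hw.Load(url);
        // SelectNodes returns null (not an empty collection) when the page has no <a href>.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return;

        foreach (HtmlNode link in links)
        {
            string href = link.Attributes["href"].Value;
            // Only absolute https links are followed; relative links are skipped.
            if (href.StartsWith("https"))
            {
                downloadRes(href);
            }
        }
    }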
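
    The `StartsWith("https")` check skips relative links such as `page2.html` and plain `http` links, and it also follows links to any other site. As a rough sketch of one possible filter, the helper below resolves each href against the page it was found on with `System.Uri`. The names `LinkResolver` and `Resolve`, the `example.com` URLs, and the decision to stay on the starting host are assumptions made for illustration; they are not part of the original answer.

    ```csharp
    using System;

    public static class LinkResolver
    {
        // Hypothetical helper: returns the absolute http(s) URL that href points to
        // when it appears on pageUrl, or null if the link should not be crawled.
        public static string Resolve(string pageUrl, string href)
        {
            if (string.IsNullOrWhiteSpace(href)) return null;

            Uri baseUri = new Uri(pageUrl);
            if (!Uri.TryCreate(baseUri, href.Trim(), out Uri target)) return null;

            // Skip mailto:, javascript:, ftp: and similar schemes.
            if (target.Scheme != Uri.UriSchemeHttp && target.Scheme != Uri.UriSchemeHttps)
            {
                return null;
            }

            // Assumption: stay on the starting host so the crawl stays bounded.
            if (!string.Equals(target.Host, baseUri.Host, StringComparison.OrdinalIgnoreCase))
            {
                return null;
            }

            // Drop the #fragment so page.html#top and page.html count as one page.
            return target.GetLeftPart(UriPartial.Query);
        }
    }
    ```

    For example, `LinkResolver.Resolve("https://example.com/docs/index.html", "page2.html#top")` gives `https://example.com/docs/page2.html`, while a `mailto:` link or a link to another host gives `null`.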
    

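    On a large site, plain recursion can run very deep, and one failed download stops everything. A minimal sketch of an alternative, assuming you are happy to cap the depth, is to use a queue with a visited set and to skip pages that throw. The names `Crawler`, `Crawl`, `maxDepth` and `getLinks` are hypothetical; `getLinks` stands for whatever code downloads one page and returns its links, for instance the WebClient and Html Agility Pack calls shown earlier.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class Crawler
    {
        // Visits startUrl and everything reachable from it, at most maxDepth links away.
        // getLinks downloads one page and returns the absolute URLs it links to.
        // Returns the pages that were fetched successfully, in visiting order.
        public static List<string> Crawl(string startUrl, int maxDepth,
                                         Func<string, IEnumerable<string>> getLinks)
        {
            var visited = new HashSet<string>();
            var fetched = new List<string>();
            var queue = new Queue<(string Url, int Depth)>();

            visited.Add(startUrl);
            queue.Enqueue((startUrl, 0));

            while (queue.Count > 0)
            {
                var (url, depth) = queue.Dequeue();
                List<string> links;
                try
                {
                    // Copy into a list so lazy enumeration errors are caught here too.
                    links = new List<string>(getLinks(url));
                    fetched.Add(url);
                }
                catch (Exception ex)
                {
                    // A dead link or timeout should not abort the whole crawl.
                    Console.Error.WriteLine("Skipping " + url + ": " + ex.Message);
                    continue;
                }

                if (depth >= maxDepth) continue; // do not follow links any deeper

                foreach (string link in links)
                {
                    // Add returns false for URLs that are already queued or visited.
                    if (visited.Add(link)) queue.Enqueue((link, depth + 1));
                }
            }
            return fetched;
        }
    }
    ```

    Because the download step is passed in as a delegate, the traversal can be tried out against an in-memory dictionary of links before pointing it at a real site.
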
    Hope this helps.

    Best regards,
    Xudong Peng


