Read an HTML table in C#

Maurizio Porro 40 Reputation points
2024-11-28T08:33:43.3033333+00:00

Hello

I'm trying to read an HTML table from this page:

https://www.borsaitaliana.it/borsa/obbligazioni/mot/btp/lista.html?lang=it&page=1

I wrote this code:

string strLineOut = "";
string pathOut = @"C:\Users\porro\OneDrive\Desktop\sqlintegerdatatypes.csv";
string delimiter = "|"; // often we use a comma for this
var html = @"https://www.borsaitaliana.it/borsa/obbligazioni/mot/btp/lista.html?lang=it&page=1";
//var html = @"https://begincodingnow.com/sql-server-integer-data-types/";

Console.WriteLine("Scraping tables from: " + html);

HtmlWeb doc = new HtmlWeb();
var htmlDoc = doc.Load(html);

using (var sw = new StreamWriter(pathOut, true))
{
    foreach (HtmlNode table in htmlDoc.DocumentNode.SelectNodes("//table"))
    {
        Console.WriteLine("\nFound: " + table.Name);
        //sw.WriteLine("\nFound: " + table.Name);

        foreach (HtmlNode row in table.SelectNodes("tr"))
        {
            strLineOut = "";
            Console.WriteLine("");

            foreach (HtmlNode cell in row.SelectNodes("tr|td"))
            {
                Console.Write(cell.InnerText + delimiter);
                strLineOut = strLineOut + cell.InnerText + delimiter;
            }

            // remove last separator from string as it creates column at end
            strLineOut = strLineOut.Substring(0, strLineOut.Length - 1);
            sw.WriteLine(strLineOut);
        }
    }
}

When I run it in Visual Studio, I get this error:

System.NullReferenceException: 'Object reference not set to an instance of an object.'

at this line:

foreach (HtmlNode cell in row.SelectNodes("tr|td"))

Can you please help me resolve the problem?

Thank you in advance

Goodbye

Developer technologies | C#

Accepted answer
  1. Anonymous
    2024-11-28T09:08:34.7566667+00:00

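    Hi @Maurizio Porro, welcome to Microsoft Q&A.

    The error occurs because row.SelectNodes("tr|td") returns null. In HtmlAgilityPack, SelectNodes returns null rather than an empty collection when no node matches, so the foreach throws as soon as it reaches a row with no tr or td children, for example a header row whose cells are all th elements.

    The XPath "tr|td" is incorrect, because a row does not contain other rows. You likely want row.SelectNodes("th|td") or row.SelectNodes("td"), with a null check before iterating. The code below uses the descendant form ".//th|.//td":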

    using HtmlAgilityPack;
    using System;
    using System.IO;
    
    class Program
    {
        static void Main()
        {
            string strLineOut = "";
            string pathOut = ""; // set this to the output file path
            string delimiter = "|";
            string url = ""; // set this to the URL of the page to scrape
    
            Console.WriteLine("Scraping tables from: " + url);
    
            HtmlWeb web = new HtmlWeb();
            var htmlDoc = web.Load(url);
    
            using (var sw = new StreamWriter(pathOut, true))
            {
                foreach (HtmlNode table in htmlDoc.DocumentNode.SelectNodes("//table"))
                {
                    Console.WriteLine("\nFound: " + table.Name);
    
                    foreach (HtmlNode row in table.SelectNodes(".//tr"))
                    {
                        strLineOut = "";
                        var cells = row.SelectNodes(".//th|.//td");
                        if (cells != null)
                        {
                            foreach (HtmlNode cell in cells)
                            {
                                Console.Write(cell.InnerText.Trim() + delimiter);
                                strLineOut += cell.InnerText.Trim() + delimiter;
                            }
    
                            if (strLineOut.Length > 0)
                            {
                                strLineOut = strLineOut.Substring(0, strLineOut.Length - 1);
                            }
                            sw.WriteLine(strLineOut);
                        }
                    }
                }
            }
    
            Console.WriteLine("Scraping completed.");
        }
    }
    
    

    Best Regards,

    Jiale


    If the answer is the right solution, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment". 

    Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.


1 additional answer

  1. Ki-lianK-7341 935 Reputation points
    2024-11-28T08:48:30.2+00:00

    Hello!

    The error you’re encountering, System.NullReferenceException, suggests that the SelectNodes method is returning null because it can’t find any matching nodes. This often happens when the XPath expression is incorrect or the HTML structure doesn’t match your expectations.

    Key Changes:

    1. Null Checks: Added checks to ensure SelectNodes returns non-null results before proceeding.
    2. Node Selection: Changed row.SelectNodes("tr|td") to row.SelectNodes("th|td") to correctly select table header and data cells.

    This should help you avoid the NullReferenceException and give you more insight into where the issue might be occurring.
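
The code these two changes refer to is not shown in this answer. A minimal sketch of what they might look like follows. The names TableReader and ReadRows are hypothetical, and the sketch parses an HTML string with HtmlDocument.LoadHtml instead of downloading the page with HtmlWeb.Load, so that it can be tried offline. It assumes the HtmlAgilityPack NuGet package is referenced, as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package used in the question

// Hypothetical helper, for illustration only.
public static class TableReader
{
    // Returns one delimited line per table row that has at least one cell.
    public static List<string> ReadRows(string html, string delimiter = "|")
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();

        // Change 1 (null checks): SelectNodes returns null, not an empty
        // collection, when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//table//tr");
        if (rows == null)
        {
            return lines;
        }

        foreach (HtmlNode row in rows)
        {
            // Change 2 (node selection): "th|td" matches header and data
            // cells that are direct children of the row.
            var cells = row.SelectNodes("th|td");
            if (cells == null)
            {
                continue; // a row without cells is skipped instead of throwing
            }

            // string.Join avoids having to strip a trailing delimiter.
            lines.Add(string.Join(delimiter, cells.Select(c => c.InnerText.Trim())));
        }

        return lines;
    }
}
```

To apply this to the page from the question, the string returned by ReadRows for each row could be passed to sw.WriteLine, with the HTML taken from the loaded document instead of an inline string.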

