C# and HtmlAgilityPack: need help extracting 2 pieces of text

moondaddy 881 Reputation points
2021-02-12T18:05:24.077+00:00

Using VS 2019, .NET 4.8, and HtmlAgilityPack v1.4.9.0. I need help extracting the text shown in the red rectangles in the screenshot below.

The text in the green rectangle is unique in the entire document and therefore can be used as a starting point to find the other 2.

Attached is a text file with all the html for the "section" element for this example.

67438-image.png

67408-sample-personname.txt

Thank you for any help you can offer!


Accepted answer
  1. moondaddy 881 Reputation points
    2021-02-12T20:40:08.793+00:00

    OK, some success: I got the first one, but I can't seem to find the XPath for the second one:

    //h3 [@data-auto='Company Summary']/../following-sibling::div/div[1]/div[1]/div[1]/div[2]/address[1]/div  
    

    67517-image.png

    But I need help with the XPath for this one:
    67509-image.png

    I thought this would work, but it didn't:
    //h3 [@data-auto='Company Summary']/../following-sibling::div/div1/div1/div1/div2/address2/div/following-sibling
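The last expression above is not valid XPath as written: a positional predicate needs brackets (div[1], not div1), and following-sibling is an axis, so it must be followed by "::" and a node test. The sketch below shows one way the sibling after the contact name might be selected. The HTML fragment is made up for illustration (the real nesting is only in the attached file), and the class and method names are hypothetical; only the data-auto values come from this thread.

```csharp
using System;
using HtmlAgilityPack; // NuGet package, not part of the .NET Framework

public static class SiblingAxisDemo
{
    // Hypothetical fragment: the data-auto values follow the thread,
    // but the nesting of the real page is an assumption.
    public const string SampleHtml =
        "<section><div><h3 data-auto='Company Summary'>Company Summary</h3></div>" +
        "<div><address data-auto='details-contact'>" +
        "<div data-auto='details-contact-name'>Jane Doe</div>" +
        "<p>Chief Executive Officer</p>" +
        "</address></div></section>";

    public static string TitleAfterName(HtmlDocument doc)
    {
        // following-sibling:: is an axis, so it needs a node test ("*" = any element);
        // [1] then keeps only the nearest following sibling element.
        HtmlNode title = doc.DocumentNode.SelectSingleNode(
            "//h3[@data-auto='Company Summary']/../following-sibling::div" +
            "//div[@data-auto='details-contact-name']/following-sibling::*[1]");
        return title == null ? null : title.InnerText.Trim();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);
        Console.WriteLine(TitleAfterName(doc));
    }
}
```

Anchoring on the data-auto attribute instead of a chain of div[1]/div[2] steps should make the expression less sensitive to small layout changes, though it still depends on the page keeping those attribute values.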


3 additional answers

  1. Michael Taylor 41,811 Reputation points
    2021-02-12T19:15:50.62+00:00

    I'm not convinced that using the anchor element gains you anything. The nodes you want are related to it only as children/siblings, so you'd still have to navigate a little. Assuming this is not part of a page that repeats this HTML many times, I think it is simply easier to query for the exact items you want. For the purposes of an example, I'm going to assume you already have a variable named section that points to the root section node you posted. How you get there is up to you.

    var contactName = "";
    var contactTitle = "";

    // Find the contact information. The leading "." keeps the search inside
    // section; a path starting with "//" searches the whole document.
    var contactNode = section.SelectSingleNode(".//address[@data-auto='details-contact']");
    if (contactNode != null)
    {
        // Now find the contact name.
        // Approach 1 would be to use the class path .//div[@class='details-shared-tile__main']
        // Approach 2 is to get the contact-name element and work down; since we need this
        //  for the title anyway we'll go that route
        var contactNameNode = contactNode.SelectSingleNode(".//div[@data-auto='details-contact-name'][1]");

        if (contactNameNode != null)
        {
            contactName = contactNameNode.InnerText.Trim();

            // Get the title, which is a paragraph in the middle of the divs.
            // It immediately follows the contact name element, but NextSibling can
            // return a whitespace text node first, so skip ahead to the next element.
            var titleNode = contactNameNode.NextSibling;
            while (titleNode != null && titleNode.NodeType != HtmlNodeType.Element)
                titleNode = titleNode.NextSibling;
            if (titleNode != null)
                contactTitle = titleNode.InnerText.Trim();
        }
    }
    

    Note that you could convert this to a function to make it easier to reuse, but ultimately this code relies on what should be unique within the section to find things. It is heavily dependent upon the generated HTML, so if the HTML changes then the lookups will have to change too. To get the contact name we first find the contact element (the address node), which is the parent of all the data we want. To find the name we could either search for the CSS class or just assume the name is the first child div. I went the latter route as it is more reliable in this case, given that the CSS class looks like it came from a master/detail webforms page and therefore may change. Once we have the name, the title (which is a paragraph in the middle of the divs) is found by jumping to the next sibling.

    This isn't guaranteed to work in all cases so some heuristics may need to be added to try to detect things if you don't find them initially but this should give you a good start.
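As a rough sketch of the function version mentioned above, with one simple fallback heuristic: if the data-auto lookup for the name fails, the first child div of the address is tried instead. The names ContactExtractor, TryGetContact and NextElement are hypothetical, the sample HTML in Main is invented, and the fallback is only an assumption about the page layout.

```csharp
using System;
using HtmlAgilityPack; // NuGet package, not part of the .NET Framework

public static class ContactExtractor
{
    // Returns the first sibling after 'node' that is an element, skipping
    // whitespace text nodes and comments; null if there is none.
    private static HtmlNode NextElement(HtmlNode node)
    {
        HtmlNode next = node.NextSibling;
        while (next != null && next.NodeType != HtmlNodeType.Element)
            next = next.NextSibling;
        return next;
    }

    public static bool TryGetContact(HtmlNode section, out string name, out string title)
    {
        name = "";
        title = "";
        if (section == null) return false;

        // ".//" keeps the search below 'section'; "//" would search the whole document.
        HtmlNode contact = section.SelectSingleNode(".//address[@data-auto='details-contact']");
        if (contact == null) return false;

        HtmlNode nameNode =
            contact.SelectSingleNode(".//div[@data-auto='details-contact-name']")
            // Assumed fallback: treat the first child div of the address as the name.
            ?? contact.SelectSingleNode("./div[1]");
        if (nameNode == null) return false;

        name = HtmlEntity.DeEntitize(nameNode.InnerText).Trim();

        // The title is optional here: a missing sibling leaves it empty.
        HtmlNode titleNode = NextElement(nameNode);
        if (titleNode != null)
            title = HtmlEntity.DeEntitize(titleNode.InnerText).Trim();
        return true;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<section><address data-auto='details-contact'>\n" +
                     "  <div data-auto='details-contact-name'>Jane Doe</div>\n" +
                     "  <p>Chief Executive Officer</p>\n</address></section>");
        string name, title;
        if (TryGetContact(doc.DocumentNode.SelectSingleNode("//section"), out name, out title))
            Console.WriteLine(name + " / " + title);
    }
}
```

Returning a bool with out parameters lets the caller tell "section has no contact block" apart from "contact found but no title", which may help when deciding whether further heuristics are needed.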


  2. Viorel 106.3K Reputation points
    2021-02-12T19:41:44.427+00:00

    Try something like this too:

    HtmlAgilityPack.HtmlDocument doc = . . .

    var h3 = doc.DocumentNode.SelectSingleNode( "//section//h3[@data-auto='Company Summary']" );
    var section = h3.SelectSingleNode( "ancestor::section[position()=1]" );
    // ".//" searches below the context node; a bare "//" would search the whole document.
    var address = section.SelectSingleNode( ".//address[@data-auto='details-contact']" );
    var n = address.SelectSingleNode( ".//div[@class='details-shared-title__main']" );

    // GetDirectInnerText() is not available in HtmlAgilityPack 1.4.9; InnerText works
    // there, but it also includes the text of child elements.
    string piece1 = n.GetDirectInnerText().Trim();
    string piece2 = address.GetDirectInnerText( ).Trim();
    

    If it does not fully work, you can show the HTML as text or give the URL.


  3. moondaddy 881 Reputation points
    2021-02-12T19:58:15.013+00:00

    Thanks @Viorel .

    A few things:

    h3 is null. I don't know if this is the issue, but I have been finding that xpath attribute values sometimes. This is an example:
    67552-image.png

    If I could get past this, I think it would work.

    One other thing, "GetDirectInnerText()" was not accessible or recognized.
    But this seems OK:
    n.InnerText.Trim();

    Any ideas about the h3 problem?
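One thing that may help with a null result: SelectSingleNode returns null whenever nothing matches, and an attribute test such as @data-auto='...' only matches when the literal agrees with the document exactly, including spelling and surrounding whitespace. The sketch below lists the data-auto values that actually occur on h3 elements and then looks one up with whitespace normalized. HeadingLookup and its method names are hypothetical, and the sample HTML is invented.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, not part of the .NET Framework

public static class HeadingLookup
{
    // Lists the data-auto value of every h3 that has one, so the exact text
    // in the document can be compared with the literal used in the XPath.
    public static List<string> DataAutoValues(HtmlDocument doc)
    {
        var values = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//h3[@data-auto]");
        if (nodes == null) return values; // SelectNodes returns null when nothing matches
        foreach (HtmlNode h3 in nodes)
            values.Add(h3.GetAttributeValue("data-auto", ""));
        return values;
    }

    // Finds an h3 by its whitespace-normalized data-auto value.
    // Simplification: 'dataAuto' must not contain a single quote.
    public static HtmlNode FindHeading(HtmlDocument doc, string dataAuto)
    {
        return doc.DocumentNode.SelectSingleNode(
            "//h3[normalize-space(@data-auto)='" + dataAuto + "']");
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<section><h3 data-auto=' Company Summary '>Summary</h3></section>");
        foreach (string value in DataAutoValues(doc))
            Console.WriteLine("[" + value + "]"); // brackets make stray spaces visible
        Console.WriteLine(FindHeading(doc, "Company Summary") != null);
    }
}
```

Checking the result for null before calling SelectSingleNode on it again would also avoid a NullReferenceException on the following line.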