c# and HtmlAgilityPack, need help extracting 2 pieces of text

Question

moondaddy 916

Using VS 2019, .NET 4.8 and HtmlAgilityPack v1.4.9.0. I need help extracting the text shown in the red rectangles in the screenshot below.

The text in the green rectangle is unique in the entire document and therefore can be used as a starting point to find the other 2.

Attached is a text file with all the html for the "section" element for this example.

67408-sample-personname.txt

Thank you for any help you can offer!

Answer 1

moondaddy 916

OK, some success: I got the first one, but I can't seem to find the XPath for the second one:

//h3 [@data-auto='Company Summary']/../following-sibling::div/div[1]/div[1]/div[1]/div[2]/address[1]/div

but need help on the xpath for this one

I thought this would work, but it didn't:
//h3 [@data-auto='Company Summary']/../following-sibling::div/div[1]/div[1]/div[1]/div[2]/address[2]/div/following-sibling

Kyle Wang 5,531 Reputation points Microsoft External Staff

2021-02-15T03:06:22.517+00:00

They are located in different divs under the same address. So why did you use "address[2]"?

Michael Taylor 60,331 Reputation points

2021-02-17T14:29:26.743+00:00

The code I gave in my post works with the XML you posted. It is far simpler than all the div navigation you're trying to do in a single XPath query. It is also easier to read, easier to adjust for missing values, and probably more maintainable.

Answer 2

I'm not convinced that using the anchor element gains you anything. The nodes you want are related to it only as children/siblings, so you'd still have to navigate a little. Assuming this is not part of a page that repeats this HTML many times, I think it is simply easier to query for the exact items you want. For the purposes of an example, I'm going to assume you already have a variable named section that holds the root section node you posted. How you get there is up to you.

var contactName = "";
var contactTitle = "";

// Find the contact information. The leading "." keeps the search inside
// section; a path starting with a bare "//" searches from the document root.
var contactNode = section.SelectSingleNode(".//address[@data-auto='details-contact']");
if (contactNode != null)
{
    // Now find the contact name.
    // Approach 1 would be to use the class path //div[@class='details-shared-tile__main']
    // Approach 2 is to get the contact-name element and work down; since we need this
    //  for the title anyway we'll go that route
    var contactNameNode = contactNode.SelectSingleNode(".//div[@data-auto='details-contact-name'][1]");

    if (contactNameNode != null)
    {
        contactName = contactNameNode.InnerText?.Trim();

        // Get the title, which is a paragraph in the middle of the divs.
        // We rely on the fact that it immediately follows the contact name section.
        // following-sibling::*[1] picks the next element and skips any whitespace
        // text node that NextSibling could return instead.
        var titleNode = contactNameNode.SelectSingleNode("following-sibling::*[1]");
        contactTitle = titleNode?.InnerText?.Trim() ?? "";
    }
}

Note that you could convert this to a function to make it easier to reuse, but ultimately this code relies on what should be unique within the section to find things. It is heavily dependent on the HTML being generated, so if the markup changes then the lookups will have to change too. To get the contact name we first find the contact element (the address). We need it because it is the parent of all the data we want. To find the name we could either search for the CSS class or just assume the name is the first child div. I went the latter route as it is more reliable in this case, given that the CSS class looks like it came from a master/detail webforms page and therefore may change. Once we have the name, the title (which is a paragraph in the middle of the divs) is found by jumping to the next sibling.

This isn't guaranteed to work in all cases so some heuristics may need to be added to try to detect things if you don't find them initially but this should give you a good start.
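
As a rough sketch of the "convert this to a function" idea, the lookup could be wrapped as below. The class name ContactExtractor, the method Extract, and the sample markup are made up for illustration; the real attachment may be laid out differently, so treat the XPath as a starting point rather than a tested answer for the actual page.

```csharp
using System;
using HtmlAgilityPack;

public static class ContactExtractor
{
    // Hypothetical markup, loosely modelled on the attributes mentioned in this thread.
    public const string SampleHtml =
        "<section>" +
        "<div><h3 data-auto='Company Summary'>Company Summary</h3></div>" +
        "<div><address data-auto='details-contact'>" +
        "<div data-auto='details-contact-name'><div>Jane Doe</div></div>" +
        "<p>Chief Executive Officer</p>" +
        "</address></div>" +
        "</section>";

    // Returns the contact name and title; either is "" when its node is missing.
    public static Tuple<string, string> Extract(HtmlNode section)
    {
        // ".//" keeps the query relative to the section node.
        var nameNode = section.SelectSingleNode(
            ".//address[@data-auto='details-contact']//div[@data-auto='details-contact-name']");
        if (nameNode == null)
            return Tuple.Create("", "");

        // The title is assumed to be the first element after the name div.
        var titleNode = nameNode.SelectSingleNode("following-sibling::*[1]");

        var name = HtmlEntity.DeEntitize(nameNode.InnerText).Trim();
        var title = titleNode == null ? "" : HtmlEntity.DeEntitize(titleNode.InnerText).Trim();
        return Tuple.Create(name, title);
    }
}
```

A caller would load the page into an HtmlDocument, select the section node, and read Item1 (name) and Item2 (title) from the result. Returning empty strings instead of throwing keeps the "adjust for missing values" behaviour of the code above.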

moondaddy 916 Reputation points

2021-02-18T04:01:22.583+00:00

Thank you @Michael Taylor for breaking this out and documenting it. Very helpful.

Answer 3

Viorel 122.6K

Try something like this too:

HtmlAgilityPack.HtmlDocument doc = . . .  
  
var h3 = doc.DocumentNode.SelectSingleNode( "//section//h3[@data-auto='Company Sumary']" );  
var section = h3.SelectSingleNode( "ancestor::section[position()=1]" );  
var address = section.SelectSingleNode( ".//address[@data-auto='details-contact']" );
var n = address.SelectSingleNode( ".//div[@class='details-shared-title__main']" );
  
string piece1 = n.GetDirectInnerText().Trim();  
string piece2 = address.GetDirectInnerText( ).Trim();

If it does not fully work, you can show the HTML as text or give the URL.
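
One detail worth keeping in mind with queries like these: in HtmlAgilityPack, an XPath that starts with // is evaluated from the document root even when SelectSingleNode is called on a child node, whereas .// stays inside that node. The sketch below illustrates the difference on a made-up page with two sections; the class name XPathScopeDemo and the markup are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Hypothetical markup: two sections, each with its own address element.
    private const string Html =
        "<html><body>" +
        "<section id='a'><address data-auto='details-contact'>first</address></section>" +
        "<section id='b'><address data-auto='details-contact'>second</address></section>" +
        "</body></html>";

    private static HtmlNode SecondSection()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.DocumentNode.SelectSingleNode("//section[@id='b']");
    }

    // "//" ignores the context node and matches the first address in the whole document.
    public static string Absolute()
    {
        return SecondSection()
            .SelectSingleNode("//address[@data-auto='details-contact']").InnerText;
    }

    // ".//" searches only the descendants of the second section.
    public static string Relative()
    {
        return SecondSection()
            .SelectSingleNode(".//address[@data-auto='details-contact']").InnerText;
    }
}
```

On a page that contains only one such section both forms find the same node, which is probably why the difference does not show up with the sample in this question.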

moondaddy 916 Reputation points

2021-02-18T04:02:08.75+00:00

Thank you @Viorel for making this short and concise.

Answer 4

moondaddy 916

Thanks @Viorel .

A few things:

h3 is null. I don't know if this is the issue, but I have been finding that xpath attribute values sometimes. This is an example:

If I could get past this, I think it would work.

One other thing, "GetDirectInnerText()" was not accessible or recognized.
But this seems OK:
n.InnerText.Trim();

Any ideas about the h3 problem?

Viorel 122.6K Reputation points

2021-02-12T20:01:37.423+00:00

Try "Summary" instead of "Sumary".

Maybe consider the latest version of the package, which can be got using "Manage NuGet Packages" window.

moondaddy 916 Reputation points

2021-02-18T03:48:38.437+00:00

yep, Thanks @Viorel my bad eyes. Thank you
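
For anyone hitting the same null h3: XPath attribute comparisons are exact, so a one-letter typo makes SelectSingleNode return null rather than raise an error. A small guard like the sketch below can make that easier to spot. RequireNode and the NodeGuard class are hypothetical helpers, not part of HtmlAgilityPack, and the markup is invented for the example.

```csharp
using System;
using HtmlAgilityPack;

public static class NodeGuard
{
    // Hypothetical markup containing the heading used as the anchor in this thread.
    public const string SampleHtml =
        "<section><div><h3 data-auto='Company Summary'>Company Summary</h3></div></section>";

    // Returns the first match, or throws with the offending XPath in the message.
    public static HtmlNode RequireNode(HtmlNode root, string xpath)
    {
        var node = root.SelectSingleNode(xpath);
        if (node == null)
            throw new InvalidOperationException("No node matched XPath: " + xpath);
        return node;
    }
}
```

With this in place, a misspelled value such as 'Company Sumary' fails immediately with the query text in the exception, instead of surfacing later as a NullReferenceException on h3.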