HTML Tags parse using SQL

Ranga Reddy 1 Reputation point
2022-01-13T10:16:05.453+00:00

Below is the sample Input
<p>
<strong>
<em>AAAA</em>
</strong>
</p>
<ul>
<li>#2 </li>
<li>#3 </li>
<li>#6 </li>
</ul>
<p>
<strong>
<em>
<br />BBBBB:</em>
</strong>
</p>
<ul>
<li>Best 295 Business Schools</li>
<li>rrrr</li>
</li>
</ul>
<p>
I need output in below format
164678-image.png

Developer technologies | Transact-SQL
Developer technologies | Transact-SQL
A Microsoft extension to the ANSI SQL language that includes procedural programming, local variables, and various support functions.
SQL Server | Other
{count} votes

3 answers

Sort by: Most helpful
  1. Olaf Helper 47,586 Reputation points
    2022-01-13T11:08:46.907+00:00

    I need output in below format

    In SQL? I would say close to impossible. HTML is a markup language for text presentation & formating, it's not structured data.

    0 comments No comments

  2. Yitzhak Khabinsky 27,091 Reputation points
    2022-01-13T16:57:58.127+00:00

    Hi @Ranga Reddy ,

    HTML could be a legitimate (X)HTML.
    In such case, it is very easy to process in T-SQL via its XQuery support.

    Check it out.

    T-SQL

    DECLARE @input XML =   
    N'<p>  
         <strong>  
         <em>AAAA</em>  
         </strong>  
     </p>  
     <ul>  
         <li>#2 </li>  
         <li>#3 </li>  
         <li>#6 </li>  
     </ul>  
     <p>  
         <strong>  
         <em>  
         <br />BBBBB:</em>  
         </strong>  
     </p>  
     <ul>  
         <li>Best 295 Business Schools</li>  
         <li>rrrr</li>  
     </ul>';  
      
    SELECT seq  
    	, p.value('(strong/em/text())[1]', 'VARCHAR(100)') AS col1  
    	, li.value('text()[1]', 'VARCHAR(100)') AS col2  
    FROM @input.nodes('/p') AS t1(p)  
    	CROSS APPLY (  
    		SELECT t1.p.value('let $i := . return count(/p[. << $i]) + 1', 'INT')  
    	) AS t2(seq)  
    	OUTER APPLY @input.nodes('/ul[sql:column("t2.seq")]/li') AS t3(li);  
    

    Output

    +-----+--------+---------------------------+  
    | seq |  col1  |           col2            |  
    +-----+--------+---------------------------+  
    |   1 | AAAA   | #2                        |  
    |   1 | AAAA   | #3                        |  
    |   1 | AAAA   | #6                        |  
    |   2 | BBBBB: | Best 295 Business Schools |  
    |   2 | BBBBB: | rrrr                      |  
    +-----+--------+---------------------------+  
    

  3. Erland Sommarskog 128.7K Reputation points MVP Volunteer Moderator
    2022-01-13T22:46:41.58+00:00

    HTML could be a legitimate (X)HTML.
    In such case, it is very easy to process in T-SQL via its XQuery support.

    Indeed. Hopefully that is the case for Ranga.

    However, HTML is more lax than XML. XML requires all tags to be closed. In the above sample, you have <br />, which is good. But HTML permits <br> only. And if your HTML looks like that, you cannot convert it to the XML data type. So how you do parse it in that case? Not in T-SQL, that much is clear. If you want to do it from SQL Server, you would have to use the CLR, or call a script in Python or similar if you have the Machine Learning Extensions installed. (For Python, you need SQL 2017 or later.)

    0 comments No comments

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.