HTML Tags parse using SQL

Ranga Reddy 1 Reputation point
2022-01-13T10:16:05.453+00:00

Below is the sample Input
<p>
<strong>
<em>AAAA</em>
</strong>
</p>
<ul>
<li>#2 </li>
<li>#3 </li>
<li>#6 </li>
</ul>
<p>
<strong>
<em>
<br />BBBBB:</em>
</strong>
</p>
<ul>
<li>Best 295 Business Schools</li>
<li>rrrr</li>
</li>
</ul>
<p>
I need output in below format
164678-image.png

Developer technologies Transact-SQL
SQL Server Other
{count} votes

3 answers

Sort by: Most helpful
  1. Olaf Helper 47,436 Reputation points
    2022-01-13T11:08:46.907+00:00

    I need output in below format

    In SQL? I would say close to impossible. HTML is a markup language for text presentation & formating, it's not structured data.

    0 comments No comments

  2. Yitzhak Khabinsky 26,586 Reputation points
    2022-01-13T16:57:58.127+00:00

    Hi @Ranga Reddy ,

    HTML could be a legitimate (X)HTML.
    In such case, it is very easy to process in T-SQL via its XQuery support.

    Check it out.

    T-SQL

    DECLARE @input XML =   
    N'<p>  
         <strong>  
         <em>AAAA</em>  
         </strong>  
     </p>  
     <ul>  
         <li>#2 </li>  
         <li>#3 </li>  
         <li>#6 </li>  
     </ul>  
     <p>  
         <strong>  
         <em>  
         <br />BBBBB:</em>  
         </strong>  
     </p>  
     <ul>  
         <li>Best 295 Business Schools</li>  
         <li>rrrr</li>  
     </ul>';  
      
    SELECT seq  
    	, p.value('(strong/em/text())[1]', 'VARCHAR(100)') AS col1  
    	, li.value('text()[1]', 'VARCHAR(100)') AS col2  
    FROM @input.nodes('/p') AS t1(p)  
    	CROSS APPLY (  
    		SELECT t1.p.value('let $i := . return count(/p[. << $i]) + 1', 'INT')  
    	) AS t2(seq)  
    	OUTER APPLY @input.nodes('/ul[sql:column("t2.seq")]/li') AS t3(li);  
    

    Output

    +-----+--------+---------------------------+  
    | seq |  col1  |           col2            |  
    +-----+--------+---------------------------+  
    |   1 | AAAA   | #2                        |  
    |   1 | AAAA   | #3                        |  
    |   1 | AAAA   | #6                        |  
    |   2 | BBBBB: | Best 295 Business Schools |  
    |   2 | BBBBB: | rrrr                      |  
    +-----+--------+---------------------------+  
    

  3. Erland Sommarskog 121.4K Reputation points MVP Volunteer Moderator
    2022-01-13T22:46:41.58+00:00

    HTML could be a legitimate (X)HTML.
    In such case, it is very easy to process in T-SQL via its XQuery support.

    Indeed. Hopefully that is the case for Ranga.

    However, HTML is more lax than XML. XML requires all tags to be closed. In the above sample, you have <br />, which is good. But HTML permits <br> only. And if your HTML looks like that, you cannot convert it to the XML data type. So how you do parse it in that case? Not in T-SQL, that much is clear. If you want to do it from SQL Server, you would have to use the CLR, or call a script in Python or similar if you have the Machine Learning Extensions installed. (For Python, you need SQL 2017 or later.)

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.