0 Votes
Jeddah-7584 asked

Convert HTML text to Plain text in C#

Hello,

I have HTML text stored in a string variable in C#.
I need to convert that text into plain text and remove all the HTML tags.

How can I do that?

Thank you.

windows-wpf

1 Vote
HuiLiu-MSFT answered

You could convert HTML text to plain text by using the following HTMLToText method.

The XAML code:

 <Grid>
     <RichTextBox>
         <FlowDocument>
             <Paragraph Name="p"></Paragraph>
             <Paragraph Name="p1"></Paragraph>
         </FlowDocument>
     </RichTextBox>
 </Grid>

The code-behind (xaml.cs):

 using System.Text;
 using System.Text.RegularExpressions;
 using System.Windows;

 public partial class MainWindow : Window
 {
     public MainWindow()
     {
         InitializeComponent();
         string html = "<p>Some text here</p>" +
             "<div>Some more<strong> text</strong></div>";
         p.Inlines.Add(html);
         p1.Inlines.Add(HTMLToText(html));
     }

     public string HTMLToText(string HTMLCode)
     {
         // Collapse new lines, tabs and repeated spaces into single spaces,
         // since HTML does not render them
         HTMLCode = Regex.Replace(HTMLCode, "\\s+", " ");
         // Remove the HEAD element
         HTMLCode = Regex.Replace(HTMLCode, "<head.*?</head>", "",
             RegexOptions.IgnoreCase | RegexOptions.Singleline);
         // Remove any JavaScript
         HTMLCode = Regex.Replace(HTMLCode, "<script.*?</script>", "",
             RegexOptions.IgnoreCase | RegexOptions.Singleline);
         // Start a new line at each line break (<br>) or paragraph (<p>),
         // with or without attributes
         HTMLCode = Regex.Replace(HTMLCode, "<(br|p)\\b", "\n$0",
             RegexOptions.IgnoreCase);
         // Remove all HTML tags before decoding entities, so that escaped
         // markup such as &lt;b&gt; survives as the literal text <b>
         StringBuilder sbHTML = new StringBuilder(
             Regex.Replace(HTMLCode, "<[^>]*>", ""));
         // Replace special characters like &, <, >, " etc.
         // Note: there are many more special characters; these are just the
         // most common. Add new ones to both arrays if needed, keeping
         // &amp; last so that text like &amp;lt; is not decoded twice
         string[] OldWords = { "&nbsp;", "&quot;", "&lt;", "&gt;", "&reg;",
             "&copy;", "&bull;", "&trade;", "&#39;", "&amp;" };
         string[] NewWords = { " ", "\"", "<", ">", "®", "©", "•", "™", "'", "&" };
         for (int i = 0; i < OldWords.Length; i++)
         {
             sbHTML.Replace(OldWords[i], NewWords[i]);
         }
         return sbHTML.ToString();
     }
 }

Picture of the result: 141584-image.png
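
As a possible alternative to the hand-maintained OldWords/NewWords arrays in HTMLToText, the entity decoding step could be delegated to the framework. The sketch below is only an illustration: the class name HtmlEntitySketch and the method StripTagsAndDecode are hypothetical, and it assumes the simple "<[^>]*>" tag pattern from the answer is good enough for your input. WebUtility.HtmlDecode is the standard System.Net method.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class HtmlEntitySketch
{
    // Strips tags first, then decodes named and numeric entities with
    // WebUtility.HtmlDecode instead of a fixed list of replacements.
    public static string StripTagsAndDecode(string html)
    {
        // Same simple tag pattern as in HTMLToText
        string withoutTags = Regex.Replace(html, "<[^>]*>", "");
        // Decoding after tag removal keeps escaped markup as literal text
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

For example, StripTagsAndDecode("<p>Caf&eacute; &amp; bar &lt;b&gt;</p>") returns "Café & bar <b>". Note that HtmlDecode turns &nbsp; into the non-breaking space character (U+00A0) rather than an ordinary space, so replace it afterwards if you need plain spaces.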



0 Votes
Viorel-1 answered (edited)

Sometimes, when the HTML is also well-formed XML, you can use code like this:

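 // Requires: using System.Xml.Linq;
 string html = "Text1 <table border='1'><tr><td>Text2 <b>Text3<!-- comment --></b></td></tr></table>";
 // Wrap the fragment in a single root element so that it parses as one XML document
 XDocument d = XDocument.Parse("<r>" + html + "</r>");
 // Casting an element to string yields its concatenated text content
 string text = (string)d.Root;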
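
XDocument.Parse throws an XmlException when the input is not well-formed XML, for example an unclosed <br> or an HTML-only entity. A minimal sketch of guarding against that follows; the class XmlTextSketch and the method TryGetText are hypothetical names, and returning null on failure is just one possible design.

```csharp
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper class, not part of the original answer.
public static class XmlTextSketch
{
    // Returns the text content if the fragment parses as XML,
    // or null if the parser rejects it.
    public static string TryGetText(string html)
    {
        try
        {
            XDocument d = XDocument.Parse("<r>" + html + "</r>");
            return (string)d.Root;
        }
        catch (XmlException)
        {
            // Not well-formed XML; the caller can fall back to another method
            return null;
        }
    }
}
```

With this sketch, TryGetText("Text1 <b>Text2</b>") returns "Text1 Text2", while TryGetText("Line1<br>Line2") returns null because the <br> tag is never closed.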



0 Votes
Jeddah-7584 answered

Thank you, Viorel-1.

Unfortunately, it did not work; it throws this error:

System.Xml.XmlException: 'Reference to undeclared entity 'nbsp'.
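
This error arises because XML predefines only five named entities (&lt;, &gt;, &amp;, &quot; and &apos;), so the XML parser does not know &nbsp;. One workaround, sketched here under the assumption that &nbsp; is the only HTML-specific entity in the input, is to rewrite it as the numeric reference &#160; before parsing. The class NbspSketch and the method GetText are hypothetical names.

```csharp
using System.Xml.Linq;

// Hypothetical helper class, not part of the original thread.
public static class NbspSketch
{
    public static string GetText(string html)
    {
        // &#160; is the numeric reference for the non-breaking space (U+00A0),
        // which XML accepts without any entity declaration
        string xmlSafe = html.Replace("&nbsp;", "&#160;");
        return (string)XDocument.Parse("<r>" + xmlSafe + "</r>").Root;
    }
}
```

GetText("<p>a&nbsp;b</p>") returns the three characters a, U+00A0, b. Other HTML-only entities such as &copy; would still raise the same exception, so this only covers the case reported here.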


0 Votes
Jeddah-7584 answered

Thank you very much, HuiLiu-MSFT.

It is working fine now.
