Convert HTML text to Plain text in C#

Jeddah 86 Reputation points
2021-10-18T13:17:48.783+00:00

Hello,

I have HTML text stored in a string variable in C#.
I need to convert that text into plain text and remove all the HTML tags.

How can I do that?

Thank you.

Developer technologies Windows Presentation Foundation

Accepted answer
  1. Hui Liu-MSFT 48,676 Reputation points Microsoft External Staff
    2021-10-19T07:07:52.463+00:00

    You could convert HTML text to plain text by using the following HTMLToText method.

    The XAML markup:

    <Grid>
        <RichTextBox>
            <FlowDocument>
                <Paragraph Name="p"></Paragraph>
                <Paragraph Name="p1"></Paragraph>
            </FlowDocument>
        </RichTextBox>
    </Grid>
    

    The code-behind (xaml.cs):

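    using System.Text;                     // StringBuilder
    using System.Text.RegularExpressions;  // Regex
    using System.Windows;                  // Window

    public partial class MainWindow : Window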
      {  
        public MainWindow()  
        {  
          InitializeComponent();  
          string html = "<p>Some text here</p>" +
            "<div>Some more<strong> text</strong></div>";
          p.Inlines.Add(html);  
          p1.Inlines.Add(HTMLToText(html));  
        }  
        public string HTMLToText(string HTMLCode)  
        {  
          // Remove new lines since they are not visible in HTML  
          HTMLCode = HTMLCode.Replace("\n", " ");  
          // Remove tab spaces  
          HTMLCode = HTMLCode.Replace("\t", " ");  
          // Remove multiple white spaces from HTML  
          HTMLCode = Regex.Replace(HTMLCode, "\\s+", " ");  
          // Remove HEAD tag  
          HTMLCode = Regex.Replace(HTMLCode, "<head.*?</head>", ""  
                              , RegexOptions.IgnoreCase | RegexOptions.Singleline);  
          // Remove any JavaScript  
          HTMLCode = Regex.Replace(HTMLCode, "<script.*?</script>", ""  
            , RegexOptions.IgnoreCase | RegexOptions.Singleline);  
          // Mark line breaks (<br>) and paragraphs (<p>) with a new line
          // before the tags themselves are removed
          HTMLCode = Regex.Replace(HTMLCode, "<(br|p)\\b", "\n$0",
            RegexOptions.IgnoreCase);
          // Remove all HTML tags
          HTMLCode = Regex.Replace(HTMLCode, "<[^>]*>", "");
          // Replace special characters like &, <, >, " etc. This is done after
          // the tags are removed so that escaped markup such as &lt;b&gt;
          // stays in the output as literal text instead of being stripped
          StringBuilder sbHTML = new StringBuilder(HTMLCode);
          // Note: There are many more special characters, these are just the
          // most common. You can add new entries to these arrays if needed.
          // "&amp;" stays last so that "&amp;lt;" becomes "&lt;", not "<"
          string[] OldWords = {"&nbsp;", "&quot;", "&lt;", "&gt;", "&reg;",
            "&copy;", "&bull;", "&trade;", "&#39;", "&amp;"};
          string[] NewWords = { " ", "\"", "<", ">", "®", "©", "•", "™", "'", "&" };
          for (int i = 0; i < OldWords.Length; i++)
          {
            sbHTML.Replace(OldWords[i], NewWords[i]);
          }
          // Finally, trim surrounding white space and return plain text
          return sbHTML.ToString().Trim();
        }  
      }  
    

    The result:
    [Image: 141584-image.png]
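
The entity list in the method above is maintained by hand and covers only a few common entities. As a possible alternative, here is a minimal sketch that delegates entity decoding to `System.Net.WebUtility.HtmlDecode` and decodes only after the tags are stripped. The class name `HtmlTextSketch` and the method name `ToPlainText` are hypothetical and not part of the answer above. Like the original, it relies on regular expressions, so it is a rough approximation rather than a real HTML parser.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class HtmlTextSketch
{
    public static string ToPlainText(string html)
    {
        // Drop <head>, <script> and <style> blocks together with their content.
        // \b keeps "<header>" from matching; \1 requires the matching close tag.
        string text = Regex.Replace(html, @"<(head|script|style)\b.*?</\1>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Strip every remaining tag.
        text = Regex.Replace(text, "<[^>]*>", "");

        // Decode entities after stripping, so escaped markup such as
        // &lt;b&gt; survives as literal text.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of white space (in .NET, \s also matches the
        // non-breaking space produced by &nbsp;) and trim the ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

It could be called the same way as `HTMLToText`, for example `p1.Inlines.Add(HtmlTextSketch.ToPlainText(html));`. Note that, unlike the method above, this sketch does not insert new lines for `<br>` or `<p>`.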




3 additional answers

  1. Viorel 122.5K Reputation points
    2021-10-18T15:39:15.267+00:00

    Sometimes you can use code like this (it works only when the HTML is also well-formed XML):

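    // Requires: using System.Xml.Linq;
    string html = "Text1 <table border='1'><tr><td>Text2 <b>Text3<!-- comment --></b></td></tr></table>";
    // Wrap the fragment in a single root element so that it parses as XML
    XDocument d = XDocument.Parse("<r>" + html + "</r>");
    // Casting an element to string returns its concatenated text content
    string text = (string)d.Root;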
    

  2. Jeddah 86 Reputation points
    2021-10-19T06:01:16.003+00:00

    Thank you Viorel-1

    Unfortunately, it did not work and gives this error:

    System.Xml.XmlException: 'Reference to undeclared entity 'nbsp'.
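
This error is expected with the XML approach: XML predefines only five entities (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`), and `&nbsp;` is an HTML entity that an XML parser does not know. A minimal sketch of one possible workaround, assuming `&nbsp;` is the only HTML-specific entity in the input and the markup is otherwise well-formed XML, is to swap it for its numeric character reference before parsing. The names `XmlTextSketch` and `ToPlainText` are hypothetical.

```csharp
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class XmlTextSketch
{
    public static string ToPlainText(string html)
    {
        // Numeric character references are always valid in XML;
        // &#160; is the non-breaking space that &nbsp; stands for.
        string xmlSafe = html.Replace("&nbsp;", "&#160;");

        // Wrap the fragment in a single root element so that it parses.
        XDocument d = XDocument.Parse("<r>" + xmlSafe + "</r>");

        // Casting the root to string yields its concatenated text content.
        return (string)d.Root;
    }
}
```

Other HTML-only entities (`&copy;`, `&reg;`, and so on) and markup that is not well-formed XML, such as an unclosed `<br>`, would still throw, which is why this approach works only sometimes.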


  3. Jeddah 86 Reputation points
    2021-10-19T08:14:10.29+00:00

    Thank you very much HuiLiu-MSFT

    Now it is working fine.

