Issue reading a .docx file with DocumentFormat.OpenXml

Dhananjay Siwach 26 Reputation points
2021-04-30T11:47:46.617+00:00

Hi,

I am using DocumentFormat.OpenXml to read content from a .docx file in ASP.NET (C#).
The problem is that paragraph.InnerText returns " TOC \o \"1-2\" \h \z \u 1.Introduction PAGEREF _Toc294041589 \h 4", but I need only the content, without the heading. How can I achieve this?

My code:


// Dispose the package as well as the document once reading is done.
using (Package wordPackage = Package.Open(filePath, FileMode.Open, FileAccess.Read))
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(wordPackage))
{
    StringBuilder stringBuilder = new StringBuilder();

    IEnumerable<Paragraph> paragraphs = wordDocument.MainDocumentPart.Document.Body.Elements<Paragraph>();

    foreach (var paragraph in paragraphs)
    {
        // InnerText is where the unwanted TOC and PAGEREF text shows up.
        Console.WriteLine(paragraph.InnerText);
        stringBuilder.Append(paragraph.InnerText + "\r\n");
    }
    string content = stringBuilder.ToString();
}

1 answer

  1. Alberto Poblacion 1,571 Reputation points
    2021-05-01T18:22:20.337+00:00

    InnerText concatenates the text of every XML element inside the paragraph, including field instructions such as TOC and PAGEREF. That's not what you want.
    Instead, enumerate the Run elements of the Paragraph and, for each Run, its Text elements, and take the text from those.

    To enumerate the Runs, use a loop similar to this:
    foreach (var run in paragraph.Elements<Run>())

    A similar loop over run.Elements<Text>() enumerates the Text elements of each run.

    For more information, see the documentation for the Run class.
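
The two nested loops described in the answer can be put together roughly as follows. This is a minimal sketch rather than a definitive implementation: the class name DocxTextReader and the methods GetVisibleText and ReadBodyText are hypothetical names chosen for illustration, and the code assumes the DocumentFormat.OpenXml package is referenced and that the file is a well-formed .docx.

```csharp
using System;
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class DocxTextReader
{
    // Hypothetical helper: collects only the Text (w:t) elements that are
    // direct children of the paragraph's direct Run children. Field
    // instructions are stored in FieldCode (w:instrText) elements, so they
    // are skipped.
    public static string GetVisibleText(Paragraph paragraph)
    {
        var builder = new StringBuilder();
        foreach (var run in paragraph.Elements<Run>())
        {
            foreach (var text in run.Elements<Text>())
            {
                builder.Append(text.Text);
            }
        }
        return builder.ToString();
    }

    // Hypothetical helper: reads every top-level paragraph of the body,
    // one paragraph per line.
    public static string ReadBodyText(string filePath)
    {
        var builder = new StringBuilder();
        // Open read-only; the using block disposes the document.
        using (var document = WordprocessingDocument.Open(filePath, false))
        {
            var body = document.MainDocumentPart?.Document?.Body;
            if (body == null)
            {
                return string.Empty;
            }
            foreach (var paragraph in body.Elements<Paragraph>())
            {
                builder.AppendLine(GetVisibleText(paragraph));
            }
        }
        return builder.ToString();
    }
}
```

Two caveats, both depending on how the document was produced. Runs nested inside other elements, such as hyperlinks, are not direct children of the paragraph, so Elements<Run>() skips them; paragraph.Descendants<Text>() reaches them if they are needed. Also, the cached result of a field (for example the entry title and page number in a table of contents) is stored in ordinary Text elements, so it can still appear in the output even though the field instruction itself does not.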

