Issue with DocumentFormat.OpenXml reading docX file

Dhananjay Siwach 21 Reputation points

Hi,

I am using DocumentFormat.OpenXml to read content from a .docx file in C#.
My problem is with paragraph.InnerText: it returns " TOC \o \"1-2\" \h \z \u 1.Introduction PAGEREF _Toc294041589 \h 4", but I need only the content, without the heading. How can I achieve this?

My Code

StringBuilder stringBuilder = new StringBuilder();

using (Package wordPackage = Package.Open(filePath, FileMode.Open, FileAccess.Read))
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(wordPackage))
{
    IEnumerable<Paragraph> paragraphs = wordDocument.MainDocumentPart.Document.Body.Elements<Paragraph>();

    foreach (var paragraph in paragraphs)
        stringBuilder.Append(paragraph.InnerText + "\r\n");
}

string content = stringBuilder.ToString();

1 answer

  1. Alberto Poblacion 1,551 Reputation points

    InnerText concatenates the text of every XML element inside the paragraph, including field instructions such as TOC and PAGEREF. That's not what you want.
    Instead, enumerate the Run elements of the Paragraph and, for each Run, its Text elements, and take the text from those.

    To enumerate the Runs, you would use a loop similar to this:
    foreach (var run in paragraph.Elements<Run>())

    A similar loop over run.Elements<Text>() then gives you each piece of text.
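
    Putting the two loops together, a minimal sketch might look like the following. The class name DocxTextReader and the method name ReadBodyText are hypothetical, chosen only for illustration, and the sketch assumes the document has a main document part with a body.

    ```csharp
    using System.Text;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    // Hypothetical helper class, not part of the Open XML SDK.
    static class DocxTextReader
    {
        public static string ReadBodyText(string filePath)
        {
            var stringBuilder = new StringBuilder();

            // The second argument (isEditable: false) opens the file read-only.
            using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(filePath, false))
            {
                // Assumption: MainDocumentPart, Document and Body are all present.
                Body body = wordDocument.MainDocumentPart.Document.Body;

                foreach (Paragraph paragraph in body.Elements<Paragraph>())
                {
                    foreach (Run run in paragraph.Elements<Run>())
                    {
                        // Only Text (w:t) elements are read here, so field
                        // instructions (w:instrText) such as TOC and PAGEREF are skipped.
                        foreach (Text text in run.Elements<Text>())
                        {
                            stringBuilder.Append(text.Text);
                        }
                    }

                    // One output line per paragraph, as in the question's code.
                    stringBuilder.Append("\r\n");
                }
            }

            return stringBuilder.ToString();
        }
    }
    ```

    Two caveats, as far as I can tell: paragraph.Elements<Run>() returns only the runs that are direct children of the paragraph, so runs nested inside a hyperlink (common in table-of-contents entries) are missed, and paragraph.Descendants<Text>() may be the better choice if you want those too. Also, the displayed result of a field (the entry title and page number) is stored in ordinary Text elements, so it still appears in the output; only the field instructions are dropped.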

    For more information, see the documentation for the Run class.