How to get the right word count of a .docx's .xml?

UN 20 Reputation points
2023-05-10T10:02:39.1533333+00:00

Hello, I'm trying to keep track of the word count of my documents. In the sample document, counting gives something like "154 886", but the value stored in the .xml is something like "171 725". These results stay the same after multiple saves, and also after checking and unchecking the checkbox in the counting dialog. I would like the .xml value to match the first one, as it's similar to the word count LibreOffice saves in the .xml, and this way I won't get jumps depending on which program last saved a given document. What can I do? Thank you.

Microsoft 365 and Office | Word | For business | Windows
Developer technologies | C#

Accepted answer
  1. Anonymous
    2023-05-11T02:47:06.96+00:00

    Hi @UN, welcome to Microsoft Q&A.

    I tested this by renaming the .docx file to .zip and then extracting it to get the word\document.xml file.
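
    The rename-and-extract step can also be done in code, since a .docx is a zip package. The following is a minimal sketch, assuming .NET's System.IO.Compression is available; the names DocxParts, LoadPart and ReadStoredWordCount are hypothetical and only used for illustration. It also reads the Words element from docProps/app.xml, which is where the saving application normally stores its count, so the stored value can be compared with a computed one.

    ```csharp
    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Xml;

    public static class DocxParts
    {
        // Loads one part of the package (for example "word/document.xml") into an XmlDocument.
        public static XmlDocument LoadPart(string docxPath, string partName)
        {
            using ZipArchive archive = ZipFile.OpenRead(docxPath);
            ZipArchiveEntry entry = archive.GetEntry(partName)
                ?? throw new FileNotFoundException($"Part '{partName}' not found in {docxPath}");
            using Stream stream = entry.Open();
            var doc = new XmlDocument();
            doc.Load(stream);
            return doc;
        }

        // Returns the word count stored in docProps/app.xml by the last saving application,
        // or null if the Words element is missing or is not a number.
        public static int? ReadStoredWordCount(string docxPath)
        {
            XmlDocument app = LoadPart(docxPath, "docProps/app.xml");
            var ns = new XmlNamespaceManager(app.NameTable);
            ns.AddNamespace("ep", "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties");
            XmlNode words = app.SelectSingleNode("/ep:Properties/ep:Words", ns);
            return words != null && int.TryParse(words.InnerText, out int n) ? n : (int?)null;
        }
    }
    ```

    With this, `DocxParts.LoadPart(path, "word/document.xml")` gives the same XML that the manual extraction produces, without touching the original file.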

    First, get the <w:body> node by passing the //w:body path to the SelectSingleNode() method. Next, call SelectNodes() on bodyNode to select all <w:p> nodes under it; each of these represents a paragraph. Then iterate over the paragraph nodes: read each paragraph's text from its InnerText property, split the string on spaces, tabs and newlines, count the resulting words, and add that number to the running total, wordCount. Note that InnerText returns all text inside the paragraph element, concatenated.

    // See https://aka.ms/new-console-template for more information
    using System;
    using System.Xml;

    // Path to the document.xml file extracted from the .docx package.
    string xmlFilePath = @"C:\Users\Administrator\Desktop\wordCount\word\document.xml";
    int wordCount = GetWordCountFromXml(xmlFilePath);
    Console.WriteLine($"Word count: {wordCount}");

    // Counts whitespace-separated tokens in every paragraph of the document body.
    static int GetWordCountFromXml(string xmlFilePath)
    {
        int wordCount = 0;
        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.Load(xmlFilePath);

        // The "w" prefix must be mapped to the WordprocessingML namespace for the XPath queries.
        XmlNamespaceManager namespaceManager = new XmlNamespaceManager(xmlDoc.NameTable);
        namespaceManager.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

        XmlNode bodyNode = xmlDoc.SelectSingleNode("//w:body", namespaceManager);
        if (bodyNode != null)
        {
            // Every <w:p> element under the body is one paragraph.
            XmlNodeList paragraphNodes = bodyNode.SelectNodes(".//w:p", namespaceManager);
            foreach (XmlNode paragraphNode in paragraphNodes)
            {
                // InnerText concatenates all text found inside the paragraph element.
                string paragraphText = paragraphNode.InnerText.Trim();
                string[] words = paragraphText.Split(new char[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
                wordCount += words.Length;
            }
        }

        return wordCount;
    }
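
    One possible source of differences between counts is that InnerText includes every text node in a paragraph, including field instructions (<w:instrText>) and tracked deletions (<w:delText>). The following is a hedged variant, not Word's own counting algorithm, that only collects run text from <w:t> elements and treats <w:tab> and <w:br> as separators. The class name VisibleTextWordCounter is hypothetical. Paragraphs nested inside other paragraphs (for example in text boxes) are still visited more than once in this sketch.

    ```csharp
    using System;
    using System.Text;
    using System.Xml;

    public static class VisibleTextWordCounter
    {
        private const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        // Counts whitespace-separated tokens using only <w:t> text in each body paragraph.
        public static int CountWords(XmlDocument documentXml)
        {
            var ns = new XmlNamespaceManager(documentXml.NameTable);
            ns.AddNamespace("w", W);

            int total = 0;
            foreach (XmlNode paragraph in documentXml.SelectNodes("//w:body//w:p", ns))
            {
                var text = new StringBuilder();
                // Walk descendants in document order; runs are joined without a separator
                // because a single word can be split across several runs.
                foreach (XmlNode node in paragraph.SelectNodes(".//*[self::w:t or self::w:tab or self::w:br]", ns))
                {
                    text.Append(node.LocalName == "t" ? node.InnerText : " ");
                }
                // A null separator array makes Split use all whitespace characters.
                total += text.ToString().Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
            }
            return total;
        }
    }
    ```

    Comparing this result with the one from GetWordCountFromXml on the same document.xml shows how much of the difference comes from non-visible text.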
    


    Best Regards,

    Jiale


