Unable to parse input document file

Jacquelin Martineau-Rousseau 135 Reputation points
2025-11-27T15:47:50.1433333+00:00

Hello,

I am having a problem detecting a SIT within a file. Here are some details:

  • docx file
  • contains text + one image
    • the text contains multiple occurrences that are detected by the SIT when used in a .txt file
  • a manual test on the SIT directly gives the following error:
|Microsoft.Exchange.Management.Tasks.ErrorFileHasNoTextContentException|Unable to parse input document file. Make sure the document contains text and isn't encrypted by a password.

This is a regular docx file, no password, no label.

Any ideas?

Microsoft Security | Microsoft Purview

2 answers

  1. Q&A Assist
    2025-11-27T15:48:00.2+00:00
    AI-generated content. This question contains an answer created with AI from Q&A Assist.

    The error message you are encountering, Microsoft.Exchange.Management.Tasks.ErrorFileHasNoTextContentException, indicates that the document may not be recognized as containing text content, even though it is a regular .docx file. Here are some potential reasons and troubleshooting steps:

    1. Check Document Content: Ensure that the text in the document is not embedded in a way that makes it unreadable by the parser (e.g., as an image or within text boxes).
    2. Image Content: Since your document contains an image, verify that the presence of the image is not interfering with the text extraction process. Sometimes, if the text is not in the main body but rather in headers, footers, or text boxes, it may not be detected.
    3. File Integrity: Ensure that the .docx file is not corrupted. Try opening it in Microsoft Word and saving it again to create a fresh copy.
    4. Test with Different Formats: Since you mentioned that the SIT works with a .txt file, consider converting the .docx file to a .txt format and testing it again to see if the issue persists.
    5. Document Size and Complexity: If the document is large or complex, consider simplifying it to see if that resolves the issue.
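    The file-integrity check in step 3 can be partly automated before retesting. What follows is a minimal sketch in Python, assuming only that a .docx is a ZIP package whose main body is stored in word/document.xml. The function name check_docx_integrity is hypothetical, and passing these checks does not guarantee that Purview's extractor will accept the file. A password-encrypted Office file is generally not a ZIP archive at all, so it would be reported by the first check.

```python
import zipfile
import xml.etree.ElementTree as ET

MAIN_PART = "word/document.xml"  # conventional location of the main body in a .docx


def check_docx_integrity(source):
    """Return a list of problems; an empty list means these basic checks passed.

    `source` may be a file path or a seekable binary file-like object.
    This is an illustrative helper, not part of any Microsoft tooling.
    """
    if not zipfile.is_zipfile(source):
        return ["not a ZIP container (a .docx must be one)"]

    problems = []
    with zipfile.ZipFile(source) as package:
        # testzip() returns the name of the first member with a bad CRC, or None.
        bad_member = package.testzip()
        if bad_member is not None:
            problems.append(f"corrupt member: {bad_member}")

        if MAIN_PART not in package.namelist():
            problems.append(f"missing {MAIN_PART}")
        else:
            try:
                ET.fromstring(package.read(MAIN_PART))
            except ET.ParseError as exc:
                problems.append(f"{MAIN_PART} is not well-formed XML: {exc}")
            except zipfile.BadZipFile as exc:
                problems.append(f"{MAIN_PART} could not be read: {exc}")
    return problems
```

    Calling check_docx_integrity("report.docx") (a hypothetical file name) on the failing file and on a freshly saved copy shows whether re-saving changed anything at the package level.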

    If these steps do not resolve the issue, you may want to consult further documentation on sensitive information types and their requirements for processing.



  2. VRISHABHANATH PATIL 2,070 Reputation points Microsoft External Staff Moderator
    2025-11-28T04:41:10.98+00:00

    Hi @Jacquelin Martineau-Rousseau

    Thank you for contacting Microsoft Q&A. Please find below the detailed steps to address the reported issue.

    The error you reported is:

    Microsoft.Exchange.Management.Tasks.ErrorFileHasNoTextContentException

    Unable to parse input document file. Make sure the document contains text and isn't encrypted by a password.

    This error does not mean your Sensitive Information Type (SIT) is wrong. The SIT works fine: you proved that by testing with a .txt file and a clean .docx without the image. The problem is that the original Word file contains something (most likely the embedded image or another embedded object) that breaks Purview’s text extraction process. When the parser can’t read the text, it throws this error.

    Why this happens

    • Corrupt or complex OOXML parts: Certain embedded objects (like EMF/WMF/SVG images or OLE packages) can make the document structure invalid for the extractor.
    • Text inside shapes or controls: If your text is inside text boxes or grouped shapes, Purview may not treat it as normal text.
    • File integrity issues: Large or malformed files sometimes hit parsing limits.

    Microsoft has acknowledged similar behaviors in their forums and guidance.

    How to fix it

    Rebuild the Word file

    -- Open the file in Word → File > Info > Convert (if it shows Compatibility Mode).

    -- Then Save As a fresh .docx.

    -- If Word prompts for repair, choose Open and Repair.

    -- Test again in Purview.

    Normalize the image

    -- Replace the current image with a PNG or JPEG.

    -- Avoid EMF/WMF/SVG or embedded OLE objects.

    -- Reinsert the image and retest.
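    To see which images or embedded objects are actually in the file before replacing anything, the package can be listed directly. This is a minimal sketch, assuming the document follows Word's usual layout (images under word/media/, embedded objects under word/embeddings/); those folder names are a convention rather than a guarantee, and find_risky_parts is a hypothetical helper name.

```python
import posixpath
import zipfile

# Image formats the steps above suggest avoiding.
RISKY_IMAGE_EXTENSIONS = {".emf", ".wmf", ".svg"}


def find_risky_parts(source):
    """List vector images and embedded objects in a .docx package.

    `source` may be a file path or a seekable binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        # Skip directory entries, which end with a slash.
        names = [n for n in package.namelist() if not n.endswith("/")]

    risky_images = sorted(
        n for n in names
        if n.startswith("word/media/")
        and posixpath.splitext(n)[1].lower() in RISKY_IMAGE_EXTENSIONS
    )
    embedded_objects = sorted(n for n in names if n.startswith("word/embeddings/"))
    return {"risky_images": risky_images, "embedded_objects": embedded_objects}
```

    If both lists come back empty, the image format is probably not the cause and the other steps are worth trying first.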

    Move text out of shapes

    -- If text is inside text boxes or grouped shapes, copy it into the main body of the document.
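    Whether the text sits in the main flow or inside text boxes can be checked without opening Word. The sketch below is an illustration under assumptions: it reads word/document.xml, treats any w:t text nested under a w:txbxContent element as text-box text, and counts characters. The helper name count_text_by_location is hypothetical. Word often stores a text box twice (a drawing plus a fallback copy), so the text-box figure may be double-counted; read it as an indicator, not an exact count.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TEXT_TAG = f"{{{W_NS}}}t"                # w:t, a run of text
TEXTBOX_TAG = f"{{{W_NS}}}txbxContent"   # w:txbxContent, the content of a text box


def count_text_by_location(source):
    """Count characters of w:t text in the main flow vs. inside text boxes."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    counts = {"body": 0, "text_box": 0}

    def walk(element, in_text_box):
        # Once inside a text box, every descendant counts as text-box text.
        in_text_box = in_text_box or element.tag == TEXTBOX_TAG
        if element.tag == TEXT_TAG and element.text:
            counts["text_box" if in_text_box else "body"] += len(element.text)
        for child in element:
            walk(child, in_text_box)

    walk(root, False)
    return counts
```

    A result with a small "body" count and a large "text_box" count suggests the content the SIT should match is mostly inside shapes.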

    Reduce complexity

    -- Remove unnecessary headers/footers, tracked changes, or embedded items.

    -- Compress large images.
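    A quick inventory can show how much of this complexity a given file carries. This is a rough sketch, assuming Word's usual part names (word/header1.xml, word/footer1.xml, word/media/...) and counting only w:ins and w:del elements as tracked changes; summarize_complexity is a hypothetical helper, and the numbers say nothing about any actual Purview limit.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TRACKED_TAGS = {f"{{{W_NS}}}ins", f"{{{W_NS}}}del"}  # tracked insertions and deletions
HEADER_FOOTER = re.compile(r"word/(header|footer)\d*\.xml")


def summarize_complexity(source):
    """Summarize headers/footers, tracked changes, and image size in a .docx."""
    with zipfile.ZipFile(source) as package:
        infos = package.infolist()
        root = ET.fromstring(package.read("word/document.xml"))

    header_footer_parts = sum(
        1 for info in infos if HEADER_FOOTER.fullmatch(info.filename)
    )
    tracked_changes = sum(1 for element in root.iter() if element.tag in TRACKED_TAGS)
    # file_size is the uncompressed size of each part in bytes.
    media_sizes = [
        info.file_size for info in infos if info.filename.startswith("word/media/")
    ]
    return {
        "header_footer_parts": header_footer_parts,
        "tracked_changes": tracked_changes,
        "largest_media_bytes": max(media_sizes, default=0),
    }
```

    Comparing the summary before and after simplifying the document confirms that the cleanup actually removed what it was meant to remove.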

    Avoid this in the future

    • Insert images as PNG/JPEG, not EMF/SVG/OLE.
    • Keep text in normal paragraphs, not shapes or controls.
    • Enable DLP predicates like “Document couldn’t be scanned” so users get clear feedback when parsing fails.

    Suggested customer-facing explanation

    The issue wasn’t with your SIT — it was with the document structure. The embedded image caused Purview to fail when extracting text. We rebuilt the file and replaced the image with a standard format, and now SIT detection works as expected. Going forward, use PNG/JPEG for images and keep text in the main body to avoid similar issues.

