
SharePoint OCR inconsistently processing scanned documents with identical format and ingestion process

George | Workscene 0 Reputation points
2026-04-23T01:30:36.68+00:00

We are experiencing inconsistent OCR behaviour in SharePoint Online for scanned PDF documents.

All files follow the same ingestion process (MFP scan → FTP to server → synced SharePoint document library with OCR enabled) and use the same PDF format and scan settings. Some documents are successfully OCR’d and searchable, while others upload normally but do not generate OCR text.

There are no upload errors, and affected PDFs appear normal when previewed. Library configuration is consistent, and OCR works for some documents in the same location.

In short, some scanned PDF documents processed through this workflow become searchable in SharePoint, while others do not, with no clear reason.

Please advise on:

  • Known conditions where SharePoint skips OCR for PDFs
  • Whether OCR can fail silently during indexing
  • What diagnostics or logs are available to determine why OCR is not applied
  • Recommended best practices to ensure consistent OCR for scanned PDFs
Microsoft 365 and Office | SharePoint | For business | Windows

1 answer

  1. Jade Ng 11,670 Reputation points Microsoft External Staff Moderator
    2026-04-23T02:35:01.4866667+00:00

    Dear @George | Workscene,

    Thank you for taking the time to share the details of your observation. I understand how important it is for scanned documents to be consistently searchable in SharePoint, especially when the same scanning and upload process is used.

    Based on Microsoft’s documented behavior for SharePoint Online, Optical Character Recognition (OCR) is processed as a background service after files are uploaded, and it may not always be applied consistently to all scanned PDF documents even if they appear identical and are uploaded to the same document library.

    Regarding your inquiries, please kindly allow me to clarify some information:

    Known conditions where SharePoint skips OCR for PDFs

    1. PDF structure differences (even with identical scan settings)

    Microsoft OCR is applied only to image‑based or hybrid PDFs. However, two PDFs created with the same scanner profile can still differ internally, which affects OCR eligibility:

    • Image-only PDFs → eligible for OCR
    • Hybrid PDFs (image + text layer) → eligible for OCR
    • Vector-rendered or flattened text (looks like text but is stored as drawing paths) → not OCR’d
    • Partially image-only PDFs → OCR may succeed on some pages and fail on others
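
To compare a PDF that was OCR’d with one that was not, a rough local check of the internal structure may help. The sketch below is only a heuristic and is not how SharePoint decides eligibility: it assumes that font resources (`/Font`) indicate a text layer and that `/Subtype /Image` indicates scanned page images, and it only looks inside streams that inflate with zlib. The function name `classify_pdf` is hypothetical.

```python
import re
import zlib

# Capture the raw bytes between "stream" and "endstream" for each PDF stream.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def _chunks(pdf_bytes):
    """Yield the raw file, then every stream that inflates with zlib."""
    yield pdf_bytes
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the newline left before "endstream".
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate-compressed (for example JPEG image data)


def classify_pdf(pdf_bytes):
    """Return 'image-only', 'hybrid', or 'no-images' (heuristic only)."""
    has_font = False
    has_image = False
    for chunk in _chunks(pdf_bytes):
        has_font = has_font or b"/Font" in chunk
        has_image = has_image or re.search(rb"/Subtype\s*/Image", chunk) is not None
    if has_image and not has_font:
        return "image-only"
    if has_image:
        return "hybrid"
    return "no-images"
```

For example, `classify_pdf(open("scan.pdf", "rb").read())` (the path is a placeholder) can be run on one searchable and one non-searchable file from the same scanner. If the two files come back with different labels, the scanner output is not as identical as it appears. A dedicated PDF library will give more reliable results than this byte-level check.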

    2. File size, resolution, and text extraction limits

    • Maximum PDF size: 50 MB
    • Images must be at least 50 x 50 pixels and not larger than 16,000 x 16,000 pixels.
    • Character extraction cap: OCR extracts and indexes up to ~2 million characters per file
    • Resolution constraints: Extremely low DPI pages may be skipped
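
The size and dimension limits above can be checked before upload. This is a minimal sketch, assuming "50 MB" means 50 × 1024 × 1024 bytes and that you already know the pixel dimensions of each page image (for example from the scanner profile). The name `preflight` and its parameters are hypothetical.

```python
MAX_PDF_BYTES = 50 * 1024 * 1024   # assumption: 50 MB counted in binary megabytes
MIN_SIDE_PX = 50                   # smallest allowed image side
MAX_SIDE_PX = 16000                # largest allowed image side


def preflight(size_bytes, page_images):
    """Return a list of human-readable problems; an empty list means no limit is hit.

    page_images is a list of (width_px, height_px) tuples, one per page image.
    """
    problems = []
    if size_bytes > MAX_PDF_BYTES:
        problems.append(f"file is {size_bytes} bytes, above the 50 MB limit")
    for page, (width, height) in enumerate(page_images, start=1):
        if min(width, height) < MIN_SIDE_PX:
            problems.append(f"page {page}: {width}x{height} is below 50 x 50 pixels")
        elif max(width, height) > MAX_SIDE_PX:
            problems.append(f"page {page}: {width}x{height} is above 16,000 x 16,000 pixels")
    return problems
```

Running this on the FTP server before the sync step would let you set aside files that exceed a documented limit instead of discovering later that they are not searchable.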

    3. Site and library placement limitations

    OCR is applied only in supported SharePoint locations: root sites, hub sites, and associated sites. Files uploaded or synced into unsupported scopes will never receive OCR, even though the library settings appear identical.

    4. Encrypted or restricted PDFs

    If a PDF is password-protected or was generated by the scanner with embedded restrictions, OCR is skipped.
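
A quick way to spot such files locally is to look for an `/Encrypt` entry, which encrypted PDFs carry in their trailer. The sketch below is a heuristic under that assumption: a plain byte search can produce false positives and does not distinguish password protection from permission restrictions. The helper names are hypothetical.

```python
from pathlib import Path


def looks_encrypted(pdf_bytes):
    """Heuristic: encrypted PDFs reference an /Encrypt dictionary in the trailer."""
    return b"/Encrypt" in pdf_bytes


def find_encrypted(folder):
    """Return the names of PDFs in a folder that appear to be encrypted."""
    return sorted(
        path.name
        for path in Path(folder).glob("*.pdf")
        if looks_encrypted(path.read_bytes())
    )
```

Pointing `find_encrypted` at the folder the MFP writes to would show whether the affected documents share this property.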

    Whether OCR can fail silently during indexing

    Yes, and this is expected behavior.

    • OCR runs as asynchronous post-processing
    • No user-visible error, warning, or status is generated when OCR fails or is skipped
    • Search indexing proceeds, but without OCR text

    What diagnostics or logs are available to determine why OCR is not applied

    • No tenant-facing OCR logs
    • No per-file OCR success/failure status
    • No retry control or manual trigger

    The only practical verification is functional testing (searching for known text strings) or checking whether the Extracted text metadata exists. The latter applies to images only; PDFs do not expose this field.
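
The functional test can be scripted so that each scanned file is checked the same way. The following sketch uses the Microsoft Graph search endpoint (`POST /search/query` with the `driveItem` entity type). It assumes you already hold an access token with permission to search the library, that the `filename:` KQL restriction is available in your tenant, and that the response follows the `value` → `hitsContainers` → `hits` → `resource` layout. The function names are hypothetical.

```python
import json
import urllib.request

GRAPH_SEARCH_URL = "https://graph.microsoft.com/v1.0/search/query"


def build_search_body(phrase, file_name):
    """Build a request that looks for a known phrase inside one specific file."""
    kql = f'"{phrase}" AND filename:"{file_name}"'
    return {
        "requests": [
            {"entityTypes": ["driveItem"], "query": {"queryString": kql}}
        ]
    }


def file_in_hits(response, file_name):
    """Return True if any hit in the search response is the expected file."""
    for result in response.get("value", []):
        for container in result.get("hitsContainers", []):
            for hit in container.get("hits", []):
                if hit.get("resource", {}).get("name") == file_name:
                    return True
    return False


def run_search(token, phrase, file_name):
    """Send the query; token acquisition is outside the scope of this sketch."""
    request = urllib.request.Request(
        GRAPH_SEARCH_URL,
        data=json.dumps(build_search_body(phrase, file_name)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return file_in_hits(json.load(reply), file_name)
```

Choose a phrase that appears only in the scanned page content and not in the file name or metadata, otherwise a hit does not prove that OCR text was indexed.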

    Recommended best practices to ensure consistent OCR for scanned PDFs

    1. Perform OCR before uploading to SharePoint

    • Microsoft Search works most reliably when PDFs already contain a text layer.
    • Best practice is to apply OCR at the scanner/MFP stage or via a dedicated OCR tool (e.g. Adobe).

    2. Standardize scan settings

    Use consistent scan configurations: ≥300 DPI, grayscale or black‑and‑white, no PDF encryption, and avoid post‑scan processing that converts text into vector graphics.
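
To see how a DPI setting relates to the image size limits of 50 x 50 and 16,000 x 16,000 pixels, the pixel dimensions of a scanned page can be estimated from the paper size. This is a small arithmetic sketch, assuming the scanner produces one image per page at exactly the configured DPI; the function names are hypothetical.

```python
MIN_SIDE_PX = 50
MAX_SIDE_PX = 16000


def scan_pixels(width_in, height_in, dpi):
    """Approximate pixel dimensions of a page scanned at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def within_image_limits(width_in, height_in, dpi):
    """Check the estimated page image against the pixel limits."""
    width_px, height_px = scan_pixels(width_in, height_in, dpi)
    return min(width_px, height_px) >= MIN_SIDE_PX and max(width_px, height_px) <= MAX_SIDE_PX
```

A US Letter page (8.5 x 11 inches) at 300 DPI comes out at 2550 x 3300 pixels, well inside the limits, so ordinary office scans should not be rejected for their dimensions.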

    3. Use modern SharePoint libraries only

    Store OCR‑dependent documents in modern SharePoint sites and avoid subsites or legacy libraries.

    4. Allow time for indexing

    OCR and PDF indexing may take up to 24 hours, especially during high upload volumes.
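
When you test whether a file is searchable, the 24-hour window should be taken into account before treating a file as skipped. A minimal sketch of that triage logic follows. The 24-hour threshold comes from the guidance above, while the status labels and the function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

INDEXING_WINDOW = timedelta(hours=24)


def ocr_status(uploaded_at, found_in_search, now=None):
    """Classify a file as 'searchable', 'pending', or 'likely skipped'."""
    now = now or datetime.now(timezone.utc)
    if found_in_search:
        return "searchable"
    if now - uploaded_at < INDEXING_WINDOW:
        return "pending"  # still inside the indexing window, test again later
    return "likely skipped"
```

Only files that remain unsearchable after the window has passed are worth comparing against the conditions listed earlier.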

    If there is anything I can do or if you have any other concerns, please do not hesitate to contact me. Your satisfaction is my top priority, and I am here to support you in any way I can.

    Hope you have a great day!

