SharePoint OCR inconsistently processing scanned documents with identical format and ingestion process

George | Workscene 0

We are experiencing inconsistent OCR behaviour in SharePoint Online for scanned PDF documents.

All files follow the same ingestion process (MFP scan → FTP to server → synced SharePoint document library with OCR enabled) and use the same PDF format and scan settings. Some documents are successfully OCR’d and searchable, while others upload normally but do not generate OCR text.

There are no upload errors, and affected PDFs appear normal when previewed. Library configuration is consistent, and OCR works for some documents in the same location.

In short, some scanned PDF documents processed through this workflow become searchable in SharePoint, while others do not, with no clear reason.

Please advise on:

Known conditions where SharePoint skips OCR for PDFs
Whether OCR can fail silently during indexing
What diagnostics or logs are available to determine why OCR is not applied
Recommended best practices to ensure consistent OCR for scanned PDFs

Answer 1

Dear @George | Workscene,

Thank you for taking the time to share the details of your observation. I understand how important it is for scanned documents to be consistently searchable in SharePoint, especially when the same scanning and upload process is used.

Based on Microsoft’s documented behavior for SharePoint Online, Optical Character Recognition (OCR) is processed as a background service after files are uploaded, and it may not always be applied consistently to all scanned PDF documents even if they appear identical and are uploaded to the same document library.

Regarding your inquiries, please kindly allow me to clarify some information:

Known conditions where SharePoint skips OCR for PDFs

1. PDF structure differences (even with identical scan settings)

Microsoft OCR is applied only to image‑based or hybrid PDFs. However, two PDFs created with the same scanner profile can still differ internally, which affects OCR eligibility:

Image-only PDFs → eligible for OCR
Hybrid PDFs (image + text layer) → eligible for OCR
Vector-rendered or flattened text (looks like text but is stored as drawing paths) → not OCR’d
Partially image-only PDFs → OCR may succeed on some pages and fail on others

2. File size, resolution, and text extraction limits

Maximum PDF size: 50 MB
Images must be at least 50 x 50 pixels and not larger than 16,000 x 16,000 pixels.
Character extraction cap: OCR extracts and indexes up to ~2 million characters per file
Resolution constraints: Extremely low DPI pages may be skipped

3. Site and library placement limitations

OCR is applied only to supported SharePoint locations: Root sites, hub sites, associated sites. Files uploaded or synced into unsupported scopes will never receive OCR, even though the library settings appear identical.

4. Encrypted or restricted PDFs

If a PDF is password-protected, or was generated by the scanner with embedded restrictions, its content cannot be read and OCR is skipped.
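The size limit and the encryption condition listed above can be checked on the server before files are synced. The following is a minimal sketch, not a Microsoft tool: the function name `preflight_pdf` is hypothetical, the 50 MB figure is the one quoted in this answer, and the `/Encrypt` byte search is only a heuristic that can occasionally produce a false positive.

```python
from pathlib import Path

# 50 MB limit as quoted in this answer (assumed to mean 50 * 1024 * 1024 bytes)
MAX_PDF_BYTES = 50 * 1024 * 1024


def preflight_pdf(path):
    """Return a list of reasons the PDF might be skipped by OCR (empty if none found)."""
    data = Path(path).read_bytes()
    problems = []

    # Every PDF starts with a "%PDF-" header
    if not data.startswith(b"%PDF-"):
        problems.append("not a PDF (missing %PDF- header)")

    if len(data) > MAX_PDF_BYTES:
        problems.append(f"file is {len(data)} bytes, over the 50 MB limit")

    # Heuristic: encrypted or restricted PDFs reference an /Encrypt
    # dictionary from the trailer, which is stored uncompressed.
    if b"/Encrypt" in data:
        problems.append("PDF appears to be encrypted or restricted")

    return problems
```

Files that return a non-empty list could be moved to a holding folder for review instead of being synced. This check does not detect the vector-text or mixed-page cases described in point 1.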

Whether OCR can fail silently during indexing

Yes, and this is expected behavior.

OCR runs as asynchronous post-processing
No user-visible error, warning, or status is generated when OCR fails or is skipped
Search indexing proceeds, but without OCR text

What diagnostics or logs are available to determine why OCR is not applied

No tenant-facing OCR logs
No per-file OCR success/failure status
No retry control or manual trigger

The only practical verification is functional testing: search for text strings known to appear in the document. For image files you can also check whether Extracted text metadata exists, but PDFs do not expose this field.
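The functional test can be scripted against the SharePoint Search REST endpoint (`/_api/search/query`). The sketch below only builds the query URL. The function name `build_search_url`, the example site and phrase, and the OData-style doubling of single quotes are assumptions for illustration, and authentication is not shown because it depends on your tenant setup.

```python
from urllib.parse import quote


def build_search_url(site_url, phrase, library_path):
    """Build a Search REST URL that looks for an exact phrase in PDFs under a library."""
    # KQL: exact phrase, restricted to PDF files below the given library path.
    # Double quotes are stripped from the phrase so they cannot break the KQL.
    clean_phrase = phrase.replace('"', "")
    kql = f'"{clean_phrase}" AND path:"{library_path}" AND filetype:pdf'

    # The querytext value is wrapped in single quotes in the URL;
    # doubling embedded single quotes is an assumed OData-style escape.
    escaped = kql.replace("'", "''")

    return (
        f"{site_url.rstrip('/')}/_api/search/query"
        f"?querytext='{quote(escaped, safe='')}'"
        "&selectproperties='Title,Path'"
    )


# Hypothetical values: replace with your own site, library, and a phrase
# that you know is printed on the scanned page.
url = build_search_url(
    "https://contoso.sharepoint.com/sites/scans",
    "INV 20417",
    "https://contoso.sharepoint.com/sites/scans/Shared Documents",
)
print(url)
```

If an authenticated GET on this URL returns no rows for a phrase that is visibly printed in the document, the file most likely has no OCR text in the index.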

Recommended best practices to ensure consistent OCR for scanned PDFs

1. Perform OCR before uploading to SharePoint

Microsoft Search works most reliably when PDFs already contain a text layer.
Best practice is to apply OCR at the scanner/MFP stage or via a dedicated OCR tool (e.g. Adobe).
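If the MFP cannot produce searchable PDFs, the OCR step can run on the server that receives the FTP upload. As one possible sketch, this uses the open-source OCRmyPDF command-line tool rather than Adobe; that tool choice and the helper names `build_ocr_command` and `ocr_before_upload` are assumptions, not part of the original recommendation.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(src, dst):
    """Build the OCRmyPDF command line for one file."""
    # --skip-text leaves pages that already contain a text layer untouched,
    # so hybrid PDFs are not processed twice.
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def ocr_before_upload(src, dst):
    """Add a text layer to src and write the result to dst."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    # check=True raises CalledProcessError if OCR fails, so failures are
    # visible here instead of being silent after upload.
    subprocess.run(build_ocr_command(src, dst), check=True)
    return Path(dst)
```

Writing the output into the synced folder only after OCR succeeds means every file SharePoint receives already has a text layer.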

2. Standardize scan settings

Use consistent scan configurations: ≥300 DPI, grayscale or black‑and‑white, no PDF encryption, and avoid post‑scan processing that converts text into vector graphics.

3. Use modern SharePoint libraries only

Store OCR‑dependent documents in modern SharePoint sites and avoid subsites or legacy libraries.

4. Allow time for indexing

OCR and PDF indexing may take up to 24 hours, especially during high upload volumes.
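When automating the search check, it helps to wait until this window has passed before treating a file as failed. A small sketch, assuming the 24-hour figure above and a hypothetical helper name `ready_to_verify`:

```python
from datetime import datetime, timedelta, timezone

# Indexing window taken from the "up to 24 hours" guidance in this answer
INDEXING_WINDOW = timedelta(hours=24)


def ready_to_verify(uploaded_at, now=None):
    """Return True once the indexing window has elapsed since upload."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - uploaded_at >= INDEXING_WINDOW
```

Pass a timezone-aware upload timestamp; files that are not yet ready should be rechecked later, not reported as missing OCR.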

If there is anything I can do or if you have any other concerns, please do not hesitate to contact me. Your satisfaction is my top priority, and I am here to support you in any way I can.

Hope you have a great day!

Jade Ng 11,670 Reputation points Microsoft External Staff Moderator

2026-04-24T01:28:53.5433333+00:00

Dear @George | Workscene,

I hope this message finds you well!

I just wanted to check in and see if you have any updates or additional information you’d like to share regarding your situation. If there's anything new or if you need further assistance, please don’t hesitate to let me know.

Looking forward to your response!
