Dear @George | Workscene,
Thank you for taking the time to share the details of your observation. I understand how important it is for scanned documents to be consistently searchable in SharePoint, especially when the same scanning and upload process is used.
Based on Microsoft’s documented behavior for SharePoint Online, Optical Character Recognition (OCR) is processed as a background service after files are uploaded, and it may not always be applied consistently to all scanned PDF documents even if they appear identical and are uploaded to the same document library.
Regarding your inquiries, please kindly allow me to clarify some information:
Known conditions where SharePoint skips OCR for PDFs
1. PDF structure differences (even with identical scan settings)
Microsoft OCR is applied only to image‑based or hybrid PDFs. However, two PDFs created with the same scanner profile can still differ internally, which affects OCR eligibility:
- Image-only PDFs > eligible for OCR
- Hybrid PDFs (image + text layer) > eligible
- Vector-rendered or flattened text (looks like text but stored as drawing paths) > not OCR’d
- Partially image-only PDFs > OCR may succeed on some pages and fail on others
2. File size, resolution, and text extraction limits
- Maximum PDF size: 50 MB
- Images must be at least 50 x 50 pixels and not larger than 16,000 x 16,000 pixels.
- Character extraction cap: OCR extracts and indexes up to ~2 million characters per file
- Resolution constraints: Extremely low DPI pages may be skipped
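The limits above can be checked before upload. The following is a minimal sketch, not a Microsoft tool: the function name, the reading of "50 MB" as binary megabytes, and the 150 DPI cut-off are all assumptions made for illustration (the limits above only say "extremely low DPI"). The character cap is not checked because it applies to the extracted text, not to the file itself.

```python
# Pre-upload check against the limits listed above (illustrative sketch).
MAX_PDF_BYTES = 50 * 1024 * 1024  # "50 MB"; binary megabytes is an assumption
MIN_IMAGE_SIDE = 50               # pixels
MAX_IMAGE_SIDE = 16_000           # pixels
LOW_DPI_THRESHOLD = 150           # hypothetical cut-off for "extremely low DPI"


def ocr_limit_issues(size_bytes, page_images, dpi=None):
    """Return a list of human-readable problems; an empty list means no issue found.

    page_images is a list of (width_px, height_px) tuples, one per scanned page.
    """
    issues = []
    if size_bytes > MAX_PDF_BYTES:
        issues.append(f"file is {size_bytes} bytes, above the 50 MB limit")
    for index, (width, height) in enumerate(page_images, start=1):
        if min(width, height) < MIN_IMAGE_SIDE:
            issues.append(f"page {index}: image smaller than 50 x 50 pixels")
        elif max(width, height) > MAX_IMAGE_SIDE:
            issues.append(f"page {index}: image larger than 16,000 x 16,000 pixels")
    if dpi is not None and dpi < LOW_DPI_THRESHOLD:
        issues.append(f"scan resolution {dpi} DPI may be too low for OCR")
    return issues
```

For example, `ocr_limit_issues(10 * 1024 * 1024, [(2480, 3508)], dpi=300)` describes a 10 MB single-page A4 scan at 300 DPI and reports no issues.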
3. Site and library placement limitations
OCR is applied only in supported SharePoint locations: root sites, hub sites, and associated sites. Files uploaded or synced into unsupported scopes will never receive OCR, even though the library settings appear identical.
4. Encrypted or restricted PDFs
If a PDF is password-protected or was generated by the scanner with embedded restrictions, OCR is skipped.
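To compare two PDFs from the same scanner against conditions 1 and 4 above, a rough local check can help. The sketch below is only a heuristic and rests on assumptions: it scans the raw bytes for common PDF markers (`/Encrypt`, `/Subtype /Image`, `/Font`), so it can misclassify files that store these dictionaries in compressed object streams. The function name and category labels are hypothetical.

```python
import re


def classify_pdf(data: bytes) -> str:
    """Crudely classify a PDF from its raw bytes (heuristic, not a parser)."""
    if not data.startswith(b"%PDF-"):
        return "not-a-pdf"
    if b"/Encrypt" in data:
        # An encryption dictionary is present, so OCR is likely to be skipped.
        return "encrypted"
    has_image = re.search(rb"/Subtype\s*/Image\b", data) is not None
    has_font = re.search(rb"/Font\b", data) is not None
    if has_image and has_font:
        return "hybrid"        # page images plus a text layer
    if has_image:
        return "image-only"    # typical raw scan
    if has_font:
        return "text"          # already contains real text
    # Neither marker found: possibly vector-drawn text or compressed objects.
    return "unknown"
```

Usage: `classify_pdf(open("scan.pdf", "rb").read())`, where `scan.pdf` is a placeholder path. If a searchable file and a non-searchable file land in different categories, the scanner output differs internally even though the scan settings match.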
Whether OCR can fail silently during indexing
Yes, and this is expected behavior.
- OCR runs as asynchronous post-processing
- No user-visible error, warning, or status is generated when OCR fails or is skipped
- Search indexing proceeds, but without OCR text
What diagnostics or logs are available to determine why OCR is not applied
- No tenant-facing OCR logs
- No per-file OCR success/failure status
- No retry control or manual trigger
The only practical verification is functional testing (searching for known text strings) or checking whether Extracted text metadata exists. That check works for images only; PDFs do not expose this field.
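The functional test can be scripted. The sketch below assumes the Microsoft Graph search endpoint (`POST https://graph.microsoft.com/v1.0/search/query`) with the `driveItem` entity type and a KQL `filetype:pdf` filter; please verify these against the current Graph documentation. It only builds the request body and reads file names out of a response, and the helper names are hypothetical. Sending the request requires an access token with search permissions, which is not shown here.

```python
def build_search_request(phrase, size=25):
    """Build a Graph search request body for an exact phrase in PDF files."""
    cleaned = phrase.replace('"', "")  # keep the quoted KQL phrase well-formed
    return {
        "requests": [
            {
                "entityTypes": ["driveItem"],
                "query": {"queryString": f'"{cleaned}" filetype:pdf'},
                "from": 0,
                "size": size,
            }
        ]
    }


def hit_names(response):
    """Collect file names from a Graph search response dictionary."""
    names = []
    for result in response.get("value", []):
        for container in result.get("hitsContainers", []):
            # "hits" can be absent when the container reports zero results.
            for hit in container.get("hits", []):
                name = hit.get("resource", {}).get("name")
                if name:
                    names.append(name)
    return names
```

Pick a phrase that appears only in the scanned image of a test document. If that file's name is missing from `hit_names(...)` after the indexing window has passed, OCR text was most likely not indexed for it.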
Recommended best practices to ensure consistent OCR for scanned PDFs
1. Perform OCR before uploading to SharePoint
- Microsoft Search works most reliably when PDFs already contain a text layer.
- Best practice is to apply OCR at the scanner/MFP stage or via a dedicated OCR tool (e.g. Adobe).
2. Standardize scan settings
Use consistent scan configurations: ≥300 DPI, grayscale or black‑and‑white, no PDF encryption, and avoid post‑scan processing that converts text into vector graphics.
3. Use modern SharePoint libraries only
Store OCR‑dependent documents in modern SharePoint sites and avoid subsites or legacy libraries.
4. Allow time for indexing
OCR and PDF indexing may take up to 24 hours, especially during high upload volumes.
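Because OCR gives no status, the 24-hour window is the only signal for deciding when a missing search result is worth investigating. A minimal sketch of that triage follows; the function name and status labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone

INDEXING_WINDOW = timedelta(hours=24)  # "up to 24 hours" from the guidance above


def ocr_status(uploaded_at, found_in_search, now=None):
    """Classify a file as searchable, still pending, or likely skipped by OCR."""
    now = now or datetime.now(timezone.utc)
    if found_in_search:
        return "searchable"
    if now - uploaded_at < INDEXING_WINDOW:
        return "wait"  # still inside the indexing window
    return "ocr-likely-skipped"
```

A file that is still not searchable after the window is a candidate for running OCR locally and uploading it again.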
If there is anything I can do or if you have any other concerns, please do not hesitate to contact me. Your satisfaction is my top priority, and I am here to support you in any way I can.
Hope you have a great day!