Is every document, even an editable PDF, processed with OCR in Azure AI Document Intelligence?

Question

Salomon Mejia 0

How does Azure AI Document Intelligence handle digitally generated (editable) PDFs versus scanned/image-based PDFs? Specifically, does the service detect when a PDF is already editable and extract the embedded text and layout directly, or does it always run the OCR/VLM pipeline for every document?

2 answers

Answer 1

Hello Salomon Mejia,

Welcome to Microsoft Q&A, and thank you for reaching out.

I understand that you want to know how Azure AI Document Intelligence processes different types of PDFs, specifically editable PDFs versus scanned/image PDFs. This can be confusing because some services rely on embedded text layers, while others always run OCR. Azure AI Document Intelligence follows a consistent approach.

Azure AI Document Intelligence uses a unified OCR + vision-language processing pipeline for every document, regardless of whether the PDF is digitally generated or scanned. Even if a PDF contains an editable, selectable text layer, the service does not skip OCR.

For digitally generated (editable) PDFs, while it may appear that the service should just extract the embedded text directly, Document Intelligence instead processes the file using its own OCR/VLM pipeline. This design ensures consistent output quality, accurate layout interpretation, and clean normalization across multilingual or complex documents. In short, the service does not rely on the embedded text layer inside the PDF.

For scanned or image-based PDFs, the same OCR engine is used to detect text, extract structure, read tables, identify key-value pairs, and understand the overall page layout. Since these documents contain no text layer, OCR is fully required.

The reason Azure applies OCR to all documents, editable or scanned, is to avoid inconsistencies that come from embedded text layers, which may be missing, corrupted, incorrectly encoded, or structurally unreliable. A single pipeline ensures uniform behavior and accuracy across every document type.
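
If you want to check on your side which kind of PDF you are submitting, the following is a minimal Python sketch of a local heuristic. It is not part of the Document Intelligence API and says nothing about how the service works internally; the function name `has_font_resources` is made up for this illustration. It assumes an unencrypted PDF and relies only on the fact that text in a PDF is drawn with a font resource, so a file with no `/Font` entry is very likely image-only (scanned).

```python
import re
import zlib

# Captures the raw bytes between the "stream" and "endstream" keywords.
_STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def has_font_resources(pdf_bytes: bytes) -> bool:
    """Heuristic (hypothetical helper): True if the PDF appears to declare a font.

    A False result suggests an image-only PDF with no text layer.
    A True result only means a font is declared; it does not prove that the
    embedded text is complete or correctly encoded.
    """
    # Uncompressed dictionaries can be searched directly.
    if b"/Font" in pdf_bytes:
        return True
    # PDF 1.5+ files may keep dictionaries inside Flate-compressed object
    # streams, so also look inside every stream that inflates cleanly.
    for match in _STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing end-of-line bytes
            # that sit between the stream data and "endstream".
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image); skip it
        if b"/Font" in inflated:
            return True
    return False
```

To try it on a file, read the bytes first, for example `has_font_resources(open("sample.pdf", "rb").read())`, where `sample.pdf` is a placeholder name. Whatever the result, the explanation above still applies: the service runs its own pipeline on the document either way, so this check is only useful for your own bookkeeping.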

Please refer to this

I hope this helps. Do let me know if you have any further queries.

If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".

Thank you!

SRILAKSHMI C 16,705 Reputation points Microsoft External Staff Moderator

2025-11-21T17:33:36.1966667+00:00

Hi Salomon Mejia,

Following up to see if the above answer was helpful. If it answers your query, please click Accept Answer and Yes for "Was this answer helpful". If you have any further queries, do let us know.

Thank you!
SRILAKSHMI C 16,705 Reputation points Microsoft External Staff Moderator

2025-11-24T18:25:00.01+00:00

Hi Salomon Mejia,

Just checking in to see if you have had a chance to review my response to your question.

If you are still facing any issues, please don't hesitate to reach out to us. We are happy to assist you.

Looking forward to your response, and I appreciate your time on this.

If you feel that your queries have been resolved, please accept the answer by clicking "Upvote" and "Accept Answer" on the post.

Thank you!

Answer 2

Azure AI Document Intelligence processes both digitally generated (editable) PDFs and scanned/image-based PDFs, but it handles them differently. For editable PDFs, the service is capable of extracting the embedded text and layout directly without running the OCR (Optical Character Recognition) process. This means that if a PDF is already editable, it can extract text and structure without needing to perform OCR on it.

In contrast, for scanned or image-based PDFs, the OCR process is necessary to extract text, as these documents do not contain embedded text that can be directly accessed. Therefore, Azure AI Document Intelligence optimizes its processing by detecting the type of PDF and applying the appropriate method for text extraction accordingly.

Mauricio Villegas 5 Reputation points

2026-02-11T05:39:42.6766667+00:00

This AI response contradicts what SRILAKSHMI said. PDFs are always OCRed. If it is known that the AI is incorrect, then shouldn't this be removed or somehow labeled as incorrect? Otherwise people might be misled if they don't read the entire thread.
