
Is every document, even an editable PDF, processed with OCR in Azure AI Document Intelligence?

Salomon Mejia 0 Reputation points
2025-11-19T21:46:08.0233333+00:00

How does Azure AI Document Intelligence handle digitally generated (editable) PDFs versus scanned/image-based PDFs? Specifically, does the service detect when a PDF is already editable and extract the embedded text and layout directly, or does it always run the OCR/VLM pipeline for every document?

Azure Document Intelligence in Foundry Tools

2 answers

  1. SRILAKSHMI C 16,705 Reputation points Microsoft External Staff Moderator
    2025-11-20T17:30:19.5333333+00:00

    Hello Salomon Mejia,

    Welcome to Microsoft Q&A, and thank you for reaching out.

    I understand that you want to know how Azure AI Document Intelligence processes different types of PDFs, specifically editable PDFs versus scanned/image-based PDFs. This can be confusing because some services rely on embedded text layers, while others always run OCR. Azure AI Document Intelligence follows a consistent approach.

    Azure AI Document Intelligence uses a unified OCR + vision-language processing pipeline for every document, regardless of whether the PDF is digitally generated or scanned. Even if a PDF contains an editable, selectable text layer, the service does not skip OCR.

    For digitally generated (editable) PDFs, although it might seem that the service should simply extract the embedded text directly, Document Intelligence instead processes the file using its own OCR/VLM pipeline. This design ensures consistent output quality, accurate layout interpretation, and clean normalization across multilingual or complex documents. In short, the service does not rely on the embedded text layer inside the PDF.

    For scanned or image-based PDFs, the same OCR engine is used to detect text, extract structure, read tables, identify key-value pairs, and understand the overall page layout. Since these documents contain no text layer, OCR is required.

    The reason Azure applies OCR to all documents, whether editable or scanned, is to avoid the inconsistencies that come from embedded text layers, which may be missing, corrupted, incorrectly encoded, or structurally unreliable. A single pipeline ensures uniform behavior and accuracy across every document type.


    I hope this helps. Let me know if you have any further questions.



    Thank you!


  2. Q&A Assist
    2025-11-19T21:46:23.5633333+00:00
    AI-generated content. This question contains an answer created with AI from Q&A Assist.

    Azure AI Document Intelligence processes both digitally generated (editable) PDFs and scanned/image-based PDFs, but it handles them differently. For editable PDFs, the service can extract the embedded text and layout directly, without running OCR (optical character recognition).

    In contrast, for scanned or image-based PDFs, the OCR process is necessary to extract text, as these documents do not contain embedded text that can be directly accessed. Therefore, Azure AI Document Intelligence optimizes its processing by detecting the type of PDF and applying the appropriate method for text extraction accordingly.
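
The behavior described in this thread can be checked against your own files by analyzing a digitally generated PDF and a scanned copy of the same document, then comparing the returned words. The following is a minimal Python sketch, assuming the analysis result has already been loaded into a dictionary shaped like the service's JSON output (`pages`, `pageNumber`, `words`, `content`, `confidence`). The helper name `summarize_word_confidence` and the sample values are hypothetical, and the comparison does not by itself prove which extraction path the service used.

```python
from statistics import mean


def summarize_word_confidence(analyze_result: dict) -> list[dict]:
    """Return one summary per page: word count, mean and minimum confidence."""
    summaries = []
    for page in analyze_result.get("pages", []):
        # Collect the confidence of every word that reports one.
        confidences = [
            word["confidence"]
            for word in page.get("words", [])
            if "confidence" in word
        ]
        summaries.append(
            {
                "page": page.get("pageNumber"),
                "words": len(confidences),
                "mean_confidence": mean(confidences) if confidences else None,
                "min_confidence": min(confidences) if confidences else None,
            }
        )
    return summaries


# Hypothetical sample result, for illustration only (not real service output).
sample_result = {
    "pages": [
        {
            "pageNumber": 1,
            "words": [
                {"content": "Invoice", "confidence": 0.98},
                {"content": "Total", "confidence": 0.92},
            ],
        },
        {"pageNumber": 2, "words": []},
    ]
}

for summary in summarize_word_confidence(sample_result):
    print(summary)
```

Running the same helper on the results for both versions of a document gives a like-for-like view of how many words were returned and how confident the service was in them.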



