Hello Salomon Mejia,
Welcome to Microsoft Q&A, and thank you for reaching out.
I understand that you want to know how Azure AI Document Intelligence processes different types of PDFs, specifically editable PDFs versus scanned/image PDFs. This can be confusing because some services rely on embedded text layers, while others always run OCR. Azure AI Document Intelligence follows a consistent approach.
Azure AI Document Intelligence uses a unified OCR + vision-language processing pipeline for every document, regardless of whether the PDF is digitally generated or scanned. Even if a PDF contains an editable, selectable text layer, the service does not skip OCR.
For digitally generated (editable) PDFs, it might seem that the service could simply extract the embedded text directly. Instead, Document Intelligence processes the file using its own OCR/VLM pipeline. This design is meant to give consistent output quality, accurate layout interpretation, and clean normalization across multilingual or complex documents. In short, the service does not rely on the embedded text layer inside the PDF.
For scanned or image-based PDFs, the same OCR engine is used to detect text, extract structure, read tables, identify key-value pairs, and understand the overall page layout. Since these documents contain no text layer, OCR is fully required.
The reason Azure applies OCR to all documents, editable or scanned, is to avoid inconsistencies that come from embedded text layers, which may be missing, corrupted, incorrectly encoded, or structurally unreliable. A single pipeline keeps behavior and accuracy uniform across every document type.
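To make this concrete from the caller's side, here is a minimal sketch in Python that builds (but does not send) an analyze request using only the standard library. The helper name `build_analyze_request` is hypothetical, and the API version `2024-11-30`, the `prebuilt-read` model ID, and the URL path are assumptions based on the public v4.0 REST API as I understand it, so please verify them against your own resource. In this sketch the request looks the same whether the PDF is editable or scanned.

```python
import urllib.request

# Assumptions (not from the original answer): API version, model ID and URL
# path follow the v4.0 REST API; check them against your resource before use.
API_VERSION = "2024-11-30"
MODEL_ID = "prebuilt-read"


def build_analyze_request(endpoint: str, key: str, pdf_bytes: bytes) -> urllib.request.Request:
    """Build an analyze request for a PDF without sending it.

    Hypothetical helper: the same request is built for editable and
    scanned PDFs; nothing here depends on an embedded text layer.
    """
    url = (
        f"{endpoint.rstrip('/')}/documentintelligence/documentModels/"
        f"{MODEL_ID}:analyze?api-version={API_VERSION}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,   # resource key (placeholder)
        "Content-Type": "application/pdf",  # raw PDF bytes in the body
    }
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method="POST")
```

You would pass the returned object to `urllib.request.urlopen` with your real endpoint and key. As far as I know, the service answers with an `Operation-Location` header that you then poll for the result, but treat that flow as something to confirm in the official documentation.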
Please refer to this
I hope this helps. Do let me know if you have any further queries.
If this answers your query, please click "Accept Answer" and select "Yes" for "Was this answer helpful".
Thank you!