Editable pdf ocr not fetching correct values

Prasath P 20 Reputation points
2025-09-11T10:50:06.91+00:00

Hi,
I’m using Azure OCR, but editable PDFs are not fetching the correct values in both prebuilt and custom models. This is blocking our development. Could someone help me fix this issue?

thanks,
Prasath P

Azure Document Intelligence in Foundry Tools

Answer accepted by question author

Alex Burlachenko 21,715 Reputation points MVP Volunteer Moderator
2025-09-12T09:51:56.0433333+00:00

hi,

the issue often lies in how the ocr service interprets the structure of an editable pdf. unlike a flat image or a scanned pdf, an editable pdf has layers of text and form fields that can confuse the extraction model.

first, try using the 'prebuilt-read' model instead of the layout model. sometimes the read api handles messy pdfs a bit better. you can specify this in the analyze request by setting the 'modelId' to 'prebuilt-read'. the docs for that are here https://learn.microsoft.com/azure/ai-services/document-intelligence/concept-read
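
a minimal sketch of that request over rest, assuming the v4.0 api (version 2024-11-30) and key-based auth. the endpoint, key and file path are placeholders, and `build_analyze_url` and `submit_pdf` are illustrative helper names, not part of any sdk.

```python
import urllib.request

API_VERSION = "2024-11-30"  # assumption: the v4.0 REST API version


def build_analyze_url(endpoint: str, model_id: str = "prebuilt-read") -> str:
    """Build the analyze URL for one model on a Document Intelligence resource."""
    base = endpoint.rstrip("/")
    return (
        f"{base}/documentintelligence/documentModels/"
        f"{model_id}:analyze?api-version={API_VERSION}"
    )


def submit_pdf(endpoint: str, key: str, pdf_path: str) -> str:
    """POST the PDF bytes and return the Operation-Location URL to poll."""
    with open(pdf_path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        build_analyze_url(endpoint),
        data=body,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
    )
    # the service accepts the job and replies with a URL for the result
    with urllib.request.urlopen(request) as response:
        return response.headers["Operation-Location"]
```

the analysis runs asynchronously, so the returned url has to be polled (with the same key header) until the status is no longer running.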

if that does not work, your best bet is to use the custom model. but you need to train it with a diverse set of samples that include your problematic editable pdfs. the key is to include examples of the exact documents that are failing. this teaches the model how to handle your specific layout and fields.

also, check this. before sending the pdf to azure, try converting it to a high resolution image first. sometimes ocr engines perform better on a flattened image rather than a complex editable pdf. you can use a library like pdf2image for this. this might help in other tools too.
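
a rough sketch of that conversion with pdf2image, assuming the package and the poppler utilities it depends on are installed. `page_image_path` and `rasterize_pdf` are hypothetical helper names, and 300 dpi is only a starting point. whether filled-in field values show up in the image depends on the renderer drawing them, so check one page by eye.

```python
from pathlib import Path


def page_image_path(pdf_path: str, page_number: int, out_dir: str) -> Path:
    """Name the output image for one page, e.g. out/form-page-001.png."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}-page-{page_number:03d}.png"


def rasterize_pdf(pdf_path: str, out_dir: str, dpi: int = 300) -> list[Path]:
    """Render each page to a PNG so the form is flattened into pixels."""
    # imported here so the naming helper works without pdf2image installed
    from pdf2image import convert_from_path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target, "PNG")
        saved.append(target)
    return saved
```

each saved png can then be sent to the service as an image instead of the original pdf.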

now for a general tip. always validate the ocr results with a human in the loop, especially during development. you can build a simple ui that shows the extracted text next to the original pdf. this helps you spot patterns in the errors and adjust your training data accordingly.
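
before building a ui, a small script can already flag the fields a person should look at. this is only a sketch: the field names, the 0.9 threshold and the `review_fields` helper are made up for illustration, and the expected values would come from a hand-checked sample.

```python
from difflib import SequenceMatcher


def review_fields(
    expected: dict[str, str],
    extracted: dict[str, str],
    threshold: float = 0.9,
) -> list[tuple[str, str, str, float]]:
    """Return (field, expected, extracted, similarity) for fields needing review."""
    flagged = []
    for name, truth in expected.items():
        got = extracted.get(name, "")  # a missing field counts as empty
        score = SequenceMatcher(None, truth.strip(), got.strip()).ratio()
        if score < threshold:
            flagged.append((name, truth, got, round(score, 2)))
    return flagged


if __name__ == "__main__":
    truth = {"invoice_no": "INV-1001", "total": "250.00"}
    ocr = {"invoice_no": "INV-1001", "total": ""}
    for row in review_fields(truth, ocr):
        print(row)
```

running this over every failing document makes it easier to see whether the same fields come back empty each time, which points at the pdf rather than the model.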

aha, and one more thing. check the quality of your source pdfs. low resolution or blurry text will always cause problems. make sure your documents are clear and high contrast for the best results.

good luck prasath. ocr is never perfect, but with some tuning, you can get it to a usable state. let me know if focusing on the custom model training helps.

Best regards,

Alex

and "yes" if you would follow me at Q&A - personally, thx.
P.S. If my answer helped you, please accept my answer.

https://ctrlaltdel.blog/

2 additional answers

  1. Prasath P 20 Reputation points
    2025-09-25T09:28:22.1633333+00:00

    Hello @Sina Salam and @Alex Burlachenko ,
    Thanks for your advice! After uploading 5 sample documents and using table extract, the OCR is now working with about 95% accuracy.

  2. Sina Salam 29,516 Reputation points Volunteer Moderator
    2025-09-22T16:11:31.53+00:00

    Hello Prasath P,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that OCR is not returning the correct values for your editable PDFs.

    Without visibly rendered text, any model will fail. To make sure field values are visibly rendered, use the current Azure AI Document Intelligence models and parameters, and implement a repeatable QA process as listed below:

    1. Open the problematic PDF in Document Intelligence Studio to verify that the text is visible and properly rendered before processing: https://learn.microsoft.com/azure/ai-services/document-intelligence/studio-overview?view=doc-intel-4.0.0
    2. If the file uses XFA, convert or flatten it. For AcroForm PDFs whose values are not visibly rendered, generate appearance streams or flatten the form so that the values are displayed. Appearance streams guidance: https://developers.foxit.com/developer-hub/document/appearance-streams-pdf-form-fields/ and an example fix: https://stackoverflow.com/questions/71384819/needappearances-pdfrw-pdfobjecttrue-forces-manual-pdf-save-in-acrobat-reader
    3. Reprocess the normalized file using prebuilt-layout?features=keyValuePairs with API version 2024-11-30, and optionally enable languages, ocrHighResolution, and output=pdf for better accuracy and QA. What's new: https://learn.microsoft.com/azure/ai-services/document-intelligence/whats-new?view=doc-intel-4.0.0, REST API: https://learn.microsoft.com/rest/api/aiservices/document-models/analyze-document?view=rest-aiservices-v4.0%20(2024-11-30), and add-on capabilities: https://learn.microsoft.com/azure/ai-services/document-intelligence/concept/add-on-capabilities?view=doc-intel-4.0.0
    4. If extraction issues persist, rasterize the PDF at 300–400 DPI and analyze it with the Read model to isolate font or encoding problems: https://learn.microsoft.com/azure/ai-services/document-intelligence/prebuilt/read?view=doc-intel-4.0.0
    5. Once accuracy is confirmed, build a custom extraction model trained on normalized samples (flattened or with generated appearances) for consistent results. For the appearance streams fix, check this link: https://developers.foxit.com/developer-hub/document/appearance-streams-pdf-form-fields
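
As a quick triage before step 2, the sketch below scans the raw PDF bytes for form markers and builds the query string from step 3. It is a heuristic only: markers stored inside compressed object streams will be missed, so a real PDF library is the more reliable check. `inspect_pdf_form` and `layout_query` are hypothetical helper names.

```python
import re
from urllib.parse import urlencode


def inspect_pdf_form(pdf_bytes: bytes) -> dict[str, bool]:
    """Rough byte-level check for form markers (a heuristic, not a PDF parser)."""
    return {
        "has_acroform": b"/AcroForm" in pdf_bytes,
        "has_xfa": b"/XFA" in pdf_bytes,
        # NeedAppearances true means viewers are asked to draw field values
        # themselves, so stored appearance streams may be missing or stale
        "need_appearances": re.search(rb"/NeedAppearances\s+true", pdf_bytes)
        is not None,
    }


def layout_query(extra_features: tuple[str, ...] = ()) -> str:
    """Query string for prebuilt-layout with key-value pairs plus optional add-ons."""
    features = ["keyValuePairs", *extra_features]
    params = {"api-version": "2024-11-30", "features": ",".join(features)}
    # keep commas readable; the service takes a comma-separated feature list
    return urlencode(params, safe=",")


if __name__ == "__main__":
    sample = b"%PDF-1.7 << /AcroForm << /NeedAppearances true >> >>"
    print(inspect_pdf_form(sample))
    print(layout_query(("ocrHighResolution",)))
```

If `has_xfa` or `need_appearances` comes back true, the file is a likely candidate for the flattening or appearance-stream fix in step 2 before it is analyzed again.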

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.
