Editable pdf ocr not fetching correct values

Question

Prasath P 20

Hi,
I’m using Azure OCR, but editable PDFs are not fetching the correct values in both prebuilt and custom models. This is blocking our development. Could someone help me fix this issue?

thanks,
Prasath P

SRILAKSHMI C 18,990 Reputation points Microsoft External Staff Moderator

2025-09-16T08:09:35.62+00:00

Hi Prasath P,

Did you get a chance to check the response below? Thank you!

SRILAKSHMI C 18,990 Reputation points Microsoft External Staff Moderator

2025-09-22T13:12:23.84+00:00

Hi Prasath P,

We haven’t heard back from you on the last response and are checking in to see whether you have a resolution yet. If you do, please share it with the community, as it may help others. Otherwise, reply with more details and we will try to help.

Thank you!

Answer accepted by question author

Alex Burlachenko 21,715 MVP Volunteer Moderator

hi,

the issue often lies in how the ocr service interprets the structure of an editable pdf. unlike a flat image or a scanned pdf, an editable pdf has layers of text and form fields that can confuse the extraction model.

first, try using the 'prebuilt-read' model instead of the layout model. sometimes the read api handles messy pdfs a bit better. you can specify this in the analyze request by setting the 'modelId' to 'prebuilt-read'. the docs for that are here https://learn.microsoft.com/azure/ai-services/document-intelligence/concept-read
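A minimal sketch of what such an analyze request could look like over REST, using only the Python standard library. The function name `build_read_request`, the placeholder endpoint and key, and the choice of API version 2024-11-30 (the version mentioned elsewhere in this thread) are assumptions for illustration, not something the answer above specifies. The request is only built here, not sent.

```python
import json
import urllib.request

API_VERSION = "2024-11-30"  # assumption: the v4.0 GA REST API version


def build_read_request(endpoint: str, key: str, document_url: str) -> urllib.request.Request:
    """Build (but do not send) an analyze request for the prebuilt-read model."""
    # The model ID is part of the URL path: .../documentModels/{modelId}:analyze
    url = (
        f"{endpoint.rstrip('/')}/documentintelligence/documentModels/"
        f"prebuilt-read:analyze?api-version={API_VERSION}"
    )
    # The document is referenced by URL here; a base64Source body is the alternative.
    body = json.dumps({"urlSource": document_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_read_request(
        "https://<your-resource>.cognitiveservices.azure.com",  # placeholder endpoint
        "<your-key>",                                           # placeholder key
        "https://example.com/form.pdf",                         # placeholder document
    )
    print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` should return an `Operation-Location` header, which is then polled with the same key until the analysis result is ready.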

if that does not work, your best bet is to use the custom model. but you need to train it with a diverse set of samples that include your problematic editable pdfs. the key is to include examples of the exact documents that are failing. this teaches the model how to handle your specific layout and fields.

also, check this. before sending the pdf to azure, try converting it to a high resolution image first. sometimes ocr engines perform better on a flattened image rather than a complex editable pdf. you can use a library like pdf2image for this. this might help in other tools too.
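The conversion step might look roughly like the sketch below. It assumes `pdf2image` (and the Poppler tools it depends on) is installed; the helper names `page_image_paths` and `rasterize_pdf`, the 300 DPI default, and the output naming scheme are illustrative choices, not part of the original advice. Whether filled-in field values actually show up in the images depends on how the renderer draws the form, so it is worth checking the output visually.

```python
from pathlib import Path


def page_image_paths(pdf_path: str, page_count: int, out_dir: str) -> list[Path]:
    """Return one PNG path per page, e.g. out/form_page001.png."""
    stem = Path(pdf_path).stem
    return [Path(out_dir) / f"{stem}_page{i:03d}.png" for i in range(1, page_count + 1)]


def rasterize_pdf(pdf_path: str, out_dir: str, dpi: int = 300) -> list[Path]:
    """Render every page of the PDF to a PNG file and return the written paths."""
    # Imported lazily so the path helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images, one per page
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    targets = page_image_paths(pdf_path, len(pages), out_dir)
    for image, target in zip(pages, targets):
        image.save(target, "PNG")
    return targets
```

Each resulting PNG can then be sent to the service in place of the original PDF.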

now for a general tip. always validate the ocr results with a human in the loop, especially during development. you can build a simple ui that shows the extracted text next to the original pdf. this helps you spot patterns in the errors and adjust your training data accordingly.
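Before building a full review UI, a small script can already surface error patterns. The sketch below compares extracted field values against values a person has keyed in by hand; the field names, the whitespace and case normalization, and the helper names are all assumptions made for illustration.

```python
from typing import Optional


def normalize(value: Optional[str]) -> str:
    """Collapse whitespace and ignore case so trivial differences are not flagged."""
    return " ".join((value or "").split()).casefold()


def compare_fields(expected: dict, extracted: dict):
    """Return (accuracy, mismatches), where mismatches maps field -> (expected, extracted)."""
    mismatches = {}
    for name, want in expected.items():
        got = extracted.get(name)  # None when the model returned nothing for the field
        if normalize(got) != normalize(want):
            mismatches[name] = (want, got)
    accuracy = 1 - len(mismatches) / len(expected) if expected else 1.0
    return accuracy, mismatches


if __name__ == "__main__":
    # Hypothetical ground truth and OCR output for one document.
    truth = {"name": "Jane Doe", "invoice_no": "INV-001", "total": "1,250.00", "date": "2025-09-01"}
    ocr = {"name": "jane  doe", "invoice_no": "INV-001", "total": "1,250.00"}
    accuracy, mismatches = compare_fields(truth, ocr)
    print(f"accuracy: {accuracy:.0%}")
    for field, (want, got) in mismatches.items():
        print(f"{field}: expected {want!r}, got {got!r}")
```

Running this over a batch of documents and grouping the mismatches by field name shows which fields need more training samples.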

aha, and one more thing. check the quality of your source pdfs. low resolution or blurry text will always cause problems. make sure your documents are clear and high contrast for the best results.

good luck prasath. ocr is never perfect, but with some tuning, you can get it to a usable state. let me know if focusing on the custom model training helps.

Best regards,

Alex

and "yes" if you would follow me on Q&A - personally, thanks.
P.S. If my answer helps you, please accept it.

https://ctrlaltdel.blog/

2 additional answers

Answer 1

Prasath P 20

Hello @Sina Salam and @Alex Burlachenko ,
Thanks for your advice! After uploading 5 sample documents and using table extract, the OCR is now working with about 95% accuracy.

Answer 2

Hello Prasath P,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

I understand that your editable PDF OCR is not fetching the correct values.

I think that without visible text, any model will fail. To make sure field values are visibly rendered, use the current Azure AI Document Intelligence models and parameters and implement repeatable QA, as listed below:

Open the problematic PDF in Document Intelligence Studio to verify that the text is visible and properly rendered before processing: https://learn.microsoft.com/azure/ai-services/document-intelligence/studio-overview?view=doc-intel-4.0.0
If the file uses XFA, convert or flatten it. For AcroForm PDFs without visible text, generate appearance streams or flatten the form so that the values are displayed. Appearance streams: https://developers.foxit.com/developer-hub/document/appearance-streams-pdf-form-fields/ Example fix: https://stackoverflow.com/questions/71384819/needappearances-pdfrw-pdfobjecttrue-forces-manual-pdf-save-in-acrobat-reader
Reprocess the normalized file using prebuilt-layout?features=keyValuePairs with API version 2024-11-30, and optionally enable languages, ocrHighResolution, and output=pdf for better accuracy and QA. What's new: https://learn.microsoft.com/azure/ai-services/document-intelligence/whats-new?view=doc-intel-4.0.0 REST API: https://learn.microsoft.com/rest/api/aiservices/document-models/analyze-document?view=rest-aiservices-v4.0%20(2024-11-30) Add-on capabilities: https://learn.microsoft.com/azure/ai-services/document-intelligence/concept/add-on-capabilities?view=doc-intel-4.0.0
If extraction issues persist, rasterize the PDF at 300–400 DPI and analyze it with the Read model to isolate font or encoding problems: https://learn.microsoft.com/azure/ai-services/document-intelligence/prebuilt/read?view=doc-intel-4.0.0
Once accuracy is confirmed, build a custom extraction model trained on normalized samples (flattened or with generated appearances) for consistent results. Appearance streams: https://developers.foxit.com/developer-hub/document/appearance-streams-pdf-form-fields
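As a rough illustration of the reprocessing step, the sketch below assembles the analyze URL for prebuilt-layout with optional add-on features passed as a comma-separated `features` query parameter. The function name `build_layout_analyze_url` and the placeholder endpoint are hypothetical, and the sketch covers only `features`; which add-ons a given model and API version accept should be checked against the add-on capabilities page linked above.

```python
from urllib.parse import urlencode


def build_layout_analyze_url(endpoint, features=("keyValuePairs",), api_version="2024-11-30"):
    """Return the analyze URL for prebuilt-layout with the requested add-on features."""
    query = {"api-version": api_version}
    if features:
        # Multiple add-ons go into one comma-separated value.
        query["features"] = ",".join(features)
    base = f"{endpoint.rstrip('/')}/documentintelligence/documentModels/prebuilt-layout:analyze"
    # safe="," keeps the commas readable instead of percent-encoding them.
    return f"{base}?{urlencode(query, safe=',')}"


if __name__ == "__main__":
    print(build_layout_analyze_url(
        "https://<your-resource>.cognitiveservices.azure.com",  # placeholder endpoint
        ["keyValuePairs", "ocrHighResolution"],
    ))
```

The URL is then used for a POST with the `Ocp-Apim-Subscription-Key` header and a JSON body that references the normalized PDF.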

I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.

Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.
