An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.
Hi,
The issue often lies in how the OCR service interprets the structure of an editable PDF. Unlike a flat image or a scanned PDF, an editable PDF has layers of text and form fields that can confuse the extraction model.
First, try using the 'prebuilt-read' model instead of the layout model. Sometimes the Read API handles messy PDFs a bit better. You can specify this in the analyze request by setting the 'modelId' to 'prebuilt-read'. The docs for that are here: https://learn.microsoft.com/azure/ai-services/document-intelligence/concept-read
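To make the 'modelId' part concrete, here is a minimal sketch of the analyze request using only the Python standard library. It assumes the v3.1 REST route and the api-version 2023-07-31; your resource may expose a newer route and version, so check the docs linked above. The function names build_analyze_request and start_analysis are made up for this example.

```python
import urllib.request

# Assumption: v3.1 GA REST route and API version. Newer resources may use a
# /documentintelligence/ route with a later api-version instead.
API_VERSION = "2023-07-31"


def build_analyze_request(endpoint, key, pdf_bytes, model_id="prebuilt-read"):
    """Build (but do not send) the POST request that starts an analysis."""
    # The model is chosen by the modelId segment of the URL path.
    url = (
        f"{endpoint.rstrip('/')}/formrecognizer/documentModels/"
        f"{model_id}:analyze?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/pdf",
        },
    )


def start_analysis(request):
    """Send the request and return the URL to poll for the result."""
    with urllib.request.urlopen(request) as response:
        # The service accepts the job and returns the polling URL in this header.
        return response.headers["Operation-Location"]
```

Switching back to the layout model for comparison is then just model_id="prebuilt-layout", which makes it easy to run the same failing PDF through both and compare the output.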
If that does not work, your best bet is to use a custom model. But you need to train it with a diverse set of samples that includes your problematic editable PDFs. The key is to include examples of the exact documents that are failing. This teaches the model how to handle your specific layout and fields.
Also, check this: before sending the PDF to Azure, try converting it to a high-resolution image first. Sometimes OCR engines perform better on a flattened image than on a complex editable PDF. You can use a library like pdf2image for this. This might help in other tools too.
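A rough sketch of the flattening step with pdf2image could look like this. It assumes pdf2image and its Poppler dependency are installed. The 300 dpi default, the file naming, and the optional convert parameter (there only so you can swap in another converter or a stub) are my own choices for this example, not anything Azure requires.

```python
from pathlib import Path


def flatten_pdf(pdf_path, out_dir, dpi=300, convert=None):
    """Render each PDF page to a PNG file and return the written paths."""
    if convert is None:
        # Imported lazily so the module loads even without pdf2image installed.
        from pdf2image import convert_from_path
        convert = convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem

    paths = []
    # convert_from_path returns one PIL image per page.
    for number, page in enumerate(convert(pdf_path, dpi=dpi), start=1):
        target = out_dir / f"{stem}-page{number:03d}.png"
        page.save(target, "PNG")
        paths.append(target)
    return paths
```

Each PNG can then be sent to the service in place of the original PDF. Rendering the pages bakes the form field values into the pixels, so the OCR only sees what a person would see on screen.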
Now for a general tip. Always validate the OCR results with a human in the loop, especially during development. You can build a simple UI that shows the extracted text next to the original PDF. This helps you spot patterns in the errors and adjust your training data accordingly.
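The review UI does not need to be fancy. As one possible starting point, this sketch generates a static HTML page with the PDF on the left and the extracted text on the right. The function name build_review_page and the layout are hypothetical, and it assumes the reviewer's browser can display PDFs inline.

```python
import html


def build_review_page(pdf_url, extracted_text, title="OCR review"):
    """Return an HTML page showing the PDF next to the extracted text."""
    # Escape everything that comes from outside so stray < or & characters
    # in the OCR output cannot break the page.
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>{html.escape(title)}</title>
<style>
  body {{ display: flex; gap: 1rem; margin: 0; height: 100vh; }}
  embed, pre {{ flex: 1; margin: 0; overflow: auto; }}
  pre {{ white-space: pre-wrap; padding: 1rem; }}
</style>
</head>
<body>
<embed src="{html.escape(pdf_url, quote=True)}" type="application/pdf">
<pre>{html.escape(extracted_text)}</pre>
</body>
</html>
"""
```

Write the returned string to a file next to the PDF and open it in a browser. Going through a batch of these pages is usually enough to see which fields or layouts fail consistently.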
Aha, and one more thing. Check the quality of your source PDFs. Low-resolution or blurry text will always cause problems. Make sure your documents are clear and high contrast for the best results.
Good luck, Prasath. OCR is never perfect, but with some tuning you can get it to a usable state. Let me know if focusing on the custom model training helps.
Best regards,
Alex
P.S. If my answer helped you, please accept it. I would also personally appreciate a follow on Q&A, thanks.