Azure Document Intelligence read model not producing a properly searchable PDF for CJK-language documents

Jayant, Ashish 0 Reputation points
2024-08-29T08:23:38.6866667+00:00

I am using the Azure Document Intelligence read model to produce searchable PDFs, but the text in the resulting PDF is not properly searchable. Whenever I copy text from the PDF and search for it, I get characters like "ÿ ÿ ÿ ÿ". This happens for any CJK-language document. How can I fix this issue?

User's image

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. santoshkc 7,845 Reputation points Microsoft Vendor
    2024-08-29T13:59:40.6866667+00:00

    Hi @Jayant, Ashish,

    The garbled text ("ÿ ÿ ÿ ÿ") you're seeing in searchable PDFs generated by the Azure Document Intelligence read model for CJK languages is likely caused by encoding or font issues during PDF creation. I was able to reproduce the issue using the text from your screenshot.

    To fix this, check that the PDF generation tool embeds the necessary CJK fonts correctly. Consider generating the PDF in PDF/A format, which requires fonts to be embedded, so that the text is properly encoded. Embedding the fonts correctly and specifying the document language are the key steps in resolving this issue.

    User's image

    See: Document Intelligence read model.

    I hope you understand! Thank you.
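
To check whether a given PDF shows the symptom described in the question, one option is to test the text extracted from its text layer. The following is a minimal sketch, not part of Document Intelligence. It assumes the text has already been extracted from the PDF by some other tool (for example, copy and paste, or a PDF library of your choice). The names `CJK_RANGES`, `cjk_ratio`, and `looks_garbled`, and the 0.5 threshold, are hypothetical choices made for illustration.

```python
# Unicode ranges covering the most common CJK scripts (not exhaustive).
CJK_RANGES = [
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
]


def is_cjk(ch: str) -> bool:
    """Return True if the character's code point lies in one of CJK_RANGES."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in CJK_RANGES)


def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK (0.0 for empty text)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_cjk(c) for c in chars) / len(chars)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: text extracted from a CJK page should be mostly CJK characters.

    Empty text also counts as garbled, because it means nothing is searchable.
    The threshold is an assumption; pages with a lot of Latin text need a lower one.
    """
    return cjk_ratio(text) < threshold
```

For example, `looks_garbled("ÿ ÿ ÿ ÿ")` returns `True`, while `looks_garbled("東京都 日本語")` returns `False`. This is only a heuristic for detecting the problem. It does not fix the PDF itself.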
