Azure Document Intelligence Read model not producing properly searchable PDFs for CJK-language documents

Question

Jayant, Ashish 0

I am using the Azure Document Intelligence Read model to produce searchable PDFs, but the text layer in the output is not actually searchable. For any CJK-language document, copying text from the PDF or searching for it yields characters like "ÿ ÿ ÿ ÿ" instead of the original text. How can I fix this?

User's image

Jayant, Ashish 0 Reputation points

2024-08-30T08:47:19.65+00:00

Hi @santoshkc

Is it possible to resolve this during the OCR process itself, so that the Azure Document Intelligence Read model returns a searchable PDF with properly embedded fonts and text encoding, rather than fixing it afterwards with another PDF generation tool? If not, could you please provide sample code that addresses the issue? I have tried PdfSharp but haven't had any luck.

santoshkc 15,245 Reputation points Microsoft External Staff Moderator

2024-08-30T17:32:36.06+00:00

Hi @Jayant, Ashish,

Thank you for your follow-up query.

Unfortunately, the Azure Document Intelligence Read model has no built-in option to embed fonts or fix text encoding in the searchable PDFs it generates during OCR, and there is no sample code available that directly addresses this. To resolve the issue, you'll need to post-process the PDF so that its text layer uses a properly embedded font.

I hope you understand! Thank you.
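If you do post-process the PDF, it helps to have a quick way to tell whether the result is actually fixed. The sketch below is only a heuristic and not part of Azure Document Intelligence: it assumes you have already extracted the text of a page as a Python string with whatever PDF tool you use, and checks whether that text is mostly CJK characters. The function names, the chosen Unicode ranges, and the 0.5 threshold are all illustrative assumptions.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in common CJK ranges."""
    ranges = (
        (0x3040, 0x30FF),  # Hiragana and Katakana
        (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
        (0x4E00, 0x9FFF),  # CJK Unified Ideographs
        (0xAC00, 0xD7AF),  # Hangul Syllables
    )
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(any(lo <= ord(c) <= hi for lo, hi in ranges) for c in chars)
    return hits / len(chars)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: text extracted from a CJK page should be mostly CJK.

    The threshold is an arbitrary starting point; pages that mix CJK with a
    lot of Latin text or digits will need a lower value.
    """
    return cjk_ratio(text) < threshold
```

Running `looks_garbled` on the extracted text before and after post-processing gives a rough pass/fail signal. An empty text layer is also reported as garbled, which is usually what you want here.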

Answer 1

Hi @Jayant, Ashish,

The garbled text ("ÿ ÿ ÿ ÿ") you're seeing in searchable PDFs generated by the Azure Document Intelligence Read model for CJK languages is likely caused by encoding or font issues during PDF creation. I was able to reproduce the issue using the text in your screenshot.

To fix this, check that the PDF generation tool embeds the necessary CJK fonts correctly. Consider generating the PDF in PDF/A format to ensure fonts are embedded and text is properly encoded. Proper font embedding and specifying the language are key to resolving this issue.
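As a rough illustration of what "embedded correctly" means at the PDF level, here is a minimal sketch that inspects a single font dictionary. It assumes the dictionary has already been loaded into a plain Python mapping using the standard PDF key names (`/FontDescriptor`, `/FontFile2`, `/ToUnicode`, `/DescendantFonts`); how you obtain that mapping depends on your PDF library and is not shown. The `font_report` helper is hypothetical, not an API of any library.

```python
from typing import Any, Mapping

# Keys under a font descriptor that hold an embedded font program.
EMBEDDED_FONT_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def font_report(font: Mapping[str, Any]) -> dict:
    """Summarise one font dictionary from a page's /Resources /Font entry.

    `font` is a plain mapping mirroring the PDF font dictionary; indirect
    references are assumed to be resolved already (an assumption here).
    """
    # CJK text normally uses a composite (Type0) font whose descriptor lives
    # on the descendant CIDFont rather than on the top-level dictionary.
    candidates = [font, *font.get("/DescendantFonts", [])]
    embedded = any(
        key in candidate.get("/FontDescriptor", {})
        for candidate in candidates
        for key in EMBEDDED_FONT_KEYS
    )
    return {
        "name": font.get("/BaseFont", "<unnamed>"),
        "embedded": embedded,
        # Without a /ToUnicode CMap, viewers often cannot map glyphs back to
        # characters, so copy and search may return placeholder characters.
        "has_to_unicode": "/ToUnicode" in font,
    }
```

A font that reports `embedded` or `has_to_unicode` as `False` on a CJK page is a likely candidate for the copy-and-search problem, though this check alone does not prove the mapping inside the CMap is correct.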

User's image

See: Document Intelligence read model.

I hope you understand! Thank you.
