PDF text not extracted

Paul Pawletta 21 Reputation points
2022-11-01T15:01:11.003+00:00

Hi, I have a PDF document for which my custom neural model returns garbled text, as if it were wrongly encoded. The entity bounding boxes look correct; only the text content is wrong, both in the JSON response and in the Form Recognizer Studio visualization. Sending the same document converted to a JPEG gives me the correct text for the entities!

Are there any requirements for PDF documents? Unfortunately, I can't share the original document here because it contains customer info.

256067-screenshot-2022-11-01-at-155424.png

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. Paul Pawletta 21 Reputation points
    2022-11-02T08:22:28.673+00:00

    @YutongTie-MSFT yes, the document is in English.
    In my case, for this particular one-page PDF, all of the text comes back in a wrong encoding. Sending the same page converted to a JPEG gives me the correct text, so it looks to me like there is a problem with the OCR on the PDF.
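
A rough way to flag a response like this automatically, so that the JPEG route can be used as a fallback, is to measure how much of the returned text looks like encoding debris. The sketch below is only an illustration and not part of the service or its SDK. It assumes the garbling shows up as private-use, control, unassigned, or replacement characters, which can happen when a PDF's embedded fonts have no usable Unicode mapping. If the bad text consists of ordinary but wrong letters, this check will not catch it. The names `suspect_ratio` and `looks_garbled` and the 0.2 threshold are hypothetical.

```python
import unicodedata

# Unicode categories that rarely appear in legitimately extracted text:
# Cc = control, Co = private use, Cn = unassigned, Cs = surrogate.
SUSPECT_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}
# Ordinary whitespace controls are expected in multi-line content.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def suspect_ratio(text: str) -> float:
    """Return the fraction of characters that look like encoding debris."""
    if not text:
        return 0.0
    suspect = 0
    for ch in text:
        if ch in ALLOWED_CONTROLS:
            continue
        # U+FFFD is the replacement character emitted for undecodable input.
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPECT_CATEGORIES:
            suspect += 1
    return suspect / len(text)


def looks_garbled(text: str, threshold: float = 0.2) -> bool:
    """Hypothetical check: True when at least `threshold` of the text is suspect."""
    return suspect_ratio(text) >= threshold
```

The idea would be to run `looks_garbled` on the text content from the JSON response and, if it returns True, resend the page as an image. The threshold is a guess and would need tuning on real documents.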
