PDF text not extracted

Question


Paul Pawletta 21

Hi, I have a PDF document for which my custom neural model returns text that looks wrongly encoded. The entity bounding boxes seem correct; only the text content is wrong, both in the JSON response and in the Form Recognizer Studio visualization. Sending the same document converted to a JPEG gives me the correct text entities!

Are there any requirements for the PDF document? Unfortunately I can't share the original document here, because it contains customer info.

YutongTie-MSFT 53,966 Reputation points Moderator

2022-11-01T23:27:41.93+00:00
Hello @PaulPawletta

The custom neural model has some known limitations, listed below. Is your document in English?

The model doesn't recognize values split across page boundaries.

Custom neural models are only trained in English and model performance will be lower for documents in other languages.

If a dataset labeled for custom template models is used to train a custom neural model, the unsupported field types are ignored.

Custom neural models are limited to 10 build operations per month. Open a support request if you need the limit increased.

Could you please check the document below to see whether your fields are named properly? That can help the model read them.

https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/concept-custom-neural?view=form-recog-3.0.0

Regards,
Yutong

Answer 1

Paul Pawletta 21

@YutongTie-MSFT Yes, the document is in English.
In my case, for this particular 1-page PDF document, all of the returned text is in a wrong encoding. Sending the same page converted to a JPEG gives me the correct text. So it looks to me like there is a problem with the OCR on the document.
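
For anyone who wants to flag this symptom automatically before falling back to the JPEG route, here is a minimal sketch. It assumes the garbling shows up as control characters, private-use or unassigned code points, or the U+FFFD replacement character. That is only a guess about what "wrong encoding" looks like here; text that comes back as ordinary but incorrect letters would not be caught. The function names and the 0.2 threshold are hypothetical, and the input is just whatever text string the service returned.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look like encoding debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = 0
    for c in chars:
        category = unicodedata.category(c)
        # Cc = control, Co = private use, Cn = unassigned; U+FFFD = replacement character
        if category in ("Cc", "Co", "Cn") or c == "\ufffd":
            bad += 1
    return bad / len(chars)


def looks_garbled(text: str, threshold: float = 0.2) -> bool:
    """Heuristic only: True when the share of suspicious characters exceeds the threshold."""
    return suspicious_ratio(text) > threshold
```

If `looks_garbled` returns True for the text extracted from a PDF, one option is to rasterize that page to an image and resubmit it, which is the workaround that gave correct text in my case.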
