How to extract table text from pdf documents using Form Recognizer service?

Question

MachineLearning 6

I'm looking for a way to extract the text of tables in a PDF document using Form Recognizer. I tried training a custom model with labels defined in the OCR labeling tool. However, the accuracy I get is only ~30%, which is very low. A sample image of the table is attached (please ignore the red ovals).

Is there a possibility to extract the content of the above table image into a .csv file?

2 answers

Answer 1

GiftA-MSFT 11,176 Moderator

Hi, thanks for reaching out. Currently, the supported output format is JSON. However, you can try converting the output to a pandas DataFrame and exporting it to CSV, as shown in this example, or check out other resources available online. Hope this helps.
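A rough sketch of that JSON-to-CSV step, under some assumptions: each table in the returned JSON is taken to be a flat `cells` list whose entries carry `rowIndex`, `columnIndex`, and the cell text under `text` (or `content` in newer API versions), so please check those key names against your actual response. `table_to_rows`, `rows_to_csv`, and `sample_table` are made-up names and data for illustration only. It sticks to the standard library; the same grid could be handed to a pandas DataFrame instead.

```python
import csv
import io


def table_to_rows(table):
    """Rebuild a 2-D grid of strings from a table's flat cell list."""
    cells = table.get("cells", [])
    if not cells:
        return []
    n_rows = max(c["rowIndex"] for c in cells) + 1
    n_cols = max(c["columnIndex"] for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        # Assumption: the text lives under "text" or, failing that, "content".
        grid[c["rowIndex"]][c["columnIndex"]] = c.get("text", c.get("content", ""))
    return grid


def rows_to_csv(rows):
    """Serialize the grid as CSV text (quoting is handled by the csv module)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical stand-in for one table taken from the service's JSON output.
sample_table = {
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "text": "Service"},
        {"rowIndex": 0, "columnIndex": 1, "text": "Amount"},
        {"rowIndex": 1, "columnIndex": 0, "text": "Cleaning"},
        {"rowIndex": 1, "columnIndex": 1, "text": "120"},
    ]
}

if __name__ == "__main__":
    print(rows_to_csv(table_to_rows(sample_table)))
```

Merged cells (row or column spans) are not handled here; a spanning cell is simply written at its top-left position.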

Answer 2

That image should be pretty easy to process using Form Recognizer. You will need a minimum of 5 images to train the model (more will give better accuracy). Tag only the data elements you need (not the field labels). If you have repeating groups such as services, tag each one with a separate tag (service1, service2, service3, etc.) and do the same for event, etc. You will then be able to see the accuracy grow as you add more tags across more invoices. I would create a simple Azure Function that monitors Blob Storage and, on receipt of a new file, submits it to Form Recognizer, takes the returned JSON output, and writes a CSV back to Blob Storage (you may be able to write less code using a Logic App).
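To illustrate how numbered tags like service1, service2 could be turned back into table rows before writing the CSV, here is a minimal sketch. It assumes the custom model's JSON gives a `fields` mapping from tag name to an object with a `text` key (possibly null when a tag was not found); that shape, the helper name `fields_to_rows`, and the sample values are all assumptions for illustration, not taken from a real response. The Azure Function and Blob Storage wiring is left out.

```python
import re

# Tag names made of a column name followed by a row number, e.g. "service2".
TAG_PATTERN = re.compile(r"^([A-Za-z_]+)(\d+)$")


def fields_to_rows(fields):
    """Group numbered tags into rows: a header row, then one row per number."""
    columns = []   # column names in first-seen order
    groups = {}    # row number -> {column name: text}
    for name, value in fields.items():
        match = TAG_PATTERN.match(name)
        if not match:
            continue  # unnumbered tags (e.g. "total") are skipped here
        column, index = match.group(1), int(match.group(2))
        if column not in columns:
            columns.append(column)
        # A tag that was not found may come back as null/None.
        groups.setdefault(index, {})[column] = (value or {}).get("text", "")
    rows = [columns]
    for index in sorted(groups):
        rows.append([groups[index].get(col, "") for col in columns])
    return rows


# Hypothetical extracted fields for one document.
sample_fields = {
    "service1": {"text": "Catering"},
    "event1": {"text": "Gala"},
    "service2": {"text": "Audio"},
    "event2": None,
    "total": {"text": "500"},
}

if __name__ == "__main__":
    for row in fields_to_rows(sample_fields):
        print(row)
```

The resulting list of rows can be passed to `csv.writer(...).writerows(...)` to produce the CSV file.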
