How to read text from a document while skipping table data using the Document Intelligence prebuilt Layout model?

Question

Hello,

We want to pass in a PDF document that contains paragraphs, tables, and images, and extract only the paragraph data using the Document Intelligence prebuilt layout model. Currently the output JSON contains the table data as well. Is there any way to get just the text while skipping the table data? In short, from the entire PDF we want to extract only the paragraph lines and not the data from tables. Example: [User's image]

Expected output: Document Intelligence should return only the text below: [User's image]

Accepted Answer

Hi @Radhika Jagtap,

Thank you for reaching out to Microsoft Q&A forum!
As I understand your query, you want to extract only the paragraph data from a PDF document and skip the table data using the Document Intelligence prebuilt layout model. However, the prebuilt layout model may not be able to provide the desired output, because it extracts all layout elements: tables, selection marks, titles, section headings, and more. In this case, I would suggest using a custom model. With a custom model, you can train the service to recognize only the paragraph data in your PDF documents and skip the table data, as below:
[image]
For more information, please see the Custom Model documentation.

I hope this helps. Please feel free to reach out if you have any further questions or if there's anything else I can assist you with.

If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".
