How to read text from a document skipping tables data using Document Intelligence Prebuilt Layout Model?

Radhika Jagtap 20 Reputation points


We want to pass in a PDF document that contains paragraphs, tables and images, and extract only the paragraph data using the Document Intelligence prebuilt layout model. Currently the JSON output contains the table data as well. Is there any way to get just the text and skip the table data? In short, from the entire PDF we want to extract only the paragraph lines and not the data from tables. Example: [image]

Expected output: Document Intelligence should return only the text below as output: [image]

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

Accepted answer
  1. santoshkc 1,980 Reputation points Microsoft Vendor

    Hi @Radhika Jagtap,

    Thank you for reaching out to Microsoft Q&A forum!
    As I understand it, you want to extract only the paragraph data from a PDF document and skip the table data using the Document Intelligence prebuilt layout model. However, the prebuilt layout model may not be able to provide the desired output on its own, because it extracts all layout elements, such as tables, selection marks, titles, section headings, and more. In this case, I would suggest using a custom model. With a custom model, you can train the service to recognize only the paragraph data in your PDF documents and skip the table data, as below:
    [image]
    For more information, please see the Custom Model documentation.
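
If training a custom model is more than you need, one possible workaround is to post-process the prebuilt layout JSON yourself. The sketch below assumes the analyze result exposes `paragraphs` and `tables` collections, each carrying `spans` with an `offset` and `length` into the full `content` string, and that table cell text also shows up in `paragraphs`. The function name `paragraphs_outside_tables` and the sample data are made up for illustration and are not part of the service.

```python
def paragraphs_outside_tables(analyze_result):
    """Return the text of paragraphs whose spans do not overlap any table span.

    `analyze_result` is assumed to be the layout result as a plain dict
    (for example, the JSON response loaded with json.load).
    """
    # Collect every table span as a half-open (start, end) character range.
    table_ranges = [
        (span["offset"], span["offset"] + span["length"])
        for table in analyze_result.get("tables", [])
        for span in table.get("spans", [])
    ]

    def overlaps_table(span):
        start = span["offset"]
        end = start + span["length"]
        return any(start < t_end and t_start < end for t_start, t_end in table_ranges)

    # Keep a paragraph only if none of its spans touches a table range.
    return [
        paragraph["content"]
        for paragraph in analyze_result.get("paragraphs", [])
        if not any(overlaps_table(s) for s in paragraph.get("spans", []))
    ]


# Hand-made sample that mimics the assumed shape of the layout output.
sample = {
    "content": "Intro text. Name Age Closing text.",
    "paragraphs": [
        {"content": "Intro text.", "spans": [{"offset": 0, "length": 11}]},
        {"content": "Name", "spans": [{"offset": 12, "length": 4}]},
        {"content": "Age", "spans": [{"offset": 17, "length": 3}]},
        {"content": "Closing text.", "spans": [{"offset": 21, "length": 13}]},
    ],
    "tables": [{"spans": [{"offset": 12, "length": 8}]}],
}

print(paragraphs_outside_tables(sample))
```

Matching on character spans rather than on text avoids dropping a paragraph that merely repeats a value found in a table. It is worth checking the field names against the JSON your API version actually returns before relying on this.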

    I hope this helps. Please feel free to reach out if you have any further questions or if there is anything else I can assist you with.

    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful?".
