Extract long text with headings from PDF as key-value pairs with Azure Form Recognizer

Suman Gautam (Slalom Consulting) 0 Reputation points
2023-09-12T18:25:11.4533333+00:00

I am analyzing PDF documents to extract text. These documents are reports, so the format is not like receipts with short text fields. Instead, there are headings and sub-headings followed by paragraphs. How can I extract this text into a dictionary-like format?
For example:

Key: Summary
Value: Azure Form Recognizer is a cognitive service that uses machine learning technology to identify and extract key/value pairs and table data from documents using a custom model consisting of five filled-in forms, or an empty form and two filled-in forms, without any human input. When you submit your input data, the algorithm clusters the forms by type, discovers what keys and tables are present, and associates values to keys and entries to tables. Azure Form Recognizer also has a pre-built model for reading sales receipts, which will not be discussed here.

Thank you

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. VasaviLankipalle-MSFT 18,676 Reputation points Moderator
    2023-09-13T22:17:48.3966667+00:00

    Hello @Suman Gautam (Slalom Consulting), thanks for using the Microsoft Q&A platform.

    The layout model analyzes a document and extracts titles, headings, and paragraphs. What you are asking for is a key-value pairing between each heading and the paragraph text that follows it.
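
One way to get that pairing is to post-process the layout output: walk the paragraphs in reading order and attach each body paragraph to the most recent heading. The following is a minimal sketch, not an official recipe. It assumes each paragraph exposes `role` and `content` attributes, as the paragraphs in `result.paragraphs` do when analyzing with the `prebuilt-layout` model through the `azure-ai-formrecognizer` Python SDK. The function name `paragraphs_to_sections`, the `_preamble` key, and the sample data are hypothetical.

```python
from types import SimpleNamespace

# Paragraph roles treated as headings, and roles dropped as page furniture.
HEADING_ROLES = {"title", "sectionHeading"}
SKIPPED_ROLES = {"pageHeader", "pageFooter", "pageNumber", "footnote"}


def paragraphs_to_sections(paragraphs, preamble_key="_preamble"):
    """Group body paragraphs under the most recent heading.

    `paragraphs` is any iterable of objects with `role` and `content`
    attributes, in reading order. Text that appears before the first
    heading is stored under `preamble_key` (a hypothetical name).
    Repeated headings are merged into a single entry.
    """
    sections = {}
    current = preamble_key
    for p in paragraphs:
        role = getattr(p, "role", None)
        text = (p.content or "").strip()
        if not text or role in SKIPPED_ROLES:
            continue
        if role in HEADING_ROLES:
            current = text
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(text)
    return {heading: "\n".join(body) for heading, body in sections.items()}


# Stand-in objects that mimic the shape of layout paragraphs (sample data).
sample = [
    SimpleNamespace(role="title", content="Quarterly Report"),
    SimpleNamespace(role="sectionHeading", content="Summary"),
    SimpleNamespace(role=None, content="Revenue grew in all regions."),
    SimpleNamespace(role=None, content="Costs were flat."),
]
sections = paragraphs_to_sections(sample)
print(sections["Summary"])
```

With a real analysis result, the same call would be `paragraphs_to_sections(result.paragraphs)`. How reliably headings are tagged with `sectionHeading` depends on the document, so it is worth inspecting the roles on a few sample reports first.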

    Key-value pair extraction works differently, as described for the General document model: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-general-document?view=doc-intel-3.1.0#key-value-pairs

    Key-value pairs are supported by prebuilt models such as the General document model. With custom models, you can train a model on a labeled dataset of documents so that it extracts the values you define.
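
For comparison, the key-value pairs returned by the General document model can be flattened into a dictionary as sketched below. This assumes each pair has `key`, `value`, and `confidence` attributes, where `key` and `value` carry a `content` string and `value` may be missing, which is the shape of `result.key_value_pairs` in the `azure-ai-formrecognizer` Python SDK. The function name `key_value_pairs_to_dict` and the `min_confidence` threshold are hypothetical.

```python
from types import SimpleNamespace


def key_value_pairs_to_dict(pairs, min_confidence=0.0):
    """Flatten key-value pair objects into a plain dict.

    Pairs below `min_confidence` are skipped, and a key without a
    detected value maps to an empty string. If a key occurs more than
    once, the last occurrence wins.
    """
    out = {}
    for pair in pairs:
        if pair.key is None or (pair.confidence or 0.0) < min_confidence:
            continue
        key = pair.key.content.strip()
        value = pair.value.content.strip() if pair.value else ""
        out[key] = value
    return out


# Stand-in objects that mimic the shape of the SDK's pairs (sample data).
sample_pairs = [
    SimpleNamespace(
        key=SimpleNamespace(content="Report date:"),
        value=SimpleNamespace(content="2023-09-12"),
        confidence=0.93,
    ),
    SimpleNamespace(
        key=SimpleNamespace(content="Reviewed by:"),
        value=None,
        confidence=0.71,
    ),
]
print(key_value_pairs_to_dict(sample_pairs, min_confidence=0.5))
```

These pairs are typically short label and value fields found on the page, so a heading followed by several paragraphs will usually not come back as a single pair.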

    I believe these are the models that can extract key-value pairs. May I know which model you are planning to use?

    I hope this helps.

    Regards,
    Vasavi

    Please accept the answer and vote 'Yes' if you found it helpful, to support the community. Thanks.
