Extract data from PDF
APPLIES TO: Azure Data Factory, Azure Synapse Analytics
This article describes a solution template that you can use to extract data from a PDF source using Azure Data Factory and Form Recognizer.
About this solution template
This template analyzes data from a PDF URL source using two Azure Form Recognizer calls. It then transforms the output into readable tables in a data flow and writes the data to a storage sink.
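As a rough illustration of what two Form Recognizer calls can look like, the sketch below builds the requests for the v2.1 layout API's asynchronous pattern: one POST that submits the PDF URL, and one GET against the URL returned in the `Operation-Location` response header. This is an assumption about how the calls map to the API, not the template's actual configuration. The function names are hypothetical.

```python
import json
import urllib.request


def build_analyze_request(form_recognizer_url: str, key: str, pdf_url: str) -> urllib.request.Request:
    """First call: submit the PDF URL to the layout model for analysis."""
    body = json.dumps({"source": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        form_recognizer_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": key,
        },
    )


def build_result_request(operation_location: str, key: str) -> urllib.request.Request:
    """Second call: fetch the analysis result from the Operation-Location URL."""
    return urllib.request.Request(
        operation_location,
        method="GET",
        headers={"Ocp-Apim-Subscription-Key": key},
    )
```

To run the calls outside the pipeline, pass each request to `urllib.request.urlopen`, read the `Operation-Location` header from the first response, and repeat the second request until the returned JSON reports a `status` of `succeeded` or `failed`.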
This template contains two activities:
- Web Activity to call Azure Form Recognizer's layout model API
- Data flow to transform the data extracted from the PDF
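To give a sense of the kind of transformation the data flow performs, here is a minimal sketch that rebuilds tables from a v2.1 layout result. It assumes the response carries tables under `analyzeResult.pageResults[].tables[]`, with each cell identified by `rowIndex` and `columnIndex`. The function name and sample payload are hypothetical, and the template's data flow may shape its output differently.

```python
from typing import Any


def tables_from_layout(result: dict[str, Any]) -> list[list[list[str]]]:
    """Rebuild each detected table as a list of rows of cell text."""
    tables = []
    for page in result.get("analyzeResult", {}).get("pageResults", []):
        for table in page.get("tables", []):
            # Start with an empty grid, then place each cell by its indices.
            grid = [["" for _ in range(table["columns"])] for _ in range(table["rows"])]
            for cell in table["cells"]:
                grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
            tables.append(grid)
    return tables


# Hypothetical, heavily trimmed layout result with one 2x2 table.
sample = {
    "status": "succeeded",
    "analyzeResult": {
        "pageResults": [
            {
                "page": 1,
                "tables": [
                    {
                        "rows": 2,
                        "columns": 2,
                        "cells": [
                            {"rowIndex": 0, "columnIndex": 0, "text": "Item"},
                            {"rowIndex": 0, "columnIndex": 1, "text": "Price"},
                            {"rowIndex": 1, "columnIndex": 0, "text": "Pen"},
                            {"rowIndex": 1, "columnIndex": 1, "text": "2.50"},
                        ],
                    }
                ],
            }
        ],
    },
}

print(tables_from_layout(sample))  # [[['Item', 'Price'], ['Pen', '2.50']]]
```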
This template defines four parameters:
- FormRecognizerURL is the Form Recognizer URL ("https://{endpoint}/formrecognizer/v2.1/layout/analyze"). Replace {endpoint} with the endpoint of your Form Recognizer resource, and replace the default value with the resulting URL.
- FormRecognizerKey is the Form Recognizer subscription key. Replace the default value with your own subscription key.
- PDF_SourceURL is the URL of your PDF source. Replace the default value with your own URL.
- outputFolder is the folder path in your destination store where the output files are written. Replace the default value with your own folder path.
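The following sketch shows one way to assemble the four parameter values before entering them in the pipeline. The endpoint host, PDF URL, and folder path are placeholder values chosen for illustration, and the helper name is hypothetical. Only the parameter names and the URL pattern come from the template.

```python
def form_recognizer_url(endpoint: str) -> str:
    """Fill the {endpoint} placeholder of the template's URL pattern."""
    # Accept the endpoint with or without a scheme and trailing slash.
    host = endpoint.removeprefix("https://").rstrip("/")
    return f"https://{host}/formrecognizer/v2.1/layout/analyze"


# Placeholder values; substitute your own resource details.
pipeline_parameters = {
    "FormRecognizerURL": form_recognizer_url("contoso-forms.cognitiveservices.azure.com"),
    "FormRecognizerKey": "<your-subscription-key>",
    "PDF_SourceURL": "https://example.com/sample.pdf",
    "outputFolder": "extracted/pdf-tables",
}
```

Avoid committing the real subscription key to source control; keep it in a secret store and supply it at run time.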
Prerequisites
- Azure Form Recognizer resource endpoint URL and key. Create a new Form Recognizer resource if you don't already have one.
How to use this solution template
Go to the Extract data from PDF template. Create a New connection to your Form Recognizer resource or choose an existing connection.
In your connection to Form Recognizer, make sure to add a Linked service Parameter. You use this parameter as your dynamic Base URL.
Create a New connection to your destination store or choose an existing connection.
Select Use this template.
You should see the pipeline, which contains the Web activity and the data flow.
Select Debug.
Enter parameter values, review results, and publish.