Extract data from PDF

APPLIES TO: Azure Data Factory, Azure Synapse Analytics

This article describes a solution template that you can use to extract data from a PDF source using Azure Data Factory and Form Recognizer.

About this solution template

This template analyzes data from a PDF URL source using two Azure Form Recognizer calls. It then transforms the output into readable tables in a data flow and writes the data to a storage sink.

This template contains two activities:

  • Web Activity to call Azure Form Recognizer's layout model API
  • Data flow to transform extracted data from PDF
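
To illustrate the kind of reshaping the data flow step performs, here is a minimal Python sketch that rebuilds tables from a layout result. It assumes the v2.1 layout response shape, in which tables appear under analyzeResult.pageResults[].tables[] and each cell carries rowIndex, columnIndex, and text. The function name and the sample payload are hypothetical and hand-written for illustration; this is not the template's actual data flow logic.

```python
from typing import Any, Dict, List


def tables_from_layout_result(result: Dict[str, Any]) -> List[List[List[str]]]:
    """Rebuild each detected table as a list of rows of cell text."""
    tables = []
    for page in result.get("analyzeResult", {}).get("pageResults", []):
        for table in page.get("tables", []):
            # Start with an empty grid sized from the reported dimensions.
            grid = [[""] * table["columns"] for _ in range(table["rows"])]
            for cell in table["cells"]:
                grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
            tables.append(grid)
    return tables


# Hand-written sample shaped like a v2.1 layout response (not real output).
sample = {
    "status": "succeeded",
    "analyzeResult": {
        "pageResults": [
            {
                "page": 1,
                "tables": [
                    {
                        "rows": 2,
                        "columns": 2,
                        "cells": [
                            {"rowIndex": 0, "columnIndex": 0, "text": "Item"},
                            {"rowIndex": 0, "columnIndex": 1, "text": "Price"},
                            {"rowIndex": 1, "columnIndex": 0, "text": "Widget"},
                            {"rowIndex": 1, "columnIndex": 1, "text": "4.50"},
                        ],
                    }
                ],
            }
        ]
    },
}

print(tables_from_layout_result(sample))
# → [[['Item', 'Price'], ['Widget', '4.50']]]
```

Cells that the service does not report stay as empty strings, so every row keeps the same number of columns.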

This template defines four parameters:

  • FormRecognizerURL is the Form Recognizer URL ("https://{endpoint}/formrecognizer/v2.1/layout/analyze"). Replace {endpoint} with the endpoint that you obtained with your Form Recognizer subscription. You need to replace the default value with your own URL.
  • FormRecognizerKey is the Form Recognizer subscription key. You need to replace the default value with your own subscription key.
  • PDF_SourceURL is the URL of your PDF source. You need to replace the default value with your own URL.
  • outputFolder is the folder path in your destination store where you want the output files to be written. You need to replace the default value with your own folder path.
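
As a rough illustration of how the first three parameters fit together, the following Python sketch builds (but does not send) the kind of HTTP request a call to the v2.1 layout analyze endpoint takes: a POST with the subscription key in the Ocp-Apim-Subscription-Key header and the PDF URL in a JSON body under "source". The function name, the example endpoint, and the placeholder values are assumptions for illustration only; the Web Activity in the template may be configured differently.

```python
import json
import urllib.request


def build_analyze_request(
    form_recognizer_url: str,
    form_recognizer_key: str,
    pdf_source_url: str,
) -> urllib.request.Request:
    """Build the POST request for the layout analyze call without sending it."""
    body = json.dumps({"source": pdf_source_url}).encode("utf-8")
    return urllib.request.Request(
        form_recognizer_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": form_recognizer_key,
        },
    )


# Placeholder values; substitute your own endpoint, key, and PDF URL.
request = build_analyze_request(
    "https://example-resource.cognitiveservices.azure.com/formrecognizer/v2.1/layout/analyze",
    "<your-subscription-key>",
    "https://example.com/sample.pdf",
)
print(request.get_method(), request.full_url)
```

In the v2.1 REST API the analyze operation is asynchronous: the POST returns an Operation-Location header, and a follow-up GET to that URL with the same key header returns the analysis once its status is "succeeded".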

Prerequisites

  • Azure Form Recognizer Resource Endpoint URL and Key (create a new resource here)

How to use this solution template

  1. Go to the Extract data from PDF template. Create a New connection to your Form Recognizer resource or choose an existing connection.

    Screenshot of how to create a new connection or select an existing connection from a drop down menu to Form Recognizer in template set up.

    In your connection to Form Recognizer, make sure to add a Linked service Parameter. You will need to use this parameter as your dynamic Base URL.

    Screenshot of where to add your Form Recognizer linked service parameter.

    Screenshot of the linked service base URL that references the linked service parameter.

  2. Create a New connection to your destination data store or choose an existing connection.

    Screenshot of how to create a new connection or select existing connection from a drop down menu to your sink in template set up.

  3. Select Use this template.

    Screenshot of how to complete the template by clicking use this template at the bottom of the screen.

  4. You should see the following pipeline:

    Screenshot of pipeline view with web activity linking to a dataflow activity.

  5. Select Debug.

    Screenshot of how to Debug pipeline using the debug button on the top banner of the screen.

  6. Enter parameter values, review results, and publish.

    Screenshot of where to enter pipeline debug parameters on a panel to the right.

    Screenshot of the results that return when the pipeline is triggered.

Next steps