PDF actions
PDF actions enable you to extract images, text, and tables from PDF files, and arrange pages to create new documents.
To extract text from a PDF file, use the Extract text from PDF action. You can limit the extraction to a specific range of pages, and for a password-protected file you specify the password in the Advanced settings.
To extract text arranged in tabular form, enable Optimize for structured data to improve the format and accuracy of the results.
To extract tables from a PDF file, use the Extract tables from PDF action, select the file, and specify the pages to extract from.
The action produces the ExtractedPDFTables variable that contains a list of PDF table info. To find information about this type of list, go to Advanced data types.
Note
- The Extract tables from PDF action doesn't use Optical Character Recognition (OCR), so you can't extract non-copyable text from scanned PDFs.
- The library behind the action occasionally extracts additional PDF content that isn't a table. This behavior minimizes the risk of accidentally omitting a real table.
Apart from extracting information from PDF files, you can create a new PDF document from an existing file using the Extract PDF file pages to new PDF file action.
In the Page selection field, you can combine specific pages with a range of pages, for example 1,3,17-24.
Extract text from PDF
You can extract text from a PDF file by using the Extract text from PDF action. In the action properties, you can define the source PDF file and the pages to extract text from. Under the advanced action properties, you can define a password in case the PDF file is protected, and whether the engine should optimize for structured data.
Input parameters
Argument | Optional | Accepts | Default Value | Description |
---|---|---|---|---|
PDF file | No | File | | The PDF file to extract text from. Enter a file path, a variable containing a file or a text path |
Page(s) to extract | N/A | All, Single, Range | All | Specifies how many pages to extract: All pages, a single page or a range of pages |
Single page number | No | Numeric value | | The number of the single page to extract text from |
From page number | No | Numeric value | | The first page number from the range of pages to extract text from |
To page number | No | Numeric value | | The last page number from the range of pages to extract text from |
Password | Yes | Direct encrypted input or Text value | | The password of the PDF file. Leave this blank if the PDF isn't password protected |
Optimize for structured data | N/A | Boolean value | False | Specify whether to detect formatted layout in the document and extract text accordingly |
Variables produced
Argument | Type | Description |
---|---|---|
ExtractedPDFText | Text value | The extracted text |
Exceptions
Exception | Description |
---|---|
PDF file doesn't exist | File doesn't exist on the given path |
Invalid password | The given password is invalid |
Failed to extract text | Error while trying to extract text |
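The three Page(s) to extract options (All, Single, Range) can be read as three ways of producing a list of page numbers. The following Python sketch illustrates that reading only; it isn't part of the action. The function name `resolve_pages` and the assumption that pages are numbered from 1 are hypothetical.

```python
from typing import List, Optional


def resolve_pages(mode: str, total_pages: int,
                  single: Optional[int] = None,
                  first: Optional[int] = None,
                  last: Optional[int] = None) -> List[int]:
    """Return the page numbers selected by a Page(s) to extract setting.

    Assumption: pages are numbered from 1 to total_pages.
    """
    if mode == "All":
        pages = list(range(1, total_pages + 1))
    elif mode == "Single":
        if single is None:
            raise ValueError("Single page number is required")
        pages = [single]
    elif mode == "Range":
        if first is None or last is None:
            raise ValueError("From and To page numbers are required")
        if first > last:
            raise ValueError("From page number is greater than To page number")
        pages = list(range(first, last + 1))
    else:
        raise ValueError(f"Unknown mode: {mode!r}")

    # Reject any page that falls outside the document.
    for page in pages:
        if not 1 <= page <= total_pages:
            raise ValueError(f"Page {page} is outside 1..{total_pages}")
    return pages
```

For instance, `resolve_pages("Range", 10, first=2, last=4)` selects pages 2 through 4 of a ten-page file.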
Extract tables from PDF
You can extract tables that are contained in a PDF file by using the Extract tables from PDF action. In the action properties, you can define the PDF file and the range of pages to extract tables from. Under the advanced action properties, you can define a password in case the PDF file is protected, whether the first line of a table contains column names, and whether tables that cross page margins should be merged.
Input parameters
Argument | Optional | Accepts | Default Value | Description |
---|---|---|---|---|
PDF file | No | File | | The PDF file to extract tables from. Enter a file path, a variable containing a file or a text path |
Page(s) to extract | N/A | All, Single, Range | All | Specifies how many pages to extract tables from: all pages, a single page or a range of pages |
Single page number | No | Numeric value | | The number of the single page to extract tables from |
From page number | No | Numeric value | | The first page number from the range of pages to extract tables from |
To page number | No | Numeric value | | The last page number from the range of pages to extract tables from |
Password | Yes | Direct encrypted input or Text value | | The password of the PDF file. Leave this blank if the PDF isn't password protected |
Merge tables that cross page margins | N/A | Boolean value | True | Specifies whether to merge tables that cross page margins in the specified page range |
First line contains column names | N/A | Boolean value | True | Specifies whether the first line of table contains column names |
Variables produced
Argument | Type | Description |
---|---|---|
ExtractedPDFTables | List of PDF table info | The extracted tables with their info as a list |
Exceptions
Exception | Description |
---|---|
PDF file doesn't exist | File doesn't exist on the given path |
Invalid password | The given password is invalid |
Failed to extract tables | Error while trying to extract tables |
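To picture what the First line contains column names option changes, consider a table as a list of rows. The Python sketch below is a conceptual illustration, not the action's implementation: the function `rows_to_records` and the fallback names `Column1`, `Column2`, and so on are hypothetical.

```python
from typing import Dict, List


def rows_to_records(rows: List[List[str]],
                    first_line_has_names: bool = True) -> List[Dict[str, str]]:
    """Turn raw table rows into records keyed by column name."""
    if not rows:
        return []
    if first_line_has_names:
        # The first row supplies the names and is not returned as data.
        names, data = rows[0], rows[1:]
    else:
        # Hypothetical fallback names; every row is treated as data.
        names = [f"Column{i}" for i in range(1, len(rows[0]) + 1)]
        data = rows
    return [dict(zip(names, row)) for row in data]
```

With the option enabled, the rows `[["Item", "Qty"], ["Pen", "2"]]` yield one record keyed by `Item` and `Qty`. With it disabled, both rows are kept as data under the generated names.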
Extract images from PDF
To extract images from a PDF file, you can use the Extract images from PDF action. In the action parameters, you can define the PDF file and the pages to extract images from, the naming convention of the extracted images, and the target location of the saved images. Under the advanced settings, you can also define a password if the PDF file is protected.
Input parameters
Argument | Optional | Accepts | Default Value | Description |
---|---|---|---|---|
PDF file | No | File | | The PDF file to extract images from. Enter a file path, a variable containing a file or a text path |
Password | Yes | Direct encrypted input or Text value | | The password of the PDF file. Leave this blank if the PDF isn't password protected |
Page(s) to extract | N/A | All, Single, Range | All | Specifies how many pages to extract: All pages, a single page or a range of pages |
Single page number | No | Numeric value | | The number of the single page to extract images from |
From page number | No | Numeric value | | The first page number from the range of pages to extract images from |
To page number | No | Numeric value | | The last page number from the range of pages to extract images from |
Image(s) name | No | Text value | | How the name of the image(s) starts. Extracted image(s) name example: GivenName_1, GivenName_2 |
Save image(s) to | No | Folder | | The folder to save the extracted images as png files |
Variables produced
This action doesn't produce any variables.
Exceptions
Exception | Description |
---|---|
Invalid password | The given password is invalid |
Failed to extract images | Indicates that an error occurred while extracting images from the given pages of the PDF |
Folder doesn't exist | Indicates that the folder doesn't exist |
PDF file doesn't exist | File doesn't exist on the given path |
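The naming convention described for Image(s) name (GivenName_1, GivenName_2) combined with the png output format suggests file names like the ones built below. This Python sketch only illustrates that pattern; the function `image_paths` is hypothetical, and the assumptions that numbering starts at 1 and that the extension is lowercase `.png` are inferred from the example names, not confirmed behavior.

```python
from pathlib import Path
from typing import List


def image_paths(folder: str, base_name: str, count: int) -> List[Path]:
    """Build the expected output paths for `count` extracted images.

    Assumption: images are numbered from 1 and saved with a .png extension.
    """
    return [Path(folder) / f"{base_name}_{i}.png" for i in range(1, count + 1)]
```

A later step in a flow could use a list like this to check which image files exist in the target folder.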
Extract PDF file pages to new PDF file
You can create a new PDF file by extracting pages from an existing PDF file with the Extract PDF file pages to new PDF file action. In the action parameters, you can define the PDF file to extract the pages from, the page(s) to extract, the location of the new PDF file, and what should happen if a file with the same name and extension already exists. Finally, under the advanced properties, you can define a password in case the source PDF is protected.
Input parameters
Argument | Optional | Accepts | Default Value | Description |
---|---|---|---|---|
PDF file | No | File | | The PDF file to extract pages from. Enter a file path, a variable containing a file or a text path |
Password | Yes | Direct encrypted input or Text value | | The password of the PDF file. Leave this blank if the PDF isn't password protected |
Page selection | No | Text value | | The index numbers of the pages to keep (for example, 1,3,17-24) |
Extracted PDF path | No | File | | The path to store the extracted PDF file |
If file exists | N/A | Overwrite, Don't overwrite, Add sequential suffix | Add sequential suffix | Specifies what to do in case the output PDF file already exists |
Variables produced
Argument | Type | Description |
---|---|---|
ExtractedPDF | File | The new PDF file |
Exceptions
Exception | Description |
---|---|
Invalid password | The given password is invalid |
PDF file doesn't exist | File doesn't exist on the given path |
Page out of bounds | Indicates that one or more pages are out of bounds of the PDF file |
Invalid page selection | Indicates that the given pages aren't valid for the PDF file |
Failed to extract new PDF | Indicates that an error occurred while trying to extract new PDF |
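A Page selection value such as 1,3,17-24 mixes single pages and ranges. The Python sketch below shows one way such a string could be expanded and checked. It is an illustration under assumptions, not the action's actual parser: the function `parse_page_selection` is hypothetical, pages are assumed to be numbered from 1, and the mapping of malformed input to `ValueError` and out-of-range pages to `IndexError` only loosely mirrors the Invalid page selection and Page out of bounds exceptions.

```python
from typing import List


def parse_page_selection(selection: str, total_pages: int) -> List[int]:
    """Expand a selection like '1,3,17-24' into a list of page numbers."""
    pages: List[int] = []
    for part in selection.split(","):
        bounds = [b.strip() for b in part.split("-")]
        # Each part must be 'N' or 'N-M' with whole, positive numbers.
        if len(bounds) > 2 or not all(b.isdecimal() for b in bounds):
            raise ValueError(f"Invalid page selection: {selection!r}")
        first, last = int(bounds[0]), int(bounds[-1])
        if first > last:
            raise ValueError(f"Invalid page selection: {selection!r}")
        if first < 1 or last > total_pages:
            raise IndexError(f"Page out of bounds in: {part.strip()!r}")
        pages.extend(range(first, last + 1))
    return pages
```

For a 30-page file, `parse_page_selection("1,3,17-24", 30)` keeps pages 1, 3, and 17 through 24, in the order written.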
Merge PDF files
Merges multiple PDF files into a new one.
You can use the Merge PDF files action to take two or more PDF files and merge them into a single file. The files to be merged can be provided either in the form of a list, or enclosed in double quotes and separated by a delimiter. You can also provide passwords for the PDF files, in case they are password-protected.
Input parameters
Argument | Optional | Accepts | Default Value | Description |
---|---|---|---|---|
PDF files | No | List of Files | | The files to merge. Enclose multiple files in double quotes (") and separate them by a delimiter, or use a list of files |
Merged PDF path | No | File | | The path to store the merged PDF |
If file exists | N/A | Overwrite, Don't overwrite, Add sequential suffix | Add sequential suffix | Specifies what to do in case the destination file already exists |
Passwords | Yes | Direct encrypted input or Text value | | The delimited passwords. The order should be the same as the order of the input PDFs. Leave this blank if the PDFs aren't password protected |
Delimiter | No | Text value | , | A custom password delimiter. This delimiter shouldn't be part of any of the passwords |
Variables produced
Argument | Type | Description |
---|---|---|
MergedPDF | File | The merged PDF file |
Exceptions
Exception | Description |
---|---|
PDF file doesn't exist | File doesn't exist on the given path |
Invalid password | The given password is invalid |
Failed to merge PDF files | Indicates that an error occurred while merging the files |
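Because the Passwords value is a single delimited string whose order must match the input files, it can help to think of it as a list paired with the file list. The Python sketch below illustrates that pairing only. The function `pair_files_with_passwords` is hypothetical, and two behaviors are assumptions rather than documented facts: that the number of passwords must equal the number of files, and that an empty entry stands for an unprotected file.

```python
from typing import List, Optional, Tuple


def pair_files_with_passwords(
    files: List[str], passwords: str = "", delimiter: str = ","
) -> List[Tuple[str, Optional[str]]]:
    """Pair each input PDF with its password, matching by position."""
    if not passwords:
        # Blank Passwords field: no file is treated as protected.
        return [(f, None) for f in files]
    parts = passwords.split(delimiter)
    # Assumption: one entry per file, in the same order as the files.
    if len(parts) != len(files):
        raise ValueError("Number of passwords doesn't match number of files")
    # Assumption: an empty entry means the file has no password.
    return [(f, p or None) for f, p in zip(files, parts)]
```

The pairing also shows why the delimiter shouldn't appear inside any password: a password containing the delimiter would be split into two entries and shift every later password to the wrong file.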