How to extract PDF attachments

KaziNad 51

I receive PDF files (electronic invoices) that contain embedded XML files, which I would like to extract and process in a Logic App. I could not find a way to extract files embedded in PDFs. Any ideas? Thanks.

ChaitanyaNaykodi-MSFT 23,031 Reputation points Microsoft Employee

2020-08-07T19:17:51.823+00:00

Hi @KaziNad, thank you for reaching out. Just a few questions: can you please tell us how the XML files are embedded in the PDF? Is it a URL that fetches the file when clicked, or does the PDF contain the file directly?
Meanwhile, we do provide Cloudmersive and Aquaforest PDF connectors for Logic Apps, which can be useful for parsing a PDF and getting the text in it. If the file needs to be fetched using a URL, you can extract the URL as mentioned above and perhaps use Docparser to fetch the file (this will only work if the files are stored under a publicly accessible URL).
Please let me know whether this helps in resolving the issue. I will be glad to continue our discussion.
KaziNad 51 Reputation points

2020-08-08T14:53:27.247+00:00

@ChaitanyaNaykodi-MSFT: The PDF contains the file directly, not a URL. See this screenshot with Firefox PDF viewer:
ChaitanyaNaykodi-MSFT 23,031 Reputation points Microsoft Employee

2020-08-11T17:33:52.87+00:00

@KaziNad, thank you for the reply. I am trying the scenario out myself and will respond soon.
ChaitanyaNaykodi-MSFT 23,031 Reputation points Microsoft Employee

2020-08-13T20:37:56.667+00:00

Hello @KaziNad, just following up here to see whether my response below was helpful.

Accepted answer

ChaitanyaNaykodi-MSFT 23,031 Reputation points Microsoft Employee

2020-08-12T19:19:00.787+00:00

Hello @KaziNad, sorry for the delay in my response. Currently, none of the PDF connectors for Logic Apps support extracting attached files from a PDF document. An alternative way to extract the attached XML file is to integrate a Function app into your Logic App. You can find more information here about how to call a Function app from your Logic App. We found this thread, which we think might be helpful for implementing the C# code required to extract the XML attachment from the PDF.
Please let me know if you need any additional information. I will be glad to continue our discussion.
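To make the Function app suggestion more concrete, here is a minimal sketch of an HTTP-triggered function that receives the PDF bytes in the request body and returns every embedded file as Base64. It is only an illustration under several assumptions: it uses the in-process Azure Functions model, it relies on the Spire.PDF library mentioned elsewhere in this thread (any PDF library that exposes attachments would work), and the function name `ExtractPdfAttachments` and the `AttachmentDto` response shape are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Spire.Pdf;
using Spire.Pdf.Attachments;

public static class ExtractPdfAttachments // hypothetical function name
{
    // Response item returned to the Logic App; this shape is an assumption.
    public class AttachmentDto
    {
        public string FileName { get; set; }
        public string ContentBase64 { get; set; }
    }

    [FunctionName("ExtractPdfAttachments")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req)
    {
        // Buffer the request body so the PDF library gets a seekable stream.
        using var buffer = new MemoryStream();
        await req.Body.CopyToAsync(buffer);
        if (buffer.Length == 0)
        {
            return new BadRequestObjectResult("Request body must contain the PDF bytes.");
        }
        buffer.Position = 0;

        var pdf = new PdfDocument();
        pdf.LoadFromStream(buffer);

        // Collect every embedded file as name + Base64 content.
        var result = new List<AttachmentDto>();
        foreach (PdfAttachment attachment in pdf.Attachments)
        {
            result.Add(new AttachmentDto
            {
                // Keep only the file name in case the PDF stores a full path.
                FileName = Path.GetFileName(attachment.FileName),
                ContentBase64 = Convert.ToBase64String(attachment.Data)
            });
        }
        pdf.Close();

        return new OkObjectResult(result);
    }
}
```

In the Logic App, the function's JSON response could then be looped over, decoding each `ContentBase64` value (for example with the `base64ToBinary()` expression) before processing the XML.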

Additional answers

Vishant Pandey 6 Reputation points

2021-07-20T04:55:24.947+00:00

1. Install the IronPdf NuGet package into your project.
2. Follow the link: reading-pdf-text

// Full path of your input PDF file
PdfDocument pdf = PdfDocument.FromFile(@"D:\demoSp.pdf");
// ExtractAllText returns the text content of the document
FileContent.Text = pdf.ExtractAllText();

1 person found this answer helpful.

Ezreal95 1 Reputation point

2021-01-12T09:48:26.33+00:00

You could try the Spire.PDF library to extract attachments from a PDF using C#.

using System.IO;
using Spire.Pdf;
using Spire.Pdf.Attachments;

// Load the PDF
PdfDocument pdf = new PdfDocument("Attachment1.pdf");
// Write each embedded attachment to a file in the working directory
foreach (PdfAttachment attachment in pdf.Attachments)
{
    File.WriteAllBytes(Path.GetFileName(attachment.FileName), attachment.Data);
}
