Document Intelligence scanned PDF or PNG to editable PDF

Question

glen sale 41

Hi, I am new to JavaScript and am trying to find a way to use Azure AI to read a scanned PDF and redact important PII using PII data labeling.

https://documentintelligence.ai.azure.com/studio

I noticed that Document Intelligence can do OCR, that is, read text from a scanned PDF.

But how do I use my scanned PDF or PNG to trigger OCR (Document Intelligence) and save the result as a new, editable PDF?

Right now I am only able to save my scanned PDF or PNG as a text file, which loses the actual structure of the PNG or PDF.
I was hoping the structure would remain the same, the way Adobe Pro does OCR, so that I could do the redacting once the PDF is editable.

glen sale 41 Reputation points

2024-02-28T21:07:34.07+00:00

Is Azure Form Recognizer now named Document Intelligence? I can't seem to find it in the Azure portal.
YutongTie-MSFT 53,976 Reputation points Moderator

2024-02-28T22:00:42.5933333+00:00

@glen sale Yes, Azure Form Recognizer was renamed to Azure AI Document Intelligence.

Answer 1

@glen sale

Thanks for reaching out to us. Azure AI provides the Document Intelligence service, which can extract text and structure from scanned PDF or PNG files. However, it does not directly support converting these files into an editable PDF while retaining the original structure. Instead, it returns the extracted data as JSON that includes the text, bounding box coordinates, and confidence scores.
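To make that JSON shape a little more concrete, here is a minimal sketch of walking such a result. The `sampleResult` object is a hand-written, hypothetical stand-in, not real service output; its field names (`pages`, `words`, `content`, `polygon`, `confidence`, `span`) follow the REST-style layout, and the client library you use may expose the same data in a slightly different shape (for example polygon points as objects rather than a flat array), so check what your client actually returns. `polygonToBox` and `listWords` are helper names made up for this example.

```javascript
// Hypothetical, hand-written stand-in for an analyze result. The field names follow
// the REST-style layout (pages -> words -> content / polygon / confidence / span);
// the values are made up for illustration.
const sampleResult = {
  content: "Name: Jane Doe",
  pages: [
    {
      pageNumber: 1,
      width: 8.5,
      height: 11,
      unit: "inch",
      words: [
        { content: "Name:", polygon: [1, 1, 1.375, 1, 1.375, 1.25, 1, 1.25], confidence: 0.99, span: { offset: 0, length: 5 } },
        { content: "Jane", polygon: [1.5, 1, 2, 1, 2, 1.25, 1.5, 1.25], confidence: 0.98, span: { offset: 6, length: 4 } },
        { content: "Doe", polygon: [2.125, 1, 2.5, 1, 2.5, 1.25, 2.125, 1.25], confidence: 0.97, span: { offset: 11, length: 3 } },
      ],
    },
  ],
};

// Collapse a flat polygon [x1, y1, x2, y2, x3, y3, x4, y4] into an axis-aligned box.
function polygonToBox(polygon) {
  const xs = polygon.filter((_, i) => i % 2 === 0);
  const ys = polygon.filter((_, i) => i % 2 === 1);
  const left = Math.min(...xs);
  const top = Math.min(...ys);
  return { left, top, width: Math.max(...xs) - left, height: Math.max(...ys) - top };
}

// Flatten every page into one list of words with position and confidence.
function listWords(result) {
  return result.pages.flatMap((page) =>
    page.words.map((word) => ({
      page: page.pageNumber,
      unit: page.unit,
      text: word.content,
      confidence: word.confidence,
      box: polygonToBox(word.polygon),
    }))
  );
}

for (const w of listWords(sampleResult)) {
  console.log(`p${w.page} "${w.text}" at (${w.box.left}, ${w.box.top}) ${w.unit}, confidence ${w.confidence}`);
}
```

Saving only the `content` string is what loses the layout; keeping each word's box alongside its text is what makes it possible to rebuild the page later.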

To achieve your goal of creating an editable PDF, you would likely need to combine Azure's OCR capabilities with a PDF generation library. Here's a simplified workflow:

1. OCR: Use Document Intelligence (formerly Form Recognizer) to extract the text and its position from the scanned PDF or PNG.
2. Redaction: Identify and redact PII in the extracted text. You can use Azure Text Analytics (now part of Azure AI Language) for this, which provides a pre-trained PII recognition model.
3. PDF Generation: Use a PDF generation library to create a new PDF. You can position the text based on the coordinates returned by Document Intelligence. Some popular JavaScript libraries for this are jsPDF and PDFKit.
4. Make PDF Editable: To make the generated PDF editable, you would need to add form fields at the appropriate positions. This might be challenging, as it requires working out where these fields should go based on the text's position.
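For the PDF generation step, the main piece of glue is turning the OCR coordinates into the units a PDF library expects. Below is a minimal sketch, assuming a page object with `unit` and `words[].polygon` as a flat array of eight numbers. `layoutWords` and `unitToPoints` are hypothetical helper names, the default of 96 DPI for image input is an arbitrary assumption you would replace with the real resolution of your scan, and using the box height as the font size is only a rough estimate.

```javascript
const POINTS_PER_INCH = 72; // a PDF point is 1/72 of an inch

// Scale factor from the page's unit to PDF points.
// Assumption: "inch" pages (PDF input) scale by 72; for "pixel" pages (image input)
// the caller supplies the image DPI, which this sketch cannot know.
function unitToPoints(page, dpi = 96) {
  return page.unit === "inch" ? POINTS_PER_INCH : POINTS_PER_INCH / dpi;
}

// Turn each OCR word into a drawing instruction measured from the top-left corner.
function layoutWords(page, dpi = 96) {
  const scale = unitToPoints(page, dpi);
  return page.words.map((word) => {
    const xs = word.polygon.filter((_, i) => i % 2 === 0);
    const ys = word.polygon.filter((_, i) => i % 2 === 1);
    const top = Math.min(...ys);
    const height = Math.max(...ys) - top;
    return {
      text: word.content,
      x: Math.min(...xs) * scale,
      y: top * scale,
      fontSize: height * scale, // rough estimate: box height used as font size
    };
  });
}

// Hypothetical one-word page, made up for illustration.
const demoPage = {
  width: 8.5,
  height: 11,
  unit: "inch",
  words: [{ content: "Invoice", polygon: [1, 1, 2, 1, 2, 1.25, 1, 1.25] }],
};

console.log(layoutWords(demoPage));
```

Each entry could then be handed to the library of your choice. With PDFKit, for example, something along the lines of `doc.fontSize(w.fontSize).text(w.text, w.x, w.y, { lineBreak: false })` should place a word, since PDFKit also measures from the top-left corner; treat that call as a starting point to check against the library's documentation rather than a tested recipe.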

Please note that this would be a non-trivial task and might require a significant amount of development work. It might also not provide results as good as specialized software like Adobe Pro, especially when it comes to preserving the original layout and formatting.

As for redaction, Azure does not currently provide a built-in way to redact information in images or PDFs. You would need to implement this functionality yourself, for example, by drawing black boxes over the sensitive information in the generated PDF.
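One way to work out where those black boxes go is to match the character offsets of the detected PII entities against the `span` of each OCR word. The sketch below assumes the PII service was given the same `content` string that the OCR step produced, so that both sets of offsets refer to the same text and are counted the same way; that is an assumption you would need to verify for non-ASCII text. `buildRedactionPlan` and `spansOverlap` are hypothetical names, and the sample page and entity are made up.

```javascript
// True when two [offset, offset + length) ranges share at least one character.
function spansOverlap(a, b) {
  return a.offset < b.offset + b.length && b.offset < a.offset + a.length;
}

// Split a page's words into text that is safe to keep and rectangles to black out.
// `scale` converts page units to PDF points (72 for pages measured in inches).
function buildRedactionPlan(page, piiEntities, scale = 72) {
  const keep = [];
  const boxes = [];
  for (const word of page.words) {
    const isPii = piiEntities.some((entity) => spansOverlap(word.span, entity));
    if (!isPii) {
      keep.push(word.content);
      continue;
    }
    const xs = word.polygon.filter((_, i) => i % 2 === 0);
    const ys = word.polygon.filter((_, i) => i % 2 === 1);
    const left = Math.min(...xs);
    const top = Math.min(...ys);
    boxes.push({
      x: left * scale,
      y: top * scale,
      width: (Math.max(...xs) - left) * scale,
      height: (Math.max(...ys) - top) * scale,
    });
  }
  return { keep, boxes };
}

// Hypothetical OCR page for the text "Name: Jane Doe", measured in inches.
const piiPage = {
  unit: "inch",
  words: [
    { content: "Name:", polygon: [1, 1, 1.375, 1, 1.375, 1.25, 1, 1.25], span: { offset: 0, length: 5 } },
    { content: "Jane", polygon: [1.5, 1, 2, 1, 2, 1.25, 1.5, 1.25], span: { offset: 6, length: 4 } },
    { content: "Doe", polygon: [2.125, 1, 2.5, 1, 2.5, 1.25, 2.125, 1.25], span: { offset: 11, length: 3 } },
  ],
};

// Hypothetical PII entity, shaped like a text / category / offset / length record.
const piiEntities = [{ text: "Jane Doe", category: "Person", offset: 6, length: 8 }];

const redactionPlan = buildRedactionPlan(piiPage, piiEntities);
console.log(redactionPlan.keep);  // words that can be written to the new PDF
console.log(redactionPlan.boxes); // rectangles, in points, to fill with black
```

The rectangles can then be filled by the PDF library, for instance `doc.rect(x, y, width, height).fill("black")` in PDFKit or `doc.rect(x, y, width, height, "F")` in jsPDF. Leaving the matched words out of the text you write to the new PDF matters as well, because a box drawn over text that is still in the file only hides it visually.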

I hope this helps.

Regards,

Yutong

Answer 2

glen sale 41

Here is the code I have so far. It won't let me insert it as a code block, so I pasted it as text:
ocr.txt
