Thanks for reaching out. Azure AI provides the Document Intelligence service, which can extract text and structure from scanned PDF or PNG files. However, it does not directly support converting these files into an editable PDF while retaining the original structure. Instead, it returns the extracted data as JSON that includes the text, bounding-box coordinates, and confidence scores.
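As a rough illustration of that JSON, the sketch below walks a hand-written sample shaped like a v3-style analyze result (pages containing words, each with content, polygon, confidence, and span). The sample values and the helper name `iter_words` are made up for illustration, so please check the field names against the API version you actually call.

```python
import json

# Hand-written sample that mimics the assumed response shape.
# The numbers are illustrative only, not real service output.
SAMPLE = json.loads("""
{
  "analyzeResult": {
    "content": "Name: Jane Doe",
    "pages": [
      {
        "pageNumber": 1,
        "width": 8.5,
        "height": 11.0,
        "unit": "inch",
        "words": [
          {"content": "Name:",
           "polygon": [1.0, 1.0, 1.5, 1.0, 1.5, 1.25, 1.0, 1.25],
           "confidence": 0.99,
           "span": {"offset": 0, "length": 5}},
          {"content": "Jane",
           "polygon": [1.75, 1.0, 2.25, 1.0, 2.25, 1.25, 1.75, 1.25],
           "confidence": 0.97,
           "span": {"offset": 6, "length": 4}},
          {"content": "Doe",
           "polygon": [2.5, 1.0, 3.0, 1.0, 3.0, 1.25, 2.5, 1.25],
           "confidence": 0.62,
           "span": {"offset": 11, "length": 3}}
        ]
      }
    ]
  }
}
""")


def iter_words(result, min_confidence=0.0):
    """Yield (page_number, text, polygon, confidence) for each recognized word.

    Words below min_confidence are skipped so they can be reviewed manually.
    """
    for page in result["analyzeResult"]["pages"]:
        for word in page.get("words", []):
            if word["confidence"] >= min_confidence:
                yield (page["pageNumber"], word["content"],
                       word["polygon"], word["confidence"])


words = list(iter_words(SAMPLE, min_confidence=0.9))
for page_number, text, polygon, confidence in words:
    print(page_number, text, polygon, confidence)
```

Filtering on confidence is optional. It is shown here only because low-confidence words are the ones most likely to need a manual check before they go into the new PDF.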
To achieve your goal of creating an editable PDF, you would likely need to combine Azure's OCR capabilities with a PDF generation library. Here's a simplified workflow:
- OCR: Use Document Intelligence (formerly named Form Recognizer) to extract the text and its position from the scanned PDF or PNG.
- Redaction: Identify the PII in the extracted text and mark it for redaction. You can use Azure Text Analytics (now part of Azure AI Language) for this, which provides a pre-trained PII recognition model.
- PDF Generation: Use a PDF generation library to create a new PDF, positioning each piece of text using the coordinates returned by the OCR step. Some popular JavaScript libraries for this are jsPDF and PDFKit.
- Make PDF Editable: To make the generated PDF editable, you would need to add form fields at the appropriate positions. This might be challenging, as it requires working out where these fields should go based on the positions of the surrounding text.
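The positioning part of the workflow above mostly comes down to a unit conversion. Here is a minimal sketch, assuming each word polygon is a flat list of four corner points `[x1, y1, ..., x4, y4]` measured from the top-left of the page, in inches for PDF input or pixels for image input. The function name, the `dpi` parameter, and the `origin` switch are my own assumptions for illustration, not part of any Azure SDK.

```python
POINTS_PER_INCH = 72.0  # default PDF user space is 72 points per inch


def polygon_to_points_rect(polygon, page_height, unit="inch",
                           dpi=72.0, origin="top-left"):
    """Convert a flat 4-corner polygon into (x, y, width, height) in points.

    polygon     -- [x1, y1, x2, y2, x3, y3, x4, y4], top-left page origin
    page_height -- page height in the same unit as the polygon
    unit        -- "inch" or "pixel" (pixel input needs an assumed dpi)
    origin      -- "top-left" keeps y measured from the top of the page;
                   "bottom-left" flips y for libraries that use the native
                   PDF coordinate system
    """
    if unit == "inch":
        scale = POINTS_PER_INCH
    elif unit == "pixel":
        scale = POINTS_PER_INCH / dpi
    else:
        raise ValueError(f"unsupported unit: {unit!r}")

    xs, ys = polygon[0::2], polygon[1::2]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)

    if origin == "top-left":
        y = top
    elif origin == "bottom-left":
        # Distance from the bottom edge of the page to the bottom of the box.
        y = page_height - bottom
    else:
        raise ValueError(f"unsupported origin: {origin!r}")

    return (left * scale, y * scale,
            (right - left) * scale, (bottom - top) * scale)


word_polygon = [1.0, 1.0, 1.5, 1.0, 1.5, 1.25, 1.0, 1.25]
print(polygon_to_points_rect(word_polygon, page_height=11.0))
print(polygon_to_points_rect(word_polygon, page_height=11.0,
                             origin="bottom-left"))
```

The same rectangle could serve as a starting position for a form field. Please check which origin your PDF library uses before picking `origin`. For image input the resolution is not part of the polygon, so the `dpi` value is something you would have to supply yourself.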
Please note that this is a non-trivial task and might require a significant amount of development work. The results might also not be as good as those from specialized software such as Adobe Acrobat Pro, especially when it comes to preserving the original layout and formatting.
As for redaction, Azure does not currently provide a built-in way to redact information in images or PDFs. You would need to implement this functionality yourself, for example, by drawing black boxes over the sensitive information in the generated PDF.
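One way to decide where those boxes go is to match the character offsets of the detected PII entities against the character spans of the OCR words. The sketch below assumes that each OCR word carries a `span` with `offset` and `length`, that each PII entity carries `offset` and `length`, and that both are measured against the same text with the same string indexing. The sample data and the function name are hypothetical.

```python
def words_to_redact(words, entities):
    """Return the OCR words whose character span overlaps any PII entity."""
    hits = []
    for word in words:
        w_start = word["span"]["offset"]
        w_end = w_start + word["span"]["length"]
        for entity in entities:
            e_start = entity["offset"]
            e_end = e_start + entity["length"]
            # Two half-open ranges overlap when each starts before the other ends.
            if w_start < e_end and e_start < w_end:
                hits.append(word)
                break
    return hits


# Illustrative data for the text "Name: Jane Doe" (not real service output).
ocr_words = [
    {"content": "Name:", "span": {"offset": 0, "length": 5},
     "polygon": [1.0, 1.0, 1.5, 1.0, 1.5, 1.25, 1.0, 1.25]},
    {"content": "Jane", "span": {"offset": 6, "length": 4},
     "polygon": [1.75, 1.0, 2.25, 1.0, 2.25, 1.25, 1.75, 1.25]},
    {"content": "Doe", "span": {"offset": 11, "length": 3},
     "polygon": [2.5, 1.0, 3.0, 1.0, 3.0, 1.25, 2.5, 1.25]},
]
pii_entities = [
    {"text": "Jane Doe", "category": "Person", "offset": 6, "length": 8},
]

to_redact = words_to_redact(ocr_words, pii_entities)
for word in to_redact:
    print(word["content"], word["polygon"])
```

Each returned polygon can then be filled with a black rectangle in the new PDF. Since the PDF is being generated from scratch anyway, it is safer to also leave those words out of the text you write, so that nothing can be copied from underneath the box.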
I hope this helps.
Regards,
Yutong