Document Intelligence: convert a scanned PDF or PNG to an editable PDF

glen sale 41 Reputation points
2024-02-27T22:27:54.18+00:00

Hi, I am new to JavaScript and am looking for ways to use Azure AI to read scanned PDFs and redact PII (personally identifiable information), using PII data labeling.

https://documentintelligence.ai.azure.com/studio

I noticed that Document Intelligence can run OCR, that is, read text from a scanned PDF.

But how do I use Document Intelligence to run OCR on my scanned PDF or PNG and save the result as a new, editable PDF?

Right now I can only save the text from my scanned PDF or PNG to a text file, which loses the original structure of the document.
I was hoping the structure would stay the same, the way Adobe Acrobat Pro handles OCR, so that I could redact once the PDF is editable.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.
Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

2 answers

  1. YutongTie-MSFT 50,811 Reputation points
    2024-02-28T00:59:41.1166667+00:00

    @glen sale

    Thanks for reaching out. The Azure AI Document Intelligence service can extract text and structure from scanned PDF or PNG files. However, it does not directly convert these files into an editable PDF that retains the original structure. Instead, it returns the extracted data as JSON, which includes the text, bounding-box coordinates, and confidence scores.
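
    To make the JSON output more concrete, here is a minimal sketch of walking a result and collecting each word with its confidence and bounding box. The `sampleResult` object is a hand-written, hypothetical sample; its field names (`pages`, `words`, `content`, `polygon` as a flat list of x/y pairs, `confidence`, `span`) follow the REST-style response as I understand it, so please verify them against your own output. The `listWords` helper is an illustrative name, not part of any SDK.

```javascript
// Hypothetical, trimmed-down analyze result. Verify the field names against
// the JSON your own Document Intelligence call returns.
const sampleResult = {
  content: "John Smith 555-0100",
  pages: [
    {
      pageNumber: 1,
      width: 8.5,
      height: 11,
      unit: "inch",
      words: [
        { content: "John", polygon: [1.0, 1.0, 1.4, 1.0, 1.4, 1.2, 1.0, 1.2], confidence: 0.99, span: { offset: 0, length: 4 } },
        { content: "Smith", polygon: [1.5, 1.0, 2.0, 1.0, 2.0, 1.2, 1.5, 1.2], confidence: 0.98, span: { offset: 5, length: 5 } },
        { content: "555-0100", polygon: [2.1, 1.0, 3.0, 1.0, 3.0, 1.2, 2.1, 1.2], confidence: 0.62, span: { offset: 11, length: 8 } },
      ],
    },
  ],
};

// Flatten every word into { page, text, confidence, span, box }, where box is
// the axis-aligned rectangle that encloses the word's four-corner polygon.
function listWords(result) {
  const words = [];
  for (const page of result.pages ?? []) {
    for (const word of page.words ?? []) {
      // Even indexes are x coordinates, odd indexes are y coordinates.
      const xs = word.polygon.filter((_, i) => i % 2 === 0);
      const ys = word.polygon.filter((_, i) => i % 2 === 1);
      const x = Math.min(...xs);
      const y = Math.min(...ys);
      words.push({
        page: page.pageNumber,
        text: word.content,
        confidence: word.confidence,
        span: word.span,
        box: { x, y, width: Math.max(...xs) - x, height: Math.max(...ys) - y },
      });
    }
  }
  return words;
}

// Example: list the words the OCR step was least sure about.
const lowConfidence = listWords(sampleResult).filter((w) => w.confidence < 0.8);
console.log(lowConfidence.map((w) => w.text));
```

    Keeping the box next to the text is what later lets you place or cover a word at the same position on a new page.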

    To achieve your goal of creating an editable PDF, you would likely need to combine Azure's OCR capabilities with a PDF generation library. Here's a simplified workflow:

    1. OCR: Use Document Intelligence (formerly Form Recognizer) to extract the text and its position from the scanned PDF or PNG.
    2. Redaction: Identify and redact PII in the extracted text. You can use Azure Text Analytics for this, which provides a pre-trained PII recognition model.
    3. PDF Generation: Use a PDF generation library to create a new PDF. You can position the text based on the coordinates returned by the OCR step. Some popular JavaScript libraries for this are jsPDF and PDFKit.
    4. Make PDF Editable: To make the generated PDF editable, you would need to add form fields at the appropriate positions. This might be challenging, as it requires working out where these fields should go based on the text's position.
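
    Steps 1 and 2 can be connected through character offsets. The sketch below assumes that the text sent for PII detection is exactly the full `content` string from the OCR result, and that both services count offsets the same way (worth checking for non-ASCII text). Under those assumptions, each PII entity's `offset` and `length` can be matched against the `span` of each OCR word to find the words, and therefore the boxes, that need redacting. The sample data and the `wordsToRedact` helper are hypothetical.

```javascript
// Hypothetical OCR words for the content string "Call John Smith today".
// Each word carries its span in that string plus its bounding box.
const ocrWords = [
  { page: 1, text: "Call", span: { offset: 0, length: 4 }, box: { x: 1.0, y: 1.0, width: 0.4, height: 0.2 } },
  { page: 1, text: "John", span: { offset: 5, length: 4 }, box: { x: 1.5, y: 1.0, width: 0.4, height: 0.2 } },
  { page: 1, text: "Smith", span: { offset: 10, length: 5 }, box: { x: 2.0, y: 1.0, width: 0.5, height: 0.2 } },
  { page: 1, text: "today", span: { offset: 16, length: 5 }, box: { x: 2.6, y: 1.0, width: 0.5, height: 0.2 } },
];

// Hypothetical PII entities, shaped like the entries a PII detection call
// returns (text, category, offset, length).
const piiEntities = [
  { text: "John Smith", category: "Person", offset: 5, length: 10 },
];

// Return every OCR word whose character range overlaps any PII entity.
function wordsToRedact(words, entities) {
  return words.filter((word) => {
    const wordStart = word.span.offset;
    const wordEnd = wordStart + word.span.length;
    return entities.some((entity) => {
      const entityStart = entity.offset;
      const entityEnd = entityStart + entity.length;
      // Two half-open ranges overlap when each starts before the other ends.
      return wordStart < entityEnd && entityStart < wordEnd;
    });
  });
}

const hits = wordsToRedact(ocrWords, piiEntities);
console.log(hits.map((w) => `${w.text} on page ${w.page}`));
```

    An overlap test is used instead of an exact match because a single entity such as a full name usually covers several OCR words.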

    Please note that this is a non-trivial task that may require a significant amount of development work. The results may also fall short of specialized software such as Adobe Acrobat Pro, especially when it comes to preserving the original layout and formatting.

    As for redaction, Azure does not currently provide a built-in way to redact information in images or PDFs. You would need to implement this functionality yourself, for example, by drawing black boxes over the sensitive information in the generated PDF.
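
    As a rough illustration of the black-box approach, the sketch below converts an OCR bounding box into a rectangle in PDF points. It assumes the OCR coordinates are measured from the top-left corner of the page, which is also the default origin in PDFKit and jsPDF, so only scaling is needed. Scaling by the ratio of the output page width to the source page width works whether the source unit is inches or pixels. The `toPdfRect` helper and its padding parameter are hypothetical.

```javascript
// Convert a box in source page units (inches or pixels) into points on an
// output page that is outputWidthPt points wide. Padding widens the rectangle
// slightly so the edges of the glyphs are fully covered.
function toPdfRect(box, sourcePage, outputWidthPt, paddingPt = 1) {
  const scale = outputWidthPt / sourcePage.width;
  return {
    x: box.x * scale - paddingPt,
    y: box.y * scale - paddingPt,
    width: box.width * scale + 2 * paddingPt,
    height: box.height * scale + 2 * paddingPt,
  };
}

// Example: a word box on a US Letter scan (8.5 in wide), drawn on a
// 612 pt wide output page.
const sourcePage = { width: 8.5, height: 11, unit: "inch" };
const rect = toPdfRect({ x: 1.0, y: 1.0, width: 0.5, height: 0.2 }, sourcePage, 612);
console.log(rect);

// With PDFKit, the rectangle could then be drawn roughly like this
// (not executed here, since it needs the pdfkit package):
//   doc.rect(rect.x, rect.y, rect.width, rect.height).fill("black");
```

    Note that a box drawn over text does not remove the text underneath it. When you generate the PDF yourself, it is safer to skip writing the sensitive words at all and draw the box in their place.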

    I hope this helps.

    Regards,

    Yutong

    1 person found this answer helpful.

  2. glen sale 41 Reputation points
    2024-02-27T22:36:36.68+00:00

    Here is the code I have so far. It won't let me insert as code block.
    Pasted as Text
    ocr.txt

