Transferring Document Data to a Predefined Template

Abhinav Thakare 0 Reputation points
2024-10-05T13:01:59.3766667+00:00

I have some documents in either PDF or DOCX format. Using their data, I want to create a new document in a specific template.

I tried running Document Intelligence to extract the text and associated images from the documents, but it applies OCR to the images too (which I don't want).

The images must be handled as single entities associated with that text.

Can someone help here?

Azure Machine Learning
An Azure machine learning service for building and deploying models.
Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. Vinodh247 32,106 Reputation points MVP
    2024-10-06T06:53:46.02+00:00

    Hi Abhinav Thakare,

    Thanks for reaching out to Microsoft Q&A.

    The Layout model (prebuilt-layout) in Azure AI Document Intelligence is designed to extract text, tables, and document structure while preserving layout information. Depending on the API version, it can also report figures as separate entities with their positions and sizes. As far as I know it still runs OCR over the whole page, so text printed inside an image can show up in the output; the bounding regions are what let you tell that text apart from the body text. Send your document to the Layout model endpoint; the response will include the text along with bounding boxes and, where supported, figure regions.

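    import time
    import requests
    endpoint = "https://<your-form-recognizer-endpoint>/formrecognizer/documentModels/prebuilt-layout:analyze?api-version=2023-07-31"
    headers = {
        'Content-Type': 'application/pdf',
        'Ocp-Apim-Subscription-Key': '<your-subscription-key>',
    }
    with open('your_document.pdf', 'rb') as f:
        data_bytes = f.read()
    # The analyze call is asynchronous: it answers 202 with an Operation-Location URL to poll
    response = requests.post(endpoint, headers=headers, data=data_bytes)
    response.raise_for_status()
    poll_url = response.headers['Operation-Location']
    poll_headers = {'Ocp-Apim-Subscription-Key': '<your-subscription-key>'}
    while True:
        result = requests.get(poll_url, headers=poll_headers).json()
        if result['status'] in ('succeeded', 'failed'):
            break
        time.sleep(1)  # wait before polling again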
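
To make the shape of the response more concrete, here is a small sketch of walking the JSON. It assumes the final result follows the REST layout `analyzeResult.paragraphs[*]`, where each paragraph has `content` and `boundingRegions` (a `pageNumber` plus a flat `polygon` of x/y pairs). The helper name `iter_paragraphs` and the sample dictionary are made up for illustration, so check them against your own output.

```python
from typing import Iterator, List, Tuple


def iter_paragraphs(result: dict) -> Iterator[Tuple[int, str, List[float]]]:
    """Yield (page_number, text, polygon) for every paragraph region."""
    analyze_result = result.get("analyzeResult", {})
    for paragraph in analyze_result.get("paragraphs", []):
        for region in paragraph.get("boundingRegions", []):
            yield region["pageNumber"], paragraph["content"], region["polygon"]


# Hand-written sample in the assumed response shape (not real service output)
sample_result = {
    "status": "succeeded",
    "analyzeResult": {
        "paragraphs": [
            {
                "content": "Introduction",
                "boundingRegions": [
                    {"pageNumber": 1,
                     "polygon": [1.0, 1.0, 4.0, 1.0, 4.0, 1.5, 1.0, 1.5]}
                ],
            }
        ]
    },
}

for page_number, text, polygon in iter_paragraphs(sample_result):
    print(page_number, text, polygon)
```

Keeping the page number and polygon next to each piece of text is what later lets you match text to nearby images.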
    
    
    

    Extract Images Separately:

    • Extract Image Positions:
      • Use the bounding box information from the Layout model to identify where images are located in the document.
    • Extract Image Content:
      • For PDFs: Use a library like PyMuPDF or pdfminer.six to extract images based on their positions.
      • For DOCX: Use python-docx to directly extract embedded images.
    # Example using PyMuPDF to extract embedded images from a PDF
    import fitz  # PyMuPDF
    doc = fitz.open('your_document.pdf')
    for page_index in range(len(doc)):
        page = doc[page_index]
        image_list = page.get_images()
        for image_index, img in enumerate(image_list):
            xref = img[0]  # cross-reference number of the image object
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha > 3:  # CMYK: convert to RGB before saving as PNG
                pix = fitz.Pixmap(fitz.csRGB, pix)
            pix.save(f'image_page{page_index}_{image_index}.png')
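
If you would rather cut an image out by its position than pull the embedded object, something like the following may work. It is only a sketch: it assumes the polygon comes from a PDF analysis, where Document Intelligence reports coordinates in inches from the top-left corner, and that the page is not rotated. `polygon_to_rect_points` and `crop_region` are hypothetical helper names, not part of either library.

```python
from typing import List, Tuple

POINTS_PER_INCH = 72.0  # PDF user space uses 72 points per inch


def polygon_to_rect_points(polygon: List[float]) -> Tuple[float, float, float, float]:
    """Convert a flat [x1, y1, x2, y2, ...] polygon in inches
    to an (x0, y0, x1, y1) bounding rectangle in points."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    return (min(xs) * POINTS_PER_INCH, min(ys) * POINTS_PER_INCH,
            max(xs) * POINTS_PER_INCH, max(ys) * POINTS_PER_INCH)


def crop_region(pdf_path: str, page_number: int, polygon: List[float],
                out_path: str, dpi: int = 200) -> None:
    """Render one rectangular region of a page to a PNG file."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependencies

    with fitz.open(pdf_path) as doc:
        page = doc[page_number - 1]  # assumed: service page numbers are 1-based
        clip = fitz.Rect(*polygon_to_rect_points(polygon))
        page.get_pixmap(clip=clip, dpi=dpi).save(out_path)
```

For example, `crop_region('your_document.pdf', 1, polygon, 'figure_1.png')` would save the area covered by `polygon` on the first page. Rendering a region this way keeps the image as one picture even when the PDF stores it as several tiles.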
    
    
    

    Combine Text and Images into Your Template:

    • Choose a Document Generation Library:
      • For DOCX: Use python-docx to create and manipulate Word documents.
      • For PDFs: Use ReportLab to generate new PDFs (PyPDF2 is better suited to merging or modifying existing ones).
    • Populate the Template:
      • Insert the extracted text into the appropriate placeholders in your template.
      • Insert images at the locations corresponding to their original positions or as defined by your template.
    from docx import Document
    doc = Document('your_template.docx')
    # Insert text and images as needed
    doc.save('new_document.docx')
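
To flesh out the "insert text and images" step, here is one possible approach, assuming your template marks slots with `{{name}}` placeholders. That placeholder syntax and the helper names `fill_text` and `fill_template` are my own invention for this sketch, so adapt them to whatever convention your template actually uses.

```python
import re
from typing import Dict

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")  # hypothetical {{name}} marker syntax


def fill_text(text: str, values: Dict[str, str]) -> str:
    """Replace {{name}} markers with values; unknown markers are left as-is."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), text)


def fill_template(template_path: str, out_path: str,
                  values: Dict[str, str], images: Dict[str, str]) -> None:
    """Fill text placeholders and swap image placeholders for pictures."""
    from docx import Document  # python-docx, imported lazily
    from docx.shared import Inches

    doc = Document(template_path)
    for paragraph in doc.paragraphs:
        match = PLACEHOLDER.fullmatch(paragraph.text.strip())
        if match and match.group(1) in images:
            paragraph.text = ""  # drop the marker, then put the picture in its place
            paragraph.add_run().add_picture(images[match.group(1)], width=Inches(4))
        elif PLACEHOLDER.search(paragraph.text):
            # Assigning .text collapses the paragraph into one run,
            # so inline formatting inside that paragraph is lost
            paragraph.text = fill_text(paragraph.text, values)
    doc.save(out_path)
```

A call might look like `fill_template('your_template.docx', 'new_document.docx', {'title': 'Report'}, {'figure1': 'image_page0_0.png'})`. Note that `doc.paragraphs` only covers body paragraphs, so placeholders inside tables or headers would need their own loop.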
    
    
    

    Avoid OCR on Images:

    • Ensure Correct Model Usage:
      • As far as I know, the Layout model has no switch that turns OCR off for image content, so treat image regions as areas to exclude rather than relying on the model to skip them.
    • Post-Processing Checks:
      • Inspect the model's output and discard any text whose bounding region falls inside an image region.
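
The post-processing check could be sketched roughly like this. It assumes paragraphs carry `boundingRegions` with a `pageNumber` and a flat `polygon`, and that you supply the image rectangles yourself in the same unit (for instance from figure regions in the response, where your API version returns them, or from rectangles reported by PyMuPDF converted to inches). `drop_text_in_images` and the sample data are hypothetical.

```python
from typing import Iterable, List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), same unit as the polygons


def polygon_center(polygon: List[float]) -> Tuple[float, float]:
    """Average of the polygon's corner points."""
    xs, ys = polygon[0::2], polygon[1::2]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def is_inside(point: Tuple[float, float], rect: Rect) -> bool:
    x, y = point
    x0, y0, x1, y1 = rect
    return x0 <= x <= x1 and y0 <= y <= y1


def drop_text_in_images(paragraphs: List[dict],
                        image_regions: Iterable[Tuple[int, Rect]]) -> List[dict]:
    """Keep only paragraphs whose center is not inside an image on the same page."""
    regions = list(image_regions)
    kept = []
    for paragraph in paragraphs:
        covered = any(
            region["pageNumber"] == page
            and is_inside(polygon_center(region["polygon"]), rect)
            for region in paragraph.get("boundingRegions", [])
            for page, rect in regions
        )
        if not covered:
            kept.append(paragraph)
    return kept


# Hand-written sample: one body paragraph and one label printed inside an image
sample_paragraphs = [
    {"content": "Body text",
     "boundingRegions": [{"pageNumber": 1,
                          "polygon": [1.0, 1.0, 4.0, 1.0, 4.0, 1.5, 1.0, 1.5]}]},
    {"content": "Label inside chart",
     "boundingRegions": [{"pageNumber": 1,
                          "polygon": [2.0, 3.0, 3.0, 3.0, 3.0, 3.2, 2.0, 3.2]}]},
]
sample_images = [(1, (1.5, 2.5, 5.0, 5.0))]

body_only = drop_text_in_images(sample_paragraphs, sample_images)
print([p["content"] for p in body_only])
```

Testing the center point is a deliberately simple rule; a caption that only partly overlaps an image stays with the body text, which is usually what you want when the image should remain a single entity.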

    By combining the Layout model in Azure AI Document Intelligence with direct image extraction, you can recover the text and images from your documents and treat each image as a single entity, provided you filter out any OCR text that falls inside image regions. You can then programmatically populate your predefined template with this extracted content using document generation libraries.

    Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.
