Hi Abhinav Thakare,
Thanks for reaching out to Microsoft Q&A.
The Layout model (prebuilt-layout) in Azure Document Intelligence is designed to extract text, tables, and images from documents while preserving their structural information. It identifies images as separate entities and provides their positions and sizes without applying OCR to the image content itself. Send your document to the Layout model endpoint; the response will include both the text and the detected images, along with their bounding boxes and relationships.
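import time
import requests

endpoint = "https://<your-form-recognizer-endpoint>/formrecognizer/documentModels/prebuilt-layout:analyze?api-version=2023-07-31"
headers = {
    'Content-Type': 'application/pdf',
    'Ocp-Apim-Subscription-Key': '<your-subscription-key>',
}
with open('your_document.pdf', 'rb') as f:
    data_bytes = f.read()
# The analyze call is asynchronous: it answers 202 Accepted and returns
# the URL of the result in the Operation-Location header.
response = requests.post(endpoint, headers=headers, data=data_bytes)
response.raise_for_status()
result_url = response.headers['Operation-Location']
while True:  # poll until the analysis has finished
    result = requests.get(result_url, headers=headers).json()
    if result['status'] in ('succeeded', 'failed'):
        break
    time.sleep(1)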
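Once the analysis JSON is available as a dictionary, the text and its position can be read from the paragraphs in the result. The following is only a minimal sketch, assuming the v3 response shape in which `analyzeResult.paragraphs` entries carry `content` and `boundingRegions` with a flat `polygon` of x, y pairs. The helper names `polygon_to_bbox` and `list_paragraphs` are hypothetical, and the sample dictionary is made up for illustration.

```python
def polygon_to_bbox(polygon):
    """Convert a flat [x1, y1, x2, y2, ...] polygon into (x0, y0, x1, y1)."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def list_paragraphs(result):
    """Return (page_number, bbox, text) tuples for every paragraph region."""
    items = []
    analyze_result = result.get("analyzeResult", {})
    for paragraph in analyze_result.get("paragraphs", []):
        for region in paragraph.get("boundingRegions", []):
            bbox = polygon_to_bbox(region["polygon"])
            items.append((region["pageNumber"], bbox, paragraph["content"]))
    return items


# Made-up sample that mimics the assumed response shape
sample_result = {
    "analyzeResult": {
        "paragraphs": [
            {
                "content": "Quarterly report",
                "boundingRegions": [
                    {"pageNumber": 1,
                     "polygon": [1.0, 1.0, 4.0, 1.0, 4.0, 1.5, 1.0, 1.5]}
                ],
            }
        ]
    }
}

for page_number, bbox, text in list_paragraphs(sample_result):
    print(page_number, bbox, text)
```

With a real response, `sample_result` would be replaced by the parsed JSON returned by the service.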
Extract Images Separately:
- Extract Image Positions:
  - Use the bounding box information from the Layout model to identify where images are located in the document.
- Extract Image Content:
  - For PDFs: Use a library like PyMuPDF or pdfminer.six to extract images based on their positions.
  - For DOCX: Use python-docx to directly extract embedded images.
# Example using PyMuPDF to extract images from a PDF
import fitz  # PyMuPDF

doc = fitz.open('your_document.pdf')
for page_index in range(len(doc)):
    page = doc[page_index]
    image_list = page.get_images()
    for image_index, img in enumerate(image_list):
        xref = img[0]  # cross-reference number of the embedded image
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha > 3:  # CMYK cannot be saved as PNG; convert to RGB first
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(f'image_page{page_index}_{image_index}.png')
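To extract an image based on its position rather than by enumerating embedded objects, the bounding polygon from the Layout result first has to be expressed in PDF points. The sketch below assumes the polygon is reported in inches with a top-left origin (the unit reported for PDF input) and that the page is not rotated. `polygon_to_clip_rect` is a hypothetical helper name.

```python
POINTS_PER_INCH = 72  # PDF user space uses 72 points per inch


def polygon_to_clip_rect(polygon, unit="inch"):
    """Turn a flat [x1, y1, x2, y2, ...] polygon into (x0, y0, x1, y1) in points."""
    if unit != "inch":
        # Other units (for example pixels for image input) need their own scale.
        raise ValueError(f"unsupported unit: {unit}")
    xs = [x * POINTS_PER_INCH for x in polygon[0::2]]
    ys = [y * POINTS_PER_INCH for y in polygon[1::2]]
    return (min(xs), min(ys), max(xs), max(ys))


# Made-up polygon: a region from (1in, 2in) to (3in, 4.5in)
rect = polygon_to_clip_rect([1.0, 2.0, 3.0, 2.0, 3.0, 4.5, 1.0, 4.5])
print(rect)
```

The resulting tuple could then be passed to PyMuPDF as a clip rectangle, for example `page.get_pixmap(clip=fitz.Rect(*rect))`, to render only that region of the page.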
Combine Text and Images into Your Template:
- Choose a Document Generation Library:
  - For DOCX: Use python-docx to create and manipulate Word documents.
  - For PDFs: Use ReportLab or PyPDF2 to generate PDFs.
- Populate the Template:
  - Insert the extracted text into the appropriate placeholders in your template.
  - Insert images at the locations corresponding to their original positions or as defined by your template.
from docx import Document
doc = Document('your_template.docx')
# Insert text and images as needed
doc.save('new_document.docx')
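One possible way to fill the placeholders is to walk the paragraphs and runs of the template and substitute marker strings. This is only a sketch: the `{{key}}` marker syntax and the function name `replace_placeholders` are assumptions, not something your template necessarily uses. The function relies only on the `paragraphs`, `runs`, and `text` attributes, so it is shown here with stand-in objects and needs no library to run.

```python
from types import SimpleNamespace


def replace_placeholders(document, values):
    """Replace {{key}} markers in every run of every paragraph; return the count."""
    replaced = 0
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            for key, text in values.items():
                marker = "{{" + key + "}}"
                if marker in run.text:
                    run.text = run.text.replace(marker, text)
                    replaced += 1
    return replaced


# Stand-in objects so the sketch runs without python-docx installed
run = SimpleNamespace(text="Title: {{title}}")
fake_doc = SimpleNamespace(paragraphs=[SimpleNamespace(runs=[run])])
count = replace_placeholders(fake_doc, {"title": "Quarterly report"})
print(count, run.text)
```

A python-docx `Document` exposes the same `paragraphs` and `runs` attributes, so the function can be called on it directly, and extracted images can be added with `doc.add_picture(...)`. Note that Word sometimes splits a placeholder across several runs, in which case this simple per-run match will not find it.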
Avoid OCR on Images:
- Ensure Correct Model Usage:
  - By using the Layout model, you're instructing Azure Document Intelligence not to apply OCR to image content.
- Post-Processing Checks:
  - Verify that no OCR text is extracted from image regions by inspecting the model's output.
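The post-processing check can be sketched as a simple geometric test: collect the image regions, then flag any word whose center falls inside one of them. This assumes the response exposes image regions under a `figures` key with `boundingRegions` (which may depend on the API version you call) and words under `pages[].words[]` with a flat `polygon`. The function names and the sample data are hypothetical.

```python
def polygon_to_bbox(polygon):
    """Convert a flat [x1, y1, x2, y2, ...] polygon into (x0, y0, x1, y1)."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def words_inside_figures(analyze_result):
    """Return the text of every word whose center lies inside a figure region."""
    figure_boxes = [
        (region["pageNumber"], polygon_to_bbox(region["polygon"]))
        for figure in analyze_result.get("figures", [])
        for region in figure.get("boundingRegions", [])
    ]
    hits = []
    for page in analyze_result.get("pages", []):
        for word in page.get("words", []):
            x0, y0, x1, y1 = polygon_to_bbox(word["polygon"])
            cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
            for page_number, (fx0, fy0, fx1, fy1) in figure_boxes:
                same_page = page_number == page["pageNumber"]
                if same_page and fx0 <= cx <= fx1 and fy0 <= cy <= fy1:
                    hits.append(word["content"])
                    break
    return hits


# Made-up sample: one figure and two words on page 1
sample = {
    "figures": [
        {"boundingRegions": [{"pageNumber": 1, "polygon": [1, 1, 3, 1, 3, 3, 1, 3]}]}
    ],
    "pages": [
        {
            "pageNumber": 1,
            "words": [
                {"content": "Caption", "polygon": [1.2, 1.2, 1.8, 1.2, 1.8, 1.4, 1.2, 1.4]},
                {"content": "Body", "polygon": [4, 4, 5, 4, 5, 4.2, 4, 4.2]},
            ],
        }
    ],
}

print(words_inside_figures(sample))
```

If the returned list is not empty for your documents, the flagged words can be filtered out before the text is written into the template.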
By using the Layout model in Azure Document Intelligence, you can extract text and images from your documents while treating images as single entities without OCR. You can then programmatically populate your predefined template with this extracted content using document generation libraries.