@Liam Slade Welcome to the Microsoft Q&A Forum, and thank you for posting your query here!
The Prebuilt Read API in Azure AI Document Intelligence is great for extracting text, but it doesn’t directly support overlaying the extracted text onto the original PDFs. However, you can achieve this by combining the OCR results with a PDF manipulation library in Python, such as PyMuPDF or reportlab.
Here’s a step-by-step approach to help you:
- Extract Text Using Prebuilt Read API: Continue using the Prebuilt Read API to extract text from your PDFs.
- Overlay Text on PDF: Use a PDF manipulation library to overlay the extracted text onto the original PDF. Here’s a sample code snippet. I haven't tested it on my end, so you may need to debug and adapt it:
import os

import fitz  # PyMuPDF
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Azure Form Recognizer (Document Intelligence) credentials
endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
key = "YOUR_FORM_RECOGNIZER_KEY"

# Initialize the client
client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

# Function to extract text using Prebuilt Read API
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-read", document=f)
        result = poller.result()
    return result

# Function to overlay text on PDF
def overlay_text_on_pdf(pdf_path, result):
    doc = fitz.open(pdf_path)
    for page_num, page in enumerate(doc):
        ocr_page = result.pages[page_num]
        # For PDF input the service reports coordinates in inches,
        # while PyMuPDF works in points (72 points per inch)
        scale = 72 if ocr_page.unit == "inch" else 1
        for word in ocr_page.words:
            # In azure-ai-formrecognizer 3.2+ each word has a polygon: a list of Point(x, y)
            x_coords = [point.x * scale for point in word.polygon]
            y_coords = [point.y * scale for point in word.polygon]
            # Create a rectangular bounding box that encloses the polygon
            rect = fitz.Rect(min(x_coords), min(y_coords), max(x_coords), max(y_coords))
            # Insert the word with its baseline at the bottom-left corner of the box
            # (insert_textbox writes nothing when the text does not fit inside the box)
            page.insert_text((rect.x0, rect.y1), word.content, fontsize=8, color=(0, 0, 0))
    # Prefix only the file name so paths that include a directory still work
    folder, name = os.path.split(pdf_path)
    output_path = os.path.join(folder, "output_" + name)
    doc.save(output_path)
    return output_path

# Example usage
pdf_path = "path_to_your_pdf.pdf"
result = extract_text_from_pdf(pdf_path)
output_pdf = overlay_text_on_pdf(pdf_path, result)
print(f"Processed PDF saved at: {output_pdf}")
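One detail worth checking on its own is the coordinate conversion. For PDF input, the Read API reports word positions in inches, whereas PyMuPDF expects points (72 per inch). Below is a minimal sketch of that step in plain Python, so you can verify it without calling the service. The names `polygon_to_rect` and `POINTS_PER_INCH` are my own and are not part of either library, and I am assuming each polygon vertex can be unpacked as an (x, y) pair:

```python
from typing import Iterable, Tuple

POINTS_PER_INCH = 72.0  # PDF user space: 72 points per inch


def polygon_to_rect(
    polygon: Iterable[Tuple[float, float]], unit: str = "inch"
) -> Tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) in points for the rectangle enclosing the polygon.

    Hypothetical helper: assumes each vertex is an (x, y) pair and that
    `unit` is either "inch" (PDF input) or "pixel" (image input).
    """
    scale = POINTS_PER_INCH if unit == "inch" else 1.0
    vertices = list(polygon)
    xs = [x * scale for x, _ in vertices]
    ys = [y * scale for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


# A word box 1 inch from the left and top edges, 0.5 in wide and 0.25 in tall
rect = polygon_to_rect([(1.0, 1.0), (1.5, 1.0), (1.5, 1.25), (1.0, 1.25)])
print(rect)  # (72.0, 72.0, 108.0, 90.0)
```

The resulting tuple can be passed straight to `fitz.Rect(*rect)`. If the overlaid text lands in the wrong place, the unit and the page rotation are the first things I would check.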
Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.