How to OCR a PDF using the prebuilt read API in Python?

Question

Liam Slade 0

I'm using the Prebuilt Read API in Python to perform OCR on PDF documents from a folder. I can successfully upload and OCR the PDFs, but I'm having trouble downloading the resulting PDFs with the extracted text overlaid on them. How can I modify my code to download the processed PDFs with the OCR text included? Is there any sample code or method that does this efficiently?

navba-MSFT 27,545 Reputation points Microsoft Employee Moderator

2024-10-07T03:39:50.3066667+00:00

@Liam Slade Just following up to check if my suggestion helped. Please let me know if you have any follow-up queries.

Liam Slade 0 Reputation points

2024-10-07T17:05:27.8966667+00:00

@navba-MSFT Thank you for following up. I used the AnalyzeOutputOption and AnalyzeResult classes from the azure.ai.documentintelligence.models package. These classes helped return a PDF with searchable text, essentially providing an OCRed PDF.

navba-MSFT 27,545 Reputation points Microsoft Employee Moderator

2024-10-08T05:08:14.4533333+00:00

@Liam Slade Thanks for the update.
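
The approach Liam describes (requesting PDF output from the service) might look roughly like the sketch below. It is untested against a live resource and makes several assumptions: `save_searchable_pdf` is a hypothetical helper name, `client` is assumed to be a `DocumentIntelligenceClient` from the `azure-ai-documentintelligence` package, the `body=` keyword matches recent SDK releases (earlier betas used a different keyword), and `"pdf"` is passed as the string value of `AnalyzeOutputOption.PDF` so that the function itself has no SDK import.

```python
def save_searchable_pdf(client, pdf_path, out_path):
    """Analyze pdf_path with prebuilt-read and write the searchable PDF to out_path.

    `client` is assumed to be an
    azure.ai.documentintelligence.DocumentIntelligenceClient.
    """
    with open(pdf_path, "rb") as f:
        poller = client.begin_analyze_document(
            "prebuilt-read",
            body=f,          # assumed keyword name; check your SDK version
            output=["pdf"],  # string value of AnalyzeOutputOption.PDF
        )
    result = poller.result()  # an AnalyzeResult

    # The operation id identifies the stored result that holds the generated PDF.
    operation_id = poller.details["operation_id"]
    chunks = client.get_analyze_result_pdf(
        model_id=result.model_id, result_id=operation_id
    )

    # The PDF comes back as an iterator of byte chunks.
    with open(out_path, "wb") as out:
        out.writelines(chunks)
    return out_path


# Usage (needs `pip install azure-ai-documentintelligence` and real credentials):
# from azure.ai.documentintelligence import DocumentIntelligenceClient
# from azure.core.credentials import AzureKeyCredential
# client = DocumentIntelligenceClient(
#     endpoint="YOUR_ENDPOINT", credential=AzureKeyCredential("YOUR_KEY")
# )
# save_searchable_pdf(client, "scan.pdf", "scan_searchable.pdf")
```

Passing the client in as a parameter keeps the credentials out of the helper and makes it easy to call once per file.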

Answer 1

@Liam Slade Welcome to the Microsoft Q&A Forum. Thank you for posting your query here!

The Prebuilt Read API in Azure AI Document Intelligence is great for extracting text, but it doesn’t directly support overlaying the extracted text onto the original PDFs. However, you can achieve this by combining the OCR results with a PDF manipulation library in Python, such as PyMuPDF or reportlab. Here’s a step-by-step approach to help you:

1. Extract Text Using Prebuilt Read API: Continue using the Prebuilt Read API to extract text from your PDFs.
2. Overlay Text on PDF: Use a PDF manipulation library to overlay the extracted text onto the original PDF. Here’s a sample code snippet, which I haven't tested on my end. You may need to debug and adapt it:

import os

import fitz  # PyMuPDF
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Azure Form Recognizer credentials
endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
key = "YOUR_FORM_RECOGNIZER_KEY"

# Initialize the client
client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

# PyMuPDF works in points; for PDFs the service reports coordinates in inches
POINTS_PER_INCH = 72

# Function to extract text using Prebuilt Read API
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-read", document=f)
        result = poller.result()
    return result

# Function to overlay text on PDF
def overlay_text_on_pdf(pdf_path, result):
    doc = fitz.open(pdf_path)

    for page, ocr_page in zip(doc, result.pages):
        # Convert the service's units to PDF points
        scale = POINTS_PER_INCH if ocr_page.unit == "inch" else 1
        for word in ocr_page.words:
            if not word.polygon:
                continue
            # word.polygon is a list of points with x and y attributes
            x_coords = [p.x * scale for p in word.polygon]
            y_coords = [p.y * scale for p in word.polygon]

            # Create a rectangular bounding box that encloses the polygon
            rect = fitz.Rect(min(x_coords), min(y_coords), max(x_coords), max(y_coords))

            # Write the word at the bottom-left corner of its box as invisible
            # text (render_mode=3), so the page looks the same but is searchable.
            # Remove render_mode to draw visible text instead.
            fontsize = max(rect.height * 0.8, 1)
            page.insert_text(rect.bl, word.content, fontsize=fontsize, render_mode=3)

    # Prefix the file name (not the whole path) with "output_"
    folder, name = os.path.split(pdf_path)
    output_path = os.path.join(folder, "output_" + name)
    doc.save(output_path)
    doc.close()
    return output_path

# Example usage
pdf_path = "path_to_your_pdf.pdf"
result = extract_text_from_pdf(pdf_path)
output_pdf = overlay_text_on_pdf(pdf_path, result)
print(f"Processed PDF saved at: {output_pdf}")
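
Since the question mentions processing PDFs from a folder, a batch loop around any single-file routine could be sketched as below. This is only an illustration: `process_folder`, `process_one`, and the `_ocr` file-name suffix are hypothetical names, and `process_one(src, dst)` stands for whatever function OCRs one PDF and writes the result to `dst`.

```python
from pathlib import Path


def process_folder(input_dir, output_dir, process_one):
    """Call process_one(src, dst) for every *.pdf in input_dir.

    Returns a dict mapping the names of files that failed to the error text.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    failures = {}
    # Note: the "*.pdf" pattern is case-sensitive on most Linux file systems.
    for src in sorted(Path(input_dir).glob("*.pdf")):
        dst = out / f"{src.stem}_ocr.pdf"
        try:
            process_one(str(src), str(dst))
        except Exception as exc:
            # Keep going so that one bad file does not stop the whole batch.
            failures[src.name] = str(exc)
    return failures
```

Collecting failures instead of raising lets a long batch finish, after which the failed files can be retried or inspected individually.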

Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.
