Thank you for your response and also for giving more details.
Based on the information you provided, you can improve the efficiency of document analysis by submitting the entire document to Azure Cognitive Services Form Recognizer in a single request, instead of splitting it with PyPDF2 and analyzing each page individually. This approach may significantly reduce processing time, especially for larger documents like the ones in your use case. It also simplifies extraction, because the service returns the extracted text already associated with the corresponding page numbers, so you do not have to track them yourself as in the page-by-page approach.
Below is the repro I tried on my end with a 4-page PDF document. It reported CPU times: total: 359 ms and Wall time: 20.5 s.
Here is the Python code that worked for me.
Please modify it as needed for your use case.
%%time
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Set up the client (replace the endpoint and key with your own values)
endpoint = 'http://localhost:5000/'
credential = AzureKeyCredential('YOUR_AZURE_KEY')
client = DocumentAnalysisClient(endpoint, credential)

# Define the list to hold document data
documents = []

# Read the PDF file
with open('YOUR_PATH_TO_PDF_FILE', 'rb') as file:
    file_data = file.read()

# Add the document data to the list
documents.append({"id": "1", "data": file_data})

# Analyze each document in the list
for document in documents:
    document_data = document["data"]
    document_id = document["id"]

    # Analyze the whole document in a single call
    poller = client.begin_analyze_document(model_id='prebuilt-read', document=document_data, api_version="2022-08-31")

    # Wait for the analysis to complete
    result = poller.result()

    # Extract the text and page numbers
    content_total = ''
    for page in result.pages:
        page_number = page.page_number
        # Join lines with a space so words at line boundaries do not run together
        page_text = ' '.join(line.content for line in page.lines)
        content_total += f'<Page {page_number:03d}>' + page_text

    print(f"Document ID: {document_id}")
    print(content_total)
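One note on the timing figures: `%%time` is an IPython/Jupyter cell magic, so it is only valid inside a notebook cell. If you run the code as a plain script, a rough equivalent can be sketched with the standard library. The names `timed` and `slow_call` below are hypothetical and only for illustration; `slow_call` stands in for the call that waits on the service.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


def slow_call():
    # Hypothetical stand-in for poller.result(): mostly waiting, little CPU work.
    time.sleep(0.2)
    return "done"


result, wall, cpu = timed(slow_call)
print(f"Wall time: {wall:.3f} s, CPU time: {cpu:.3f} s")
```

A large gap between wall time and CPU time, as in the figures reported earlier, usually means most of the time is spent waiting on the service rather than on local processing.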
The repro code uses the `begin_analyze_document` method of the `DocumentAnalysisClient` class to analyze the entire PDF file in one call, which is expected to be faster than calling it once per page. The code reads the text from the `content` attribute of each line in `page.lines` (a `DocumentLine` object) instead of calling the `extractText` method of the `PyPDF2.pdf.PageObject` class. Finally, it prefixes each page's text with its page number using an f-string, which is a readable way to format strings.
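If you want to check the page-tagging logic without calling the service, the step can be isolated into a small helper. This is only a sketch: `tag_pages` and `fake_pages` are hypothetical names, and the stand-in objects mimic just the `page_number`, `lines`, and `content` attributes that the repro reads from the result.

```python
from types import SimpleNamespace


def tag_pages(pages, line_sep=" "):
    """Join each page's lines and prefix them with a <Page NNN> marker."""
    parts = []
    for page in pages:
        text = line_sep.join(line.content for line in page.lines)
        parts.append(f"<Page {page.page_number:03d}>{text}")
    return "".join(parts)


# Stand-in objects for illustration only; real pages would come from
# poller.result().pages in the repro code.
fake_pages = [
    SimpleNamespace(
        page_number=1,
        lines=[SimpleNamespace(content="Hello"), SimpleNamespace(content="world")],
    ),
    SimpleNamespace(
        page_number=2,
        lines=[SimpleNamespace(content="Second page")],
    ),
]

print(tag_pages(fake_pages))
# prints: <Page 001>Hello world<Page 002>Second page
```

With a real result you would call `tag_pages(result.pages)`. Pass `line_sep="\n"` if you prefer to keep the original line breaks.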
I hope this helps. Thank you.
If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".