Using Document Intelligence to create chunks for index - how do I extract the related page number of the pdf

Joachim Albertsson 50 Reputation points
2024-03-05T23:15:07.91+00:00

I am using a Python script to read PDFs stored in Blob Storage. The PDFs are read with DocumentIntelligenceClient and the prebuilt-read model. The extracted text is cut into chunks, and I attach various metadata to each chunk for hybrid search, along with the full URL of the PDF and the page number the chunk starts on, so that the copilot app can point its reference to that PDF and to the correct (or almost correct) page.

I haven't found a good way to extract the start page for each chunk, so I applied a quick and dirty approach: calling DocumentIntelligenceClient once per page, adding a page code at the beginning of each page's extracted text, and then merging all the extracted text. The merged text is then chunked by chunk size/overlap, and the page code found in the chunk (or in the previous chunk) is removed from the chunk text and added as chunk metadata.

This works well from the app and user perspective, but it makes data extraction about 15 times slower, which will not scale as more data is added.

Is there a way to process the whole PDF in one call and still extract the related pages?

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

Accepted answer
  1. dupammi 8,460 Reputation points Microsoft Vendor
    2024-03-07T09:03:30.66+00:00

    Hi @Joachim Albertsson

    Thank you for your response and also for giving more details.

    Based on the information you provided, you could improve the efficiency of document analysis by submitting the entire document to Azure Form Recognizer (now Azure AI Document Intelligence) in a single call, instead of analyzing each page individually using PyPDF2. This approach might significantly reduce the processing time, especially for larger documents. It also simplifies extraction, because the service returns its results per page, so the extracted text is already associated with the corresponding page numbers.
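To illustrate how that page association could feed the chunking step: each page in the analysis result carries spans (an offset and a length) pointing into the full `content` string, so the start offset of a chunk can be mapped back to a page with a binary search. The following is only a minimal sketch under stated assumptions. The function names and the `(page_number, offset, length)` tuple layout are hypothetical, and it assumes the span offsets are expressed in Python string indices, which depends on the string index type used for the request.

```python
from bisect import bisect_right


def page_for_offset(offset, starts, pages):
    """Return the page whose span starts at or before `offset`."""
    if not starts:
        return None
    i = bisect_right(starts, offset) - 1
    return pages[max(i, 0)]


def chunk_with_pages(content, page_spans, chunk_size=1000, overlap=100):
    """Cut `content` into overlapping chunks and record each chunk's start page.

    `page_spans` is a list of (page_number, offset, length) tuples. With an
    analysis result it could be built roughly like this (assumption):
        [(p.page_number, s.offset, s.length)
         for p in result.pages for s in p.spans]
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    ordered = sorted(page_spans, key=lambda span: span[1])
    starts = [offset for _, offset, _ in ordered]
    pages = [page for page, _, _ in ordered]

    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(content), step):
        chunks.append({
            "text": content[start:start + chunk_size],
            "start_page": page_for_offset(start, starts, pages),
        })
        # Stop once a chunk has reached the end of the content.
        if start + chunk_size >= len(content):
            break
    return chunks
```

Because the page is looked up from the chunk's start offset, no page markers need to be inserted into the text or stripped out of the chunks afterwards.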

    Below is the repro I tried at my end on a 4-page PDF document which took CPU times: total: 359 ms and Wall time: 20.5 s.

    Python code that worked for me.

    Please modify accordingly, as per your use case.

    %%time
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential
    # Set up the client
    endpoint = 'http://localhost:5000/'
    credential = AzureKeyCredential('YOUR_AZURE_KEY')
    client = DocumentAnalysisClient(endpoint, credential)
    # Define the list to hold document data
    documents = []
    # Read the PDF file
    with open('YOUR_PATH_TO_PDF_FILE', 'rb') as file:
        file_data = file.read()
    # Add the document data to the list
    documents.append({"id": "1", "data": file_data})
    # Analyze each document in the list
    for document in documents:
        document_data = document["data"]
        document_id = document["id"]
        
        # Analyze the document
        poller = client.begin_analyze_document(model_id='prebuilt-read', document=document_data, api_version="2022-08-31")
        # Wait for the analysis to complete
        result = poller.result()
        # Extract the text and page numbers
        content_total = ''
        for page in result.pages:
            page_number = page.page_number
            # Join lines with a space so words on adjacent lines are not glued together
            page_text = ' '.join([line.content for line in page.lines])
            content_total += f'<Page {page_number:03d}>' + page_text
        print(f"Document ID: {document_id}")
        print(content_total)
    
    

    The repro code above uses the begin_analyze_document method of the DocumentAnalysisClient class to analyze the entire PDF file in one call, which is expected to be faster than calling it once per page. The code reads the text from the content attribute of each DocumentLine in page.lines instead of using the extractText method of the PyPDF2.pdf.PageObject class. Finally, the code prefixes each page's text with its page number, formatted with an f-string as a marker such as <Page 001>.
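If it helps, the marked-up string produced above could be turned back into clean text plus page offsets before chunking. This is a rough sketch only: `split_marked_text` and the tuple layout are hypothetical names chosen for illustration, and it assumes the document text itself never contains something that looks like a `<Page NNN>` marker.

```python
import re

# Matches the markers written by f'<Page {page_number:03d}>'
PAGE_MARKER = re.compile(r"<Page (\d{3,})>")


def split_marked_text(marked):
    """Strip page markers and return (clean_text, page_spans).

    page_spans is a list of (page_number, offset, length) tuples, where
    offset and length refer to positions in clean_text. Any text before
    the first marker is ignored.
    """
    clean_parts = []
    page_spans = []
    offset = 0
    matches = list(PAGE_MARKER.finditer(marked))
    for i, match in enumerate(matches):
        # A page's text runs from the end of its marker to the next marker.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(marked)
        text = marked[match.end():end]
        page_spans.append((int(match.group(1)), offset, len(text)))
        clean_parts.append(text)
        offset += len(text)
    return "".join(clean_parts), page_spans
```

A chunker can then work on the clean text and look up the start page of each chunk from the offsets, so the markers never end up inside the indexed chunk text.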


    I hope this helps. Thank you.


