Using Document Intelligence to create chunks for index - how do I extract the related page number of the pdf

Question

Joachim Albertsson 50

I am using a Python script to read PDFs (in blob storage). The PDFs are read with DocumentIntelligenceClient and the prebuilt-read model. The extracted data is cut into chunks, and I put various metadata on the chunks for hybrid search, including the full URL to the PDF and the page number the chunk started on, so that the copilot app can point a reference to that PDF and to the correct (or almost correct) page.

I haven't found any good way of extracting the start page for each chunk, so I applied a quick and dirty approach: calling DocumentIntelligenceClient for each page, adding a page code at the beginning of the extracted data, and then merging all the extracted data. The result is then chunked based on chunk size/overlap, and the page code in the chunk (or in the previous chunk) is removed from the chunk data and added as chunk metadata.

This works well from the app and user perspective ... but it increases the time it takes to extract data by 15 times, so it is not good for the future when adding more data.

Is there a way to run the whole PDF and still extract the related pages?

dupammi 8,615 Reputation points Microsoft External Staff

2024-03-06T06:10:28.8733333+00:00
Hi @Joachim Albertsson

Thank you for using the Microsoft Q&A forum.

To extract the related page number of a PDF using Azure AI Document Intelligence, you can use the pages query parameter to indicate specific page numbers or page ranges for text extraction. The pages parameter is supported for large multi-page PDF documents.

In your case, you can run the whole PDF and extract the related pages by specifying the page numbers for each chunk. You can then add the page number as chunk metadata. This will avoid the need to call DocumentIntelligenceClient for each page, which can significantly reduce the time it takes to extract data.

Here's an example of how to use the pages parameter to extract text from specific pages of a PDF:

pages=1-3,5,7-9

This will extract text from pages 1, 2, 3, 5, 7, 8, and 9 of the PDF.
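
As a small illustration of how such a pages string reads, here is a minimal sketch that expands it into individual page numbers. The helper name expand_pages is hypothetical and is not part of any Azure SDK; the service interprets the string itself, so this is only meant to make the range syntax concrete.

```python
def expand_pages(spec: str) -> list[int]:
    """Expand a pages string such as '1-3,5,7-9' into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as '7-9' is inclusive at both ends.
            start, end = (int(p) for p in part.split("-", 1))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_pages("1-3,5,7-9"))  # [1, 2, 3, 5, 7, 8, 9]
```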

You can then use the extracted text to create chunks and add the page number as metadata for each chunk.

Please refer to doc-intel-4.0.0#data-extraction and HTTP#analyze-document-from-url in your use-case implementation.

I hope this helps! Thank you.
Joachim Albertsson 50 Reputation points

2024-03-07T06:37:10.6233333+00:00
Thanks. But I am already using the pages parameter.

Example ... reading a 140 page manual

Using python PyPDF2 to extract number of pages (140)

For each page

poller = form_recognizer_client.begin_analyze_document(model, document=file_data, pages=str(page))

Extract that page's content
Add the string "<Page 0xx>" at the start of the content
Append the content to the string content_total
(next page)

The result is a content_total like
<Page 001>Text from page 1 ..... <Page 002>Text from page 2 ... <Page 003>...

Then I chunk content_total, extract the first match of <Page xxx> from each chunk (removing it from the chunk), and add that page to the chunk_metadata. If no page is found, I use the previous page, since the full chunk is within a page.
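
The chunking step described above could be sketched roughly as follows. This is not the original script: the function name chunk_with_pages, the dictionary layout, and the choice to strip the markers before slicing (so that a marker can never be cut in half at a chunk boundary) are all assumptions made for illustration. A chunk gets the page of the first marker that falls inside it, otherwise the most recent page seen.

```python
import re

# Matches markers such as <Page 001>; the marker format is taken from the thread.
PAGE_MARKER = re.compile(r"<Page (\d+)>")


def chunk_with_pages(content_total: str, chunk_size: int, overlap: int = 0) -> list[dict]:
    """Split marker-annotated text into chunks, each tagged with a page number."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    # Remove the markers, remembering where each page starts in the cleaned text.
    clean_parts: list[str] = []
    page_starts: list[tuple[int, int]] = []  # (offset in cleaned text, page number)
    clean_len = 0
    last_end = 0
    for match in PAGE_MARKER.finditer(content_total):
        text = content_total[last_end:match.start()]
        clean_parts.append(text)
        clean_len += len(text)
        page_starts.append((clean_len, int(match.group(1))))
        last_end = match.end()
    clean_parts.append(content_total[last_end:])
    clean = "".join(clean_parts)

    chunks: list[dict] = []
    previous_page = None
    for start in range(0, len(clean), chunk_size - overlap):
        end = start + chunk_size
        pages_in_chunk = [page for offset, page in page_starts if start <= offset < end]
        # First marker inside the chunk, else fall back to the last page seen.
        page = pages_in_chunk[0] if pages_in_chunk else previous_page
        if pages_in_chunk:
            previous_page = pages_in_chunk[-1]
        chunks.append({"page": page, "text": clean[start:end]})
    return chunks


sample = "<Page 001>aaaa<Page 002>bbbb<Page 003>cc"
for chunk in chunk_with_pages(sample, chunk_size=2):
    print(chunk)
```

Because the rule takes the first marker found inside the chunk, a chunk that begins with the tail of one page and then crosses into the next is tagged with the later page, which matches the "almost correct" behaviour mentioned in the question.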

This results in a working solution at index level ... but the extraction is not optimal (quick and dirty) and takes about 15 times longer than running DocumentIntelligenceClient on the full 140 pages.

Analyzing the poller result, it is hard for me to see how the content could be matched to the page it was taken from in a full run.

I hope the above explanation makes sense.
Joachim Albertsson 50 Reputation points

2024-03-07T06:38:57.4966667+00:00

My answer below should probably be a comment to your answer ... sorry
dupammi 8,615 Reputation points Microsoft External Staff

2024-03-08T11:58:05.2766667+00:00

Hi @Joachim Albertsson

Did you get a chance to look at my latest response above?

Also let me know if it was helpful so that I can convert it to an answer, which will help others in the community who have a similar use case.

Thank you.
Joachim Albertsson 50 Reputation points

2024-03-08T15:56:18.68+00:00

Hi dupammi, your code looks promising. Maybe bad timing to ask ... goin on vacation today. Back on Wednesday ... will test it then :)

Thanks
Joachim Albertsson 50 Reputation points

2024-03-13T16:19:32.0933333+00:00
Hi again ... it's very much faster ... but I will have to evaluate a bit.

Update: The problem now is that I lose the paragraphs. But I will continue to evaluate.

Update 2: It works now. I used the code after the extraction of paragraphs etc...

Thanks !!!

for paragraph in form_recognizer_results.paragraphs:
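
The thread shows only the first line of that loop, so the following is a guess at what a body might look like, not the code actually used. It assumes each paragraph object exposes content and a bounding_regions list whose entries carry page_number, as the paragraph objects returned by the azure-ai-formrecognizer analysis result do to my knowledge. The helper name paragraphs_with_pages and the SimpleNamespace stand-ins are hypothetical and exist only so the sketch runs without a service call.

```python
from types import SimpleNamespace


def paragraphs_with_pages(paragraphs):
    """Return (page_number, text) pairs; page_number is None when no region is present."""
    pairs = []
    for paragraph in paragraphs:
        regions = paragraph.bounding_regions or []
        # A paragraph that spans pages has several regions; take the first as its start page.
        page = regions[0].page_number if regions else None
        pairs.append((page, paragraph.content))
    return pairs


# Stand-in objects that mimic the assumed shape of form_recognizer_results.paragraphs.
fake_paragraphs = [
    SimpleNamespace(content="Intro text", bounding_regions=[SimpleNamespace(page_number=1)]),
    SimpleNamespace(content="More text", bounding_regions=[SimpleNamespace(page_number=2)]),
    SimpleNamespace(content="No region", bounding_regions=None),
]

for page, text in paragraphs_with_pages(fake_paragraphs):
    print(page, text)
```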
dupammi 8,615 Reputation points Microsoft External Staff

2024-03-14T03:02:34.4733333+00:00

Hi @Joachim Albertsson

I'm glad to hear that my suggestions helped improve the performance of your document analysis.

I have converted the earlier response to an answer, which might benefit other community members reading this thread. Please accept the answer if you'd like.

Thank you!

Accepted answer

Answer 1

Hi @Joachim Albertsson

Thank you for your response and also for giving more details.

Based on the information you provided, to improve the efficiency of the document analysis you could try submitting the entire document to Azure Cognitive Services Form Recognizer for analysis at once, instead of analyzing each page individually (with the page count taken from PyPDF2). This approach might significantly reduce the processing time, especially for larger documents like the one in your use case. It also simplifies the extraction process, because the service automatically associates the extracted text with the corresponding page numbers, which should give a faster extraction than the page-by-page approach.

Below is the repro I tried at my end on a 4-page PDF document which took CPU times: total: 359 ms and Wall time: 20.5 s.

Python code that worked for me.

Please modify accordingly, as per your use case.

%%time
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Set up the client (api_version is a client option, not a per-call argument)
endpoint = 'http://localhost:5000/'
credential = AzureKeyCredential('YOUR_AZURE_KEY')
client = DocumentAnalysisClient(endpoint, credential, api_version="2022-08-31")

# Define the list to hold document data
documents = []

# Read the PDF file
with open('YOUR_PATH_TO_PDF_FILE', 'rb') as file:
    file_data = file.read()

# Add the document data to the list
documents.append({"id": "1", "data": file_data})

# Analyze each document in the list
for document in documents:
    document_data = document["data"]
    document_id = document["id"]

    # Analyze the whole document in a single call
    poller = client.begin_analyze_document(model_id='prebuilt-read', document=document_data)

    # Wait for the analysis to complete
    result = poller.result()

    # Extract the text and page numbers
    content_total = ''
    for page in result.pages:
        page_number = page.page_number
        # Join lines with a newline so words on adjacent lines do not run together
        page_text = '\n'.join(line.content for line in page.lines)
        content_total += f'<Page {page_number:03d}>' + page_text

    print(f"Document ID: {document_id}")
    print(content_total)

The repro code above uses the begin_analyze_document method of the DocumentAnalysisClient class to analyze the entire PDF file in one call, which is expected to be faster than calling it for each page. Additionally, the code extracts the text using the content attribute of each line object on the page (DocumentLine in this SDK) instead of the extractText method of the PyPDF2.pdf.PageObject class. Finally, the code adds the page number to the beginning of the text using f-strings, which is a more readable way of formatting strings.
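
An alternative to embedding <Page xxx> markers is to keep the text untouched and look up the page from character offsets. The sketch below assumes each entry in result.pages carries page_number and a spans list with offset and length into the full result.content string, and that those offsets are in the same units as Python string indices. The helper names build_page_index and page_for_offset are hypothetical, and the SimpleNamespace objects only stand in for a real analysis result.

```python
from bisect import bisect_right
from types import SimpleNamespace


def build_page_index(pages):
    """Return sorted (start_offset, page_number) pairs, one per page span."""
    return sorted((span.offset, page.page_number) for page in pages for span in page.spans)


def page_for_offset(index, offset):
    """Return the page whose span starts at or before the given character offset."""
    if not index:
        return None
    starts = [start for start, _ in index]
    position = bisect_right(starts, offset) - 1
    # Offsets before the first span fall back to the first page.
    return index[max(position, 0)][1]


# Stand-in pages that mimic the assumed shape of result.pages.
fake_pages = [
    SimpleNamespace(page_number=1, spans=[SimpleNamespace(offset=0, length=100)]),
    SimpleNamespace(page_number=2, spans=[SimpleNamespace(offset=101, length=50)]),
]

index = build_page_index(fake_pages)
for chunk_start in (0, 60, 120):
    print(chunk_start, page_for_offset(index, chunk_start))
```

With this lookup, a chunk cut from result.content at a known start offset can be tagged with its start page without modifying the text itself.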

I hope you understand. Thank you.

If this answers your query, do click Accept Answer and Yes for was this answer helpful.
