How to extract the data from All Pages?

Question

Shuvojit Das 20 Reputation points

Hi,
I want to extract data from all pages of a PDF, but I haven't been able to. With this code I can only extract data from the first two pages. As far as I know, setting pages_to_analyze = None should extract data from all pages, but in my case it isn't working. Can anyone help me fix this issue?

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient
from tabulate import tabulate
import json
import os
endpoint = " "
key = " "
local_document_path = r" "
document_name = os.path.splitext(os.path.basename(local_document_path))[0]
def analyze_document(endpoint, key, local_document_path):
    try:
        # Read the document content
        with open(local_document_path, "rb") as f:
            document_content = f.read()
            
        
        # Initialize DocumentAnalysisClient
        document_analysis_client = DocumentAnalysisClient(
            endpoint=endpoint, credential=AzureKeyCredential(key)
        )
        pages_to_analyze = None
        
        # Begin the analyze document operation
        poller = document_analysis_client.begin_analyze_document(
            "prebuilt-document", document=document_content, pages=pages_to_analyze
        )
        result = poller.result()
        
        # Extract key-value pairs from the analyzed result
        key_value_pairs = []
        for kv_pair in result.key_value_pairs:
            if kv_pair.key and kv_pair.value:
                key_value_pairs.append({"Key": kv_pair.key.content, "Value": kv_pair.value.content})
        # Extract tables from the analyzed result
        tables = []
        for table in result.tables:
            doc_table_dict = table.to_dict()
            # Use the text of the first row's cells as the column headers
            headers = [cell['content'] for cell in doc_table_dict['cells'] if cell['row_index'] == 0]
            rows = [[cell['content'] for cell in doc_table_dict['cells'] if cell['row_index'] == i] for i in range(1, doc_table_dict['row_count'])]
            tables.append({"Headers": headers, "Rows": rows})
        # Combine key-value pairs and tables into a single JSON structure
        result_json = {"Key-Value Pairs": key_value_pairs, "Tables": tables}
        return result_json
    except Exception as e:
        return {"Error": str(e)}
# Call the function
result_json = analyze_document(endpoint, key, local_document_path)
json_file_path = f" "
# Save the JSON structure to a file
with open(json_file_path, "w") as json_file:
    json.dump(result_json, json_file, indent=2)
# Print a message indicating the file has been saved
print(f"JSON structure saved to {json_file_path}")

dupammi 8,615 Reputation points Microsoft External Staff

2024-02-15T04:11:41.1033333+00:00

Hi @Shuvojit Das
Following up to see if you got a chance to check my response below.

If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful?".

Accepted answer

Answer 1

Hi @Shuvojit Das

Thank you for reaching out to the Microsoft Q&A forum and for providing your code snippet.

I understand that you want to extract data from all pages of the PDF using Azure AI Document Intelligence, but you are only getting data from the first two pages. Your code sets pages_to_analyze to None, which is correct and requests every page. The likely cause is the pricing tier: on the Free tier, the maximum number of pages analyzed per document is 2. To resolve this, upgrade to the Standard pricing tier.
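To confirm whether pages are being dropped, one option is to compare the page numbers in the result with the number of pages you expect. The sketch below is only an illustration: `missing_pages` is a hypothetical helper, not part of the SDK, and the expected page count is assumed to be known in advance. It relies on each entry of `result.pages` having a `page_number` attribute, and it simulates the two-page case rather than calling the service.

```python
def missing_pages(analyzed_page_numbers, expected_page_count):
    """Return the 1-based page numbers that are absent from the analysis result."""
    analyzed = set(analyzed_page_numbers)
    return [n for n in range(1, expected_page_count + 1) if n not in analyzed]


# With the code from the question, the analyzed page numbers would come from:
#     analyzed = [page.page_number for page in result.pages]
# Here a 3-page PDF for which only two pages came back is simulated instead.
analyzed = [1, 2]
expected_page_count = 3  # assumption: known in advance or read with a PDF library

skipped = missing_pages(analyzed, expected_page_count)
if skipped:
    print(f"Pages not analyzed: {skipped}")
else:
    print("All pages were analyzed.")
```

If the list is non-empty even though pages_to_analyze is None, the limit is coming from the service side rather than from the code.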

Here is a reproduction of the process, which I executed successfully on a 3-page PDF using the code snippet provided in the question.

Refer to the Python reference for Document Intelligence.
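For completeness, the pages argument can also restrict analysis to specific pages. As I understand the SDK reference, it takes a string of page numbers and hyphenated ranges separated by commas (for example "1-3,5"), while None requests every page. The helper below, `format_pages`, is a hypothetical name of my own and only builds such a string from a list of page numbers; please verify the accepted format against the reference for your SDK version.

```python
def format_pages(page_numbers):
    """Collapse page numbers into a compact string such as "1-3,5"."""
    pages = sorted(set(page_numbers))
    if not pages:
        raise ValueError("at least one page number is required")

    ranges = []
    start = prev = pages[0]
    for n in pages[1:]:
        if n == prev + 1:
            # Still inside a consecutive run; extend it
            prev = n
            continue
        # Run ended; record it and start a new one
        ranges.append((start, prev))
        start = prev = n
    ranges.append((start, prev))

    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


pages_to_analyze = format_pages([1, 2, 3, 5])
print(pages_to_analyze)  # prints 1-3,5
```

The resulting string would be passed as pages=pages_to_analyze in the begin_analyze_document call from the question.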

Hope this helps.

If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful?".

Shuvojit Das 20 Reputation points

2024-02-15T05:39:00.0366667+00:00

I will upgrade the tier and let you know.
Thanks.
