How to extract the data from All Pages?

Shuvojit Das 20 Reputation points
2024-02-14T11:56:29.31+00:00

Hi,
I want to extract data from all pages of a PDF, but I haven't been able to. With the code below I can only extract data from the first two pages. As far as I know, setting pages_to_analyze = None should extract data from all pages, but in my case it isn't working. Can anyone help me fix this issue?

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient
from tabulate import tabulate
import json
import os
endpoint = " "
key = " "
local_document_path = r" "
document_name = os.path.splitext(os.path.basename(local_document_path))[0]
def analyze_document(endpoint, key, local_document_path):
    try:
        # Read the document content
        with open(local_document_path, "rb") as f:
            document_content = f.read()
            
        
        # Initialize DocumentAnalysisClient
        document_analysis_client = DocumentAnalysisClient(
            endpoint=endpoint, credential=AzureKeyCredential(key)
        )
        pages_to_analyze = None
        
        # Begin the analyze document operation
        poller = document_analysis_client.begin_analyze_document(
            "prebuilt-document", document=document_content, pages=pages_to_analyze
        )
        result = poller.result()
        
        # Extract key-value pairs from the analyzed result
        key_value_pairs = []
        for kv_pair in result.key_value_pairs:
            if kv_pair.key and kv_pair.value:
                key_value_pairs.append({"Key": kv_pair.key.content, "Value": kv_pair.value.content})
        # Extract tables from the analyzed result
        tables = []
        for table in result.tables:
            doc_table_dict = table.to_dict()
            # Use the text of the first row as the column headers
            headers = [cell['content'] for cell in doc_table_dict['cells'] if cell['row_index'] == 0]
            rows = [[cell['content'] for cell in doc_table_dict['cells'] if cell['row_index'] == i] for i in range(1, doc_table_dict['row_count'])]
            tables.append({"Headers": headers, "Rows": rows})
        # Combine key-value pairs and tables into a single JSON structure
        result_json = {"Key-Value Pairs": key_value_pairs, "Tables": tables}
        return result_json
    except Exception as e:
        return {"Error": str(e)}
# Call the function
result_json = analyze_document(endpoint, key, local_document_path)
json_file_path = f" "
# Save the JSON structure to a file
with open(json_file_path, "w") as json_file:
    json.dump(result_json, json_file, indent=2)
# Print a message indicating the file has been saved
print(f"JSON structure saved to {json_file_path}")
Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

Accepted answer
  1. dupammi 8,615 Reputation points Microsoft External Staff
    2024-02-14T13:47:20.0033333+00:00

    Hi @Shuvojit Das

    Thank you for reaching out to the Microsoft Q&A forum and for providing your code snippet.

    I understand that you want to extract data from all pages of the PDF using Azure AI Document Intelligence, but you are only getting data from the first two pages. I see that your code sets the pages_to_analyze parameter to None, so the page selection is not what limits the output. If you are using the Free pricing tier, the maximum number of pages analyzed per document is 2. To resolve this issue, try upgrading to the Standard pricing tier.
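One way to confirm that pages are being cut off is to compare the page numbers in the result with the page count of the PDF. The following is a minimal sketch, not part of the original code: the helper names are hypothetical, and it assumes the result object exposes `pages` entries with a `page_number` attribute, as `AnalyzeResult` does in the `azure-ai-formrecognizer` package. A stand-in object is used so the sketch runs without an Azure resource.

```python
from types import SimpleNamespace


def analyzed_page_numbers(result):
    """Return the 1-based page numbers present in an analysis result."""
    return [page.page_number for page in result.pages]


def report_page_coverage(result, expected_page_count):
    """Compare the pages the service returned with the pages the PDF has."""
    returned = analyzed_page_numbers(result)
    missing = [n for n in range(1, expected_page_count + 1) if n not in returned]
    return {"returned": returned, "missing": missing}


# Stand-in for an AnalyzeResult (assumption: only .pages[*].page_number is needed).
fake_result = SimpleNamespace(
    pages=[SimpleNamespace(page_number=1), SimpleNamespace(page_number=2)]
)

coverage = report_page_coverage(fake_result, expected_page_count=3)
print(coverage)  # {'returned': [1, 2], 'missing': [3]}
```

In the question's code, the same check could be applied to `poller.result()` before building the JSON. If pages are missing even though `pages` was left as None, the pricing tier is a likely cause.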

    Here is a reproduction of the process, which I ran successfully on a 3-page PDF using the code snippet from the question. (screenshot)

    Output file: (screenshot)

    Refer to the Python reference for Document Intelligence.
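As a small illustration of the `pages` keyword discussed in this thread: to my understanding, the SDK documents it as a string of page numbers and ranges such as "1-3, 5-6", and omitting it requests every page. The sketch below uses a hypothetical helper, `build_analyze_kwargs`, that is not part of the SDK. It builds the keyword arguments so that `pages` is only passed when a selection is actually wanted.

```python
def build_analyze_kwargs(pages=None):
    """Build optional keyword arguments for begin_analyze_document.

    pages may be None (analyze every page), a ready-made string such as
    "1-3, 5-6", or an iterable of 1-based page numbers.
    """
    kwargs = {}
    if pages is None:
        # No selection: leave the keyword out so all pages are requested.
        return kwargs
    if isinstance(pages, str):
        kwargs["pages"] = pages
    else:
        # Join individual page numbers into the comma-separated form.
        kwargs["pages"] = ",".join(str(n) for n in pages)
    return kwargs


print(build_analyze_kwargs())           # {}
print(build_analyze_kwargs([1, 2, 3]))  # {'pages': '1,2,3'}
```

With the client from the question, the call would then look like `document_analysis_client.begin_analyze_document("prebuilt-document", document=document_content, **build_analyze_kwargs())`.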

    Hope this helps.


    If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful".

