Azure Document Intelligence Python SDK Returns Data Only for First Page

Question

Kiran Dhumma 20 Reputation points

Hi - I am encountering an issue while using the azure.ai.documentintelligence Python library to extract data from a PDF using page ranges.

My application processes the document page by page, and this workflow was working correctly until this morning. Starting today, however, when I request specific pages, the service returns results only for the first page, regardless of the page range specified.

There have been no code changes on my side. Has anyone else experienced this issue, or is there a recent service update or known limitation affecting page-range extraction?

Gowtham CP 7,085 Reputation points Volunteer Moderator

2025-12-17T07:27:32.3466667+00:00

Hi Kiran Dhumma

Thanks for the question.

1. SDK usage can cause only the first page to be returned
If only the first page is returned, this is often related to how the Python SDK call is made. In azure.ai.documentintelligence, the page selection must be passed as the pages keyword argument of begin_analyze_document, and your SDK version must support page ranges. The REST API supports this as well, but there pages is a query parameter on the analyze request, not a field in the JSON request body.

2. Free tier limitation
If your Document Intelligence resource is on the F0 (free) tier, the service analyzes only the first two pages of a PDF, regardless of the page range specified. Upgrading to S0 (standard) removes this limitation and enables full multi-page extraction.
Reference: https://learn.microsoft.com/answers/questions/5569121/azure-document-intelligence-only-analyzing-2-docum

3. Document type matters
Page-range behavior is best supported for PDF and TIFF files. For formats like DOCX or XLSX, the SDK may not treat content as page-based, which can make it look as if only one page was processed.
Reference: https://stackoverflow.com/questions/79475225/azure-documen-intelligence-python-sdk-doesnt-separate-pages

Could you please share the Python SDK code snippet you are using (including the client initialization and analyze call)? That will help confirm whether the request is being sent correctly.
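
To narrow down which of the points above applies, it may help to compare the pages you requested with the page numbers the service actually returned. The following is only a minimal sketch, not an official utility: parse_page_spec and compare_pages are hypothetical helper names, and the SimpleNamespace object is a stand-in for a real analyze result, which is assumed to expose a pages list whose entries carry a page_number.

```python
from types import SimpleNamespace
from typing import Set


def parse_page_spec(spec: str) -> Set[int]:
    """Expand a page string such as "1-3,5" into {1, 2, 3, 5}."""
    pages: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return pages


def compare_pages(requested: str, result) -> dict:
    """Compare requested pages with the page numbers present in a result.

    Assumption: `result.pages` is a list of objects with `page_number`.
    """
    wanted = parse_page_spec(requested)
    returned = {page.page_number for page in (result.pages or [])}
    return {
        "requested": sorted(wanted),
        "returned": sorted(returned),
        "missing": sorted(wanted - returned),
        "unexpected": sorted(returned - wanted),
    }


# Stand-in for a real result so the sketch runs offline:
# page 3 was requested, but only page 1 came back.
fake_result = SimpleNamespace(pages=[SimpleNamespace(page_number=1)])
report = compare_pages("3", fake_result)
print(report)
```

If "missing" is non-empty while "unexpected" contains page 1, the request was accepted but the page selection was not honored, which would be useful detail to include when reporting the problem.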

Kiran Dhumma 20 Reputation points

Hi @Gowtham CP, here are the details for your reference:

SDK & Version: azure-ai-documentintelligence>=1.0.2
Resource tier:
Code snippet:

import json
from typing import Dict, List, Any, Optional
from collections import defaultdict
import logging
from doc_processing.helpers import get_openai_client
from doc_processing.azure_key_vault import get_secret
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from azure.core.credentials import AzureKeyCredential
from doc_processing.constants import AZURE_OPENAI_DEPLOYMENT, INTERNAL_ERROR

# Configure logging
logger = logging.getLogger(__name__)



def get_document_intelligence_client():
    """Initialize Document Intelligence client with credential validation"""
    endpoint = get_secret("AZURE-DOC-INTELLIGENCE-ENDPOINT")
    key = get_secret("AZURE-DOC-INTELLIGENCE-KEY")
    
    if not endpoint or not key:
        raise ValueError("Missing Azure Document Intelligence credentials in environment variables")
    
    return DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))

def analyze_form_with_page_range(document_url: str, retry_attempts: int = 3, retry_delay_seconds: int = 1, pages: Optional[str] = None) -> dict:
    """
    Analyze document using Azure Document Intelligence
    
    Args:
        document_url: SAS URL of the document
        retry_attempts: Maximum number of analysis attempts
        retry_delay_seconds: Delay between attempts, in seconds (the sleep call is currently commented out)
        pages: Comma-separated page numbers or ranges, e.g. '1' or '1-3,5' (optional)
        
    Returns:
        dict: Extracted content, key-value pairs, and tables
    """
    if not document_url:
        raise ValueError("Document URL is required")

    print(f"[DOC_INTEL] Analyzing document at URL for pages: {pages if pages else 'all'}")
    logger.info(f"Analyzing document at {document_url} for pages: {pages}")

    try:
        client = get_document_intelligence_client()
        # delay_strategy = DelayStrategy.CreateExponentialDelayStrategy(initialDelay = default, maxDelay = default)
        # retry_policy = RetryPolicy(maxRetries = 3, )
        # client_options = DocumentIntelligenceClientOptions()

        analyze_request = AnalyzeDocumentRequest(url_source=document_url)

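        # Iterate over every attempt number; a bare tuple (1, retry_attempts) would only try twice
        for attempt in range(1, retry_attempts + 1):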
            poller = client.begin_analyze_document(
                "prebuilt-layout",
                analyze_request,
                features=["keyValuePairs"],
                pages=pages
            )

            print(f"[DOC_INTEL] Analysis started, waiting for completion...")
            result = poller.result()
            # for i in result:
            #     print(i)
            content = result.content

            if content:
                key_value_pairs = []
                extracted_tables = []
                paragraphs = []

                if hasattr(result, "key_value_pairs") and result.key_value_pairs:
                    for kv in result["keyValuePairs"]:
                        key = kv["key"]["content"] if "key" in kv and kv["key"] else ""
                        value = kv["value"]["content"] if "value" in kv and kv["value"] else ""
                        confidence = kv.get("confidence", None)
                    
                        key_value_pairs.append({
                            "key": key,
                            "value": value,
                            "confidence": confidence
                        })

                # Extract tables (content only, no coordinates)

                if "tables" in result and result["tables"]:
                    for table_obj in result["tables"]:
                        row_count = table_obj.get("rowCount", 0)
                        column_count = table_obj.get("columnCount", 0)
                    
                        table_cells = [["" for _ in range(column_count)] for _ in range(row_count)]
                    
                        if "cells" in table_obj:
                            for cell in table_obj["cells"]:
                                row_index = cell.get("rowIndex", 0)
                                column_index = cell.get("columnIndex", 0)
                                table_cells[row_index][column_index] = cell.get("content", "")
                    
                        extracted_tables.append({
                            "rowCount": row_count,
                            "columnCount": column_count,
                            "rows": table_cells
                        })

                if "paragraphs" in result and result["paragraphs"]:
                    for para in result["paragraphs"]:
                        para_content = para.get("content", "")
                   
                        # Extract bounding regions (polygon and page number)
                        bounding_regions = []
                        if "boundingRegions" in para and para["boundingRegions"]:
                            for region in para["boundingRegions"]:
                                bounding_regions.append({
                                    "pageNumber": region.get("pageNumber"),
                                    "polygon": list(region.get("polygon", []))
                                })
                    
                        paragraphs.append({
                            "content": para_content,
                            "boundingRegions": bounding_regions
                        })

                print(f"[DOC_INTEL] Analysis completed - extracted {len(key_value_pairs)} key-value pairs and {len(extracted_tables)} tables")
                logger.info(f"Document analysis complete for {document_url} pages: {pages}")

                return {
                "content": content,
                "key_value_pairs": key_value_pairs,
                "tables": extracted_tables,
                "paragraphs": paragraphs
                }
            
            else:
                if attempt == retry_attempts:
                    print("[DOC_INTEL] Extraction failed Max retries reached.")
                    return False

                else:
                    print(f"[DOC_INTEL] Extraction failed for attempt:{attempt}")
                    print(f"[DOC_INTEL] Retrying in {retry_delay_seconds} seconds...")
                    # time.sleep(retry_delay_seconds)
    except Exception as e:
        logger.error(f"Analyze document with Azure doc intelligence failed: {e}")


## Execution

openai_client = get_openai_client()

page_range = 4
# counter = 1
json_state = {}

for i in range(1, page_range + 1):
    file_name = "extraction_op_v3." + str(i)
    content = analyze_form_with_page_range("<blob_url>", pages=str(i))
    
    with open(f"{file_name}.json", "w") as f:
        json.dump(content, f, indent=4)
    
    page_status = get_page_status(i, page_range)
    json_state = generate_structured_output(content, json_state, page_status, i)
    # counter += 1

with open("Std_output_v2.1.json", "w") as f:
    json.dump(json_state, f, indent=4)

txn_idx = extract_transaction_page_table_pairs(json_state)
# print(txn_idx)
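
For anyone reproducing this, the JSON files written by the loop above can be checked offline to see which page numbers each request really returned. This is a minimal sketch that assumes the output shape produced by the snippet (a paragraphs list whose entries have boundingRegions carrying a pageNumber). pages_seen is a hypothetical helper, and the sample data is made up for illustration.

```python
from collections import Counter
from typing import Any, Dict


def pages_seen(output: Any) -> Dict[int, int]:
    """Count paragraph bounding regions per page number in one output dict."""
    # The analyze function in the snippet may return False or None on failure.
    if not output:
        return {}
    counts: Counter = Counter()
    for para in output.get("paragraphs", []):
        for region in para.get("boundingRegions", []):
            number = region.get("pageNumber")
            if number is not None:
                counts[number] += 1
    return dict(sorted(counts.items()))


# Made-up output for a request that asked for page 3:
# every region still reports page 1.
sample = {
    "content": "example text",
    "paragraphs": [
        {"content": "first", "boundingRegions": [{"pageNumber": 1, "polygon": []}]},
        {"content": "second", "boundingRegions": [{"pageNumber": 1, "polygon": []}]},
    ],
}
summary = pages_seen(sample)
print(summary)
```

In practice, each extraction_op_v3.N.json file could be loaded with json.load and passed to the helper; if every file reports only page 1, the page selection is being ignored on the service side rather than lost in post-processing.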

Kiran Dhumma 20 Reputation points

2025-12-17T10:14:39.1+00:00

Any updates on the issue, @Gowtham CP?

Fremin Abreu Ortega 0 Reputation points

2025-12-17T16:10:14.28+00:00

I'm facing the same issue using the C# SDK. It began happening without any code changes; yesterday the same code worked fine.

Answer 1

RAHUL SHEDGE 0 Reputation points

Could you please confirm if any resolution or fix has been implemented for this issue?

Kiran Dhumma 20 Reputation points

2025-12-18T13:35:07.3933333+00:00

No, not yet. I am still facing this issue.
