Submit document blob path to Azure Document Intelligence service?

Dilip Jain 20 Reputation points
2025-12-11T05:43:10.44+00:00

I'm using the Azure AI Document Intelligence Python SDK (azure-ai-documentintelligence) to analyze documents stored in Azure Blob Storage. My current workflow involves:

  1. Downloading the document from Blob Storage to my application
  2. For large documents (100+ pages), splitting the PDF into chunks in memory
  3. Sending each chunk's bytes to the Document Intelligence service

Questions:

  1. Can I pass an Azure Blob URL directly to begin_analyze_document for single document analysis instead of downloading and uploading the file bytes? I'd like to provide a blob URL (with SAS token or Managed Identity) and have Document Intelligence fetch the document directly.
  2. Does the service support page range parameters (e.g., pages="1-100") so I can analyze specific pages without splitting the PDF myself? This would allow me to process a long document in ranges without downloading/chunking.
  3. Is there a single-document equivalent to AzureBlobContentSource used in batch processing? I know begin_analyze_batch_documents supports blob sources, but it seems like overkill for single document analysis.

Is it possible to directly pass the Azure Blob Storage URL/path to the Document Intelligence service instead of downloading and uploading the file content? I want to avoid the intermediate step of fetching the blob content to my application before sending it to Document Intelligence.

Azure Document Intelligence in Foundry Tools

2 answers

  1. Dilip Kumar Jain 5 Reputation points
    2025-12-29T11:56:49.59+00:00

    While the answer provided by the previous responder is technically correct regarding the Python SDK, I've discovered that the underlying REST API does support passing Azure Blob URLs directly for single-document analysis. Here's how you can achieve this:

    Key Findings

    The SDK limitation is real, but the REST API supports it: The begin_analyze_document method in the Python SDK (azure-ai-documentintelligence) only accepts file bytes/streams or base64-encoded content. However, the REST API accepts a urlSource parameter that allows you to pass a Blob URL (with SAS token) directly.

    Page ranges are fully supported: You can use the pages query parameter (e.g., pages=1-100) to analyze specific page ranges, and the service will only process those pages.

    Working Solution

    Here's a complete example demonstrating how to pass an Azure Blob Storage URL directly to the Document Intelligence REST API:

    import requests
    import time
    from retry import retry
    
    
    class AzureFormRecognizer:
        def __init__(self, endpoint: str, key: str):
            self.endpoint = endpoint.rstrip('/')
            self.key = key
    
        @retry(tries=5, delay=5, jitter=2, backoff=2)
        def analyze_document_from_url(self, document_url: str, model_id: str = "prebuilt-document", 
                                       pages: str = None, output_format: str = None):
            """
            Analyze a document directly from an Azure Blob URL using the REST API.
            
            Args:
                document_url: Full URL to the document (including SAS token if required)
                model_id: The model to use (e.g., "prebuilt-document", "prebuilt-layout", "prebuilt-read")
                pages: Optional page range (e.g., "1-100", "1,3,5-10")
                output_format: Optional output format (e.g., "markdown")
            
            Returns:
                Operation location URL for polling results
            """
            headers = {
                "Content-Type": "application/json",
                "Ocp-Apim-Subscription-Key": self.key
            }
            
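            # Build the API URL; optional settings go in the query string so that
            # requests handles the encoding (e.g. commas in "1,3,5-10")
            url = f"{self.endpoint}/formrecognizer/documentModels/{model_id}:analyze"
            params = {"api-version": "2023-07-31"}
            
            if pages:
                params["pages"] = pages
            if output_format:
                # outputContentFormat may require a newer api-version than the
                # one used here; check the version you target before relying on it
                params["outputContentFormat"] = output_format
            
            # Pass the blob URL directly using urlSource
            data = {"urlSource": document_url}
            
            response = requests.post(url=url, params=params, headers=headers, json=data, timeout=600)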
            
            if response.status_code == 202:
                return response.headers["Operation-Location"]
            else:
                raise ConnectionError(f"Document processing initiation failed: {response.status_code} - {response.text}")
    
        @retry(exceptions=TimeoutError, tries=3, delay=5)
        def get_analyze_result(self, operation_location: str, timeout: int = 300):
            """
            Poll for and retrieve the analysis result.
            
            Args:
                operation_location: The Operation-Location URL returned from analyze_document_from_url
                timeout: Maximum time to wait for results (in seconds)
            
            Returns:
                The analyzeResult dictionary containing the document analysis
            """
            headers = {"Ocp-Apim-Subscription-Key": self.key}
            elapsed = 0
            poll_interval = 5
            
            while elapsed < timeout:
                response = requests.get(operation_location, headers=headers, timeout=60)
                
                if response.status_code != 200:
                    raise RuntimeError(f"Failed to get result: {response.status_code} - {response.text}")
                
                data = response.json()
                status = data.get("status")
                
                if status == "succeeded":
                    return data.get("analyzeResult")
                elif status == "failed":
                    error = data.get("error", {})
                    raise RuntimeError(f"Document analysis failed: {error.get('message', 'Unknown error')}")
                
                # Status is "running" or "notStarted" - continue polling
                time.sleep(poll_interval)
                elapsed += poll_interval
            
            raise TimeoutError(f"Document analysis timed out after {timeout} seconds")
    
    
    # Usage Example
    if __name__ == "__main__":
        # Your Azure Document Intelligence endpoint and key
        ENDPOINT = "https://your-resource.cognitiveservices.azure.com/"
        KEY = "your-api-key"
        
        # Your blob URL with SAS token
        BLOB_URL = "https://yourstorageaccount.blob.core.windows.net/container/document.pdf"
        SAS_TOKEN = "?sv=2025-07-05&spr=https&..."  # Your SAS token
        
        document_url = BLOB_URL + SAS_TOKEN
        
        fr = AzureFormRecognizer(endpoint=ENDPOINT, key=KEY)
        
        # Analyze pages 1-50 directly from blob storage
        operation_location = fr.analyze_document_from_url(
            document_url=document_url,
            model_id="prebuilt-document",
            pages="1-50"  # Optional: specify page range
        )
        
        print(f"Processing started: {operation_location}")
        
        result = fr.get_analyze_result(operation_location)
        
        print(f"API Version: {result['apiVersion']}")
        print(f"Model ID: {result['modelId']}")
        print(f"Pages analyzed: {len(result['pages'])}")
    
    API Request
    • endpoint:
      POST {endpoint}/formrecognizer/documentModels/{modelId}:analyze?api-version=2023-07-31&pages={pageRange}
    • request body: {"urlSource":"https://yourstorageaccount.blob.core.windows.net/container/document.pdf?{SAS_TOKEN}"}
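
    The request above can be assembled without any network call, which makes it easy to inspect before sending. The following is a minimal sketch using only the standard library; build_analyze_request is a hypothetical helper name, the SAS value is a placeholder, and it assumes the blob URL does not already carry a query string.

```python
from urllib.parse import quote, urlencode


def build_analyze_request(endpoint, model_id, blob_url, sas_token="", pages=None,
                          api_version="2023-07-31"):
    """Return (url, body) for an analyze call that uses urlSource.

    Nothing is sent here; the caller still has to POST the body as JSON.
    """
    # Join the blob URL and SAS token with exactly one "?" between them.
    source = blob_url
    if sas_token:
        source = f"{blob_url}?{sas_token.lstrip('?')}"

    query = {"api-version": api_version}
    if pages:
        query["pages"] = pages  # e.g. "1-50" or "1-10,15,20-30"

    path = f"/formrecognizer/documentModels/{quote(model_id)}:analyze"
    # safe="," keeps page lists readable instead of encoding commas as %2C.
    url = f"{endpoint.rstrip('/')}{path}?{urlencode(query, safe=',')}"
    return url, {"urlSource": source}


url, body = build_analyze_request(
    "https://your-resource.cognitiveservices.azure.com/",
    "prebuilt-document",
    "https://yourstorageaccount.blob.core.windows.net/container/document.pdf",
    sas_token="?sv=PLACEHOLDER",  # placeholder, not a real SAS token
    pages="1-50",
)
print(url)
print(body)
```

    The returned pair can be passed to requests.post(url, json=body, headers=...) with the Ocp-Apim-Subscription-Key header used in the class above.
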
    Benefits of This Approach

    Feature                | SDK (begin_analyze_document) | REST API (urlSource)
    Pass Blob URL directly | ❌ Not supported             | ✅ Supported
    Page range support     | ✅ Supported                 | ✅ Supported
    Avoids download/upload | ❌ Must download first       | ✅ Service fetches directly
    Network efficiency     | Lower (double transfer)      | Higher (single transfer)

    Note:
    1. Access: The blob URL can be authorized with a SAS token, public access, or a managed identity.
    2. Page Ranges: The pages parameter accepts various formats:
      • Single page: pages=1
      • Range: pages=1-100
      • Multiple ranges: pages=1-10,15,20-30
    3. Service Limits Still Apply: While this avoids the download/upload step, the service's document size limits still apply. I was unable to get a 2000-page PDF to work, even with a page range covering a single page.
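
    To process a long document in consecutive ranges, the pages values can be generated rather than written by hand. This is a small sketch, not part of any SDK: page_ranges is a hypothetical helper, and it assumes you already know the total page count and that the document as a whole is within the service limits mentioned in note 3.

```python
def page_ranges(total_pages, chunk_size):
    """Yield pages-parameter strings such as "1-100", "101-200", ..."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A chunk of one page is written as a single number, e.g. "201".
        yield str(start) if start == end else f"{start}-{end}"


print(list(page_ranges(250, 100)))  # ['1-100', '101-200', '201-250']
```

    Each yielded string can be passed as the pages argument of analyze_document_from_url, one request per range.
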
    References:
    1. Using Azure Document Intelligence - REST API
    2. Shared access signature (SAS) tokens for storage containers
    3. Configure secure access with managed identities

    1 person found this answer helpful.

  2. SRILAKSHMI C 19,005 Reputation points Microsoft External Staff Moderator
    2025-12-18T09:47:52.6166667+00:00

    Hello Dilip Jain,

    Welcome to Microsoft Q&A, and thank you for reaching out.

    I understand that you're working on optimizing your workflow for analyzing documents with Azure AI Document Intelligence, and you have a few great questions. Let’s break down your queries:

    1. Passing an Azure Blob URL directly for single-document analysis

    For single document analysis, Azure Document Intelligence does not currently support passing an Azure Blob URL (with SAS or Managed Identity) directly to begin_analyze_document.

    That API only accepts:

    • File bytes / file streams, or
    • Base64-encoded document content

    So today, for single-document calls, the service cannot fetch the file directly from Blob Storage. The download-then-upload step is required.
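
    For reference, the Base64 option mentioned above corresponds to a JSON request body with a base64Source field in the REST API. The sketch below only builds that body from bytes you have already downloaded; build_base64_body is a hypothetical helper name and the sample bytes are a stand-in for a real file.

```python
import base64
import json


def build_base64_body(document_bytes):
    """Wrap raw document bytes in a JSON body using the base64Source field."""
    encoded = base64.b64encode(document_bytes).decode("ascii")
    return json.dumps({"base64Source": encoded})


# Stand-in bytes; in practice these would come from the downloaded blob.
body = build_base64_body(b"%PDF-1.7 example bytes")
print(len(body))
```
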

    Direct blob access is supported only for batch APIs, not for single-document analysis.

    2. Analyzing specific page ranges (e.g., pages="1-100")

    Page range parameters are supported in Azure Document Intelligence, but with an important limitation.

    You can specify page ranges using the pages parameter (for example: "1-10", "11-20"), which controls which pages are analyzed by the service.

    However:

    • The entire document must still be uploaded to Azure Document Intelligence
    • Page ranges do not allow partial uploads or bypass downloading/uploading the full file
    • If the document exceeds the service limits (for example, size limits), specifying a page range will not bypass those limits

    Please refer to the Azure AI Document Intelligence documentation.

    As a result, while using pages can help reduce processing cost and output size, it does not avoid the need to upload the full document. For very large files that exceed size limits, the document must still be split into smaller files before analysis.
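
    Because the pages value is a plain string, it can be worth checking it on the client before sending a request. The following sketch expands a value such as "1-3,5" into page numbers; expand_pages is a hypothetical helper, and the accepted syntax is assumed from the examples in this thread rather than taken from the service specification, which performs its own validation.

```python
import re

# Comma-separated items, each a single page ("5") or a range ("1-3").
_PAGES_RE = re.compile(r"^\d+(-\d+)?(,\d+(-\d+)?)*$")


def expand_pages(pages):
    """Expand a string like "1-3,5" into a sorted list of 1-based page numbers."""
    if not _PAGES_RE.match(pages):
        raise ValueError(f"Unrecognised pages value: {pages!r}")
    selected = set()
    for part in pages.split(","):
        start, _, end = part.partition("-")
        first, last = int(start), int(end or start)
        if first < 1 or last < first:
            raise ValueError(f"Invalid range: {part!r}")
        selected.update(range(first, last + 1))
    return sorted(selected)


print(expand_pages("1-3,5,8-9"))  # [1, 2, 3, 5, 8, 9]
```
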

    3. Single-document equivalent of AzureBlobContentSource

    Currently, Azure Document Intelligence does not provide a single-document equivalent of AzureBlobContentSource.

    Blob-based inputs are supported only for batch or asynchronous workflows, such as:

    • begin_analyze_batch_documents
    • Other batch or blob-to-blob processing APIs

    These APIs are designed for processing large document collections and support scenarios where the service reads documents directly from Azure Blob Storage.

    For single-document analysis, the SDK does not expose a way to pass a Blob URL (with SAS or Managed Identity) directly. The supported options are:

    • Upload the document content (bytes/stream) directly to the service, or
    • Use a batch API even for a single document if you want a blob-to-blob workflow

    As a result, for single-document scenarios, downloading and uploading the document content remains the required approach today.

    While these limitations can be a bit challenging, they help the service perform efficiently within its constraints. I would recommend staying updated with any changes in Azure Document Intelligence capabilities, as Microsoft frequently improves its services.

    I hope this helps. Do let me know if you have any further queries.


    If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful?".

    Thank you!
