While the previous answer is technically correct about the Python SDK, I've found that the underlying REST API does accept Azure Blob URLs directly for single-document analysis. Here's how you can do it:
Key Findings
The SDK limitation is real, but the REST API supports it: The begin_analyze_document method in the Python SDK (azure-ai-documentintelligence) only accepts file bytes/streams or base64-encoded content. However, the REST API accepts a urlSource parameter that allows you to pass a Blob URL (with SAS token) directly.
Page ranges are fully supported: You can use the pages query parameter (e.g., pages=1-100) to analyze specific page ranges, and the service will only process those pages.
Working Solution
Here's a complete example demonstrating how to pass an Azure Blob Storage URL directly to the Document Intelligence REST API:
```python
import time
from typing import Optional

import requests
from retry import retry  # third-party package: pip install retry


class AzureFormRecognizer:
    def __init__(self, endpoint: str, key: str):
        self.endpoint = endpoint.rstrip('/')
        self.key = key

    # Note: without an `exceptions` argument this retries on any exception,
    # including non-transient errors such as a 4xx response.
    @retry(tries=5, delay=5, jitter=2, backoff=2)
    def analyze_document_from_url(self, document_url: str, model_id: str = "prebuilt-document",
                                  pages: Optional[str] = None, output_format: Optional[str] = None):
        """
        Analyze a document directly from an Azure Blob URL using the REST API.

        Args:
            document_url: Full URL to the document (including SAS token if required)
            model_id: The model to use (e.g., "prebuilt-document", "prebuilt-layout", "prebuilt-read")
            pages: Optional page range (e.g., "1-100", "1,3,5-10")
            output_format: Optional output format (e.g., "markdown"). As far as I know,
                this is only honored by API versions newer than 2023-07-31.

        Returns:
            Operation location URL for polling results
        """
        headers = {
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": self.key
        }

        # Build the API URL with optional query parameters
        url = f"{self.endpoint}/formrecognizer/documentModels/{model_id}:analyze?api-version=2023-07-31"
        if pages:
            url += f"&pages={pages}"
        if output_format:
            url += f"&outputContentFormat={output_format}"

        # Pass the blob URL directly using urlSource
        data = {"urlSource": document_url}
        response = requests.post(url=url, headers=headers, json=data, timeout=600)

        # 202 Accepted means the analysis was queued; the header holds the polling URL
        if response.status_code == 202:
            return response.headers["Operation-Location"]
        else:
            raise ConnectionError(f"Document processing initiation failed: {response.status_code} - {response.text}")

    @retry(exceptions=TimeoutError, tries=3, delay=5)
    def get_analyze_result(self, operation_location: str, timeout: int = 300):
        """
        Poll for and retrieve the analysis result.

        Args:
            operation_location: The Operation-Location URL returned from analyze_document_from_url
            timeout: Maximum time to wait for results (in seconds)

        Returns:
            The analyzeResult dictionary containing the document analysis
        """
        headers = {"Ocp-Apim-Subscription-Key": self.key}
        elapsed = 0  # counts sleep time only, not the time spent in each request
        poll_interval = 5

        while elapsed < timeout:
            response = requests.get(operation_location, headers=headers, timeout=60)
            if response.status_code != 200:
                raise RuntimeError(f"Failed to get result: {response.status_code} - {response.text}")

            data = response.json()
            status = data.get("status")
            if status == "succeeded":
                return data.get("analyzeResult")
            elif status == "failed":
                error = data.get("error", {})
                raise RuntimeError(f"Document analysis failed: {error.get('message', 'Unknown error')}")

            # Status is "running" or "notStarted" - continue polling
            time.sleep(poll_interval)
            elapsed += poll_interval

        raise TimeoutError(f"Document analysis timed out after {timeout} seconds")


# Usage Example
if __name__ == "__main__":
    # Your Azure Document Intelligence endpoint and key
    ENDPOINT = "https://your-resource.cognitiveservices.azure.com/"
    KEY = "your-api-key"

    # Your blob URL with SAS token
    BLOB_URL = "https://yourstorageaccount.blob.core.windows.net/container/document.pdf"
    SAS_TOKEN = "?sv=2025-07-05&spr=https&..."  # Your SAS token
    document_url = BLOB_URL + SAS_TOKEN

    fr = AzureFormRecognizer(endpoint=ENDPOINT, key=KEY)

    # Analyze pages 1-50 directly from blob storage
    operation_location = fr.analyze_document_from_url(
        document_url=document_url,
        model_id="prebuilt-document",
        pages="1-50"  # Optional: specify page range
    )
    print(f"Processing started: {operation_location}")

    result = fr.get_analyze_result(operation_location)
    print(f"API Version: {result['apiVersion']}")
    print(f"Model ID: {result['modelId']}")
    print(f"Pages analyzed: {len(result['pages'])}")
```
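Once the result comes back, the analyzeResult dictionary can be walked like any other JSON. The sketch below assumes the response shape I have seen from the layout-style models, where each entry in pages carries a pageNumber and a list of lines with a content string. The helper name text_by_page is my own, and the sample dictionary is made up for illustration rather than real service output.

```python
from typing import Dict, List


def text_by_page(analyze_result: dict) -> Dict[int, List[str]]:
    """Map each page number to the text lines found on that page."""
    pages: Dict[int, List[str]] = {}
    for page in analyze_result.get("pages", []):
        # A page without any recognized lines yields an empty list.
        lines = [line.get("content", "") for line in page.get("lines", [])]
        pages[page["pageNumber"]] = lines
    return pages


# Made-up sample in the assumed response shape (not real service output).
sample_result = {
    "apiVersion": "2023-07-31",
    "modelId": "prebuilt-document",
    "pages": [
        {"pageNumber": 1, "lines": [{"content": "Invoice"}, {"content": "Total: 42"}]},
        {"pageNumber": 2, "lines": []},
    ],
}

if __name__ == "__main__":
    for number, lines in text_by_page(sample_result).items():
        print(number, lines)
```

In real use you would pass the dictionary returned by get_analyze_result instead of sample_result.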
API Request
- Endpoint: POST {endpoint}/formrecognizer/documentModels/{modelId}:analyze?api-version=2023-07-31&pages={pageRange}
- Request body: {"urlSource":"https://yourstorageaccount.blob.core.windows.net/container/document.pdf?{SAS_TOKEN}"}
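To make the request shape concrete, here is a minimal sketch that assembles the URL and JSON body using only the standard library, without sending anything. The helper name build_analyze_request and its parameters are my own for illustration and are not part of any SDK; the endpoint and blob URL in the usage lines are placeholders.

```python
import json
from typing import Optional, Tuple
from urllib.parse import urlencode


def build_analyze_request(endpoint: str, model_id: str, document_url: str,
                          pages: Optional[str] = None,
                          api_version: str = "2023-07-31") -> Tuple[str, str]:
    """Return (url, json_body) for a urlSource analyze call (hypothetical helper)."""
    params = {"api-version": api_version}
    if pages:
        params["pages"] = pages
    # Keep "," readable in the page range; other reserved characters are percent-encoded.
    query = urlencode(params, safe=",")
    url = f"{endpoint.rstrip('/')}/formrecognizer/documentModels/{model_id}:analyze?{query}"
    # The document itself is referenced by URL in the body, not uploaded.
    body = json.dumps({"urlSource": document_url})
    return url, body


if __name__ == "__main__":
    url, body = build_analyze_request(
        "https://your-resource.cognitiveservices.azure.com/",  # placeholder endpoint
        "prebuilt-layout",
        "https://yourstorageaccount.blob.core.windows.net/container/document.pdf",
        pages="1-10,15,20-30",
    )
    print(url)
    print(body)
```

Printing the URL and body before sending them is a quick way to check that the page range and the SAS-signed blob URL end up where you expect.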
Benefits of This Approach
| Feature | SDK (begin_analyze_document) | REST API (urlSource) |
|---|---|---|
| Pass Blob URL directly | ❌ Not supported | ✅ Supported |
| Page range support | ✅ Supported | ✅ Supported |
| Avoids download/upload | ❌ Must download first | ✅ Service fetches directly |
| Network efficiency | Lower (double transfer) | Higher (single transfer) |
Note:
- Access: A SAS token, public access, or a managed identity can be used.
- Page ranges: The pages parameter accepts several formats:
  - Single page: pages=1
  - Range: pages=1-100
  - Multiple ranges: pages=1-10,15,20-30
- Service limits still apply: While this avoids the download/upload step, the service's document size limits still apply. I was unable to get a 2,000-page PDF to work, even with a page range of a single page.
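If you start from a list of page numbers rather than a ready-made string, a small helper can produce the range syntax shown in the notes. This is only a sketch: to_pages_param is a hypothetical name of my own, and it assumes 1-based page numbers.

```python
from typing import Iterable, List


def to_pages_param(page_numbers: Iterable[int]) -> str:
    """Collapse page numbers into a compact range string such as "1-3,5,7-8"."""
    pages = sorted(set(page_numbers))  # de-duplicate and order
    if not pages:
        raise ValueError("at least one page number is required")
    if pages[0] < 1:
        raise ValueError("page numbers are 1-based")

    parts: List[str] = []
    start = prev = pages[0]
    for n in pages[1:]:
        if n == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = n
            continue
        # Run ended: emit it as "a" or "a-b", then start a new run.
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = n
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


if __name__ == "__main__":
    print(to_pages_param([1, 2, 3, 5, 7, 8]))
```

The returned string can be passed as the pages argument of the analyze call.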