Hello Vishnu Narayanan,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are having issues extracting text from .doc files using Azure Document Intelligence. The .doc format is a legacy format that the service does not support. Azure Document Intelligence only supports modern formats like .docx, PDF, and common image types, so the best approach is to convert .doc files to .docx or PDF before processing.
To address this, you should first convert the .doc file using a reliable tool like LibreOffice in headless mode. This ensures the document is transformed into a supported format without requiring manual intervention. Once converted, you must ensure the correct MIME type is used, specifically application/vnd.openxmlformats-officedocument.wordprocessingml.document for .docx files. It's also crucial to use a compatible model such as prebuilt-read or prebuilt-layout, as not all models support .docx: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0
Additionally, validating the file format before uploading helps prevent runtime errors and improves reliability.
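One way to validate the format before uploading is to look at the first bytes of the file instead of trusting the extension. The sketch below is only an illustration: the helper names `sniff_container` and `needs_conversion` are hypothetical, and the check identifies the container type, not the exact document type (other legacy Office files share the OLE2 signature, and every Office Open XML file is a ZIP package).

```python
# Leading bytes ("magic numbers") of the container formats discussed above.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                        # .docx is a ZIP package
PDF_MAGIC = b"%PDF-"                             # PDF header


def sniff_container(header: bytes) -> str:
    """Classify a file by its first bytes: 'ole2', 'zip', 'pdf' or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


def needs_conversion(path: str) -> bool:
    """Return True when the file looks like a legacy OLE2 document such as .doc."""
    with open(path, "rb") as f:
        return sniff_container(f.read(8)) == "ole2"
```

Calling `needs_conversion("path_to_your_doc_file.doc")` before the upload lets you route legacy files through the conversion step and send everything else directly.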
Below is a Python script that automates this process. It converts a .doc file to .docx using LibreOffice, then sends the .docx file to Azure Document Intelligence for text extraction:
import subprocess
from pathlib import Path

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

# Define constants for Azure Document Intelligence
AZURE_ENDPOINT = "your_azure_endpoint"
AZURE_DOC_INTELLIGENCE_KEY = "your_azure_key"


class AzureDocumentIntelligenceExtractor:
    def __init__(self, model_id: str = "prebuilt-read"):
        self.endpoint = AZURE_ENDPOINT
        self.key = AZURE_DOC_INTELLIGENCE_KEY
        self.document_intelligence_client = DocumentIntelligenceClient(
            endpoint=self.endpoint,
            credential=AzureKeyCredential(self.key)
        )
        self.model_id = model_id

    def convert_doc_to_docx(self, doc_path: str, output_dir: str) -> bool:
        """Convert a .doc file to .docx with LibreOffice; return True on success."""
        try:
            subprocess.run(['libreoffice', '--headless', '--convert-to', 'docx', doc_path, '--outdir', output_dir], check=True)
            print(f"Converted {doc_path} to .docx in {output_dir}")
            return True
        except (subprocess.CalledProcessError, FileNotFoundError) as e:
            # FileNotFoundError is raised when the libreoffice executable is not on PATH
            print(f"Error converting {doc_path} to .docx: {e}")
            return False

    def extract_text_from_docx(self, docx_path: str) -> str:
        try:
            with open(docx_path, "rb") as docx_file:
                poller = self.document_intelligence_client.begin_analyze_document(
                    self.model_id,
                    docx_file,
                    content_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
                )
                result = poller.result()
            # Extract all text, line by line
            lines = [line.content for page in (result.pages or []) for line in (page.lines or [])]
            # Assumption: some inputs may come back without per-page lines,
            # so fall back to the full content string of the result.
            extracted_text = "\n".join(lines) if lines else (result.content or "")
            return extracted_text.strip()
        except Exception as e:
            print(f"Error extracting text from DOCX: {e}")
            return ""


# Example usage
doc_path = "path_to_your_doc_file.doc"
output_dir = "output_directory_path"
# LibreOffice keeps the input file's base name and swaps the extension to .docx
docx_path = str(Path(output_dir) / (Path(doc_path).stem + ".docx"))
extractor = AzureDocumentIntelligenceExtractor()
if extractor.convert_doc_to_docx(doc_path, output_dir):
    extracted_text = extractor.extract_text_from_docx(docx_path)
    print(extracted_text)
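The script above calls the `libreoffice` executable by name. On some installations the launcher is exposed as `soffice` instead, so it may help to look the binary up before running the conversion. The following is a minimal sketch under that assumption; `find_libreoffice` and `build_convert_command` are hypothetical helper names, not part of LibreOffice or the Azure SDK.

```python
import shutil
from typing import List, Optional


def find_libreoffice() -> Optional[str]:
    """Return the first LibreOffice launcher found on PATH, or None."""
    for name in ("libreoffice", "soffice"):
        path = shutil.which(name)
        if path:
            return path
    return None


def build_convert_command(binary: str, doc_path: str, output_dir: str) -> List[str]:
    """Build the same headless conversion command that the script above runs."""
    return [binary, "--headless", "--convert-to", "docx", doc_path, "--outdir", output_dir]


if __name__ == "__main__":
    binary = find_libreoffice()
    if binary is None:
        print("LibreOffice not found on PATH; install it or use its full path.")
    else:
        print(build_convert_command(binary, "path_to_your_doc_file.doc", "output_directory_path"))
```

The returned list can be passed straight to `subprocess.run(..., check=True)` in place of the hard-coded command.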
I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.
Please don't forget to close the thread by upvoting this answer and accepting it if it is helpful.