When I try to extract text from a .doc file using Azure Document Intelligence, it's not supported. It works fine with PDFs. I tried extracting text from various documents. How can I fix this?

Question

Vishnu Narayanan 0

import base64
import io
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
# from config import AZURE_ENDPOINT, AZURE_DOC_INTELLIGENCE_KEY  # adjust if needed

class AzureDocumentIntelligenceExtracter:
    def __init__(self, model_id: str = "prebuilt-read", ):
        self.endpoint = AZURE_ENDPOINT
        self.key = AZURE_DOC_INTELLIGENCE_KEY
        self.document_intelligence_client = DocumentIntelligenceClient(
            endpoint=self.endpoint,
            credential=AzureKeyCredential(self.key)
        )
        self.model_id = model_id

    def extract_text_from_base64_pdf(self, base64_content: str, content_type: str = "application/pdf") -> str:
        try:
            pdf_bytes = base64.b64decode(base64_content)
            stream = io.BytesIO(pdf_bytes)


            poller = self.document_intelligence_client.begin_analyze_document(
                self.model_id,
                stream,
                content_type="application/msword"
            )

            result = poller.result()

            # Extract all text
            extracted_text = ""
            for page in result.pages:
                for line in page.lines:
                    extracted_text += line.content + "\n"

            return extracted_text.strip()

        except Exception as e:
            print("Error extracting text from PDF:", e)
            return ""

1 answer

Answer 1

Hello Vishnu Narayanan,

Welcome to Microsoft Q&A, and thank you for posting your question here.

I understand that you are having trouble extracting text from .doc files with Azure Document Intelligence. The cause is that .doc is a legacy binary format that the service does not accept. It works with PDF, common image types, and modern Office formats such as .docx. The practical fix is therefore to convert each .doc file to .docx or PDF before processing.
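
As a rough illustration of that routing decision, the sketch below sorts a file name into "upload", "convert", or "unsupported" by its extension. The function name and both extension sets are assumptions made for this example; the sets are illustrative subsets, not the service's complete format list.

```python
from pathlib import Path

# Assumption: illustrative subsets only, not the service's full list of formats
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".jpg", ".jpeg", ".png", ".bmp", ".tiff"}
LEGACY_EXTENSIONS = {".doc"}  # must be converted before upload


def classify_for_upload(filename: str) -> str:
    """Return 'upload', 'convert', or 'unsupported' based on the file extension."""
    ext = Path(filename).suffix.lower()
    if ext in SUPPORTED_EXTENSIONS:
        return "upload"
    if ext in LEGACY_EXTENSIONS:
        return "convert"
    return "unsupported"


print(classify_for_upload("report.DOC"))   # convert
print(classify_for_upload("scan.pdf"))     # upload
```

An extension check is cheap but only as trustworthy as the file name, so it is best treated as a first filter.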

To do this, first convert the .doc file with a tool such as LibreOffice in headless mode, which handles the conversion without manual intervention. Once the file is converted, send it with the correct MIME type: application/vnd.openxmlformats-officedocument.wordprocessingml.document for .docx files. Use a model that accepts .docx, such as prebuilt-read or prebuilt-layout, because not all models do; see https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0 for details. Validating the file format before uploading also helps prevent runtime errors and improves reliability.
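
One way to validate the format before uploading is to look at the first bytes of the file rather than trusting its name. The sketch below is a minimal example under that assumption; the function name is hypothetical. It recognises the OLE2 compound-file signature used by legacy .doc files, the ZIP signature used by .docx, and the PDF header. A ZIP signature only shows that the file is a ZIP container, not that it is a valid .docx.

```python
# File signatures ("magic bytes") at the start of each container format
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy Office files (.doc, .xls, .ppt)
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP container (.docx, .xlsx, .pptx)
PDF_MAGIC = b"%PDF-"


def sniff_container(data: bytes) -> str:
    """Guess the container type from the leading bytes of a file."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"      # legacy format: convert before sending
    if data.startswith(ZIP_MAGIC):
        return "zip"       # possibly .docx; not proof on its own
    if data.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


print(sniff_container(b"%PDF-1.7 example"))  # pdf
```

This also catches the case where a .doc file has simply been renamed to .docx, which an extension check would miss.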

Below is a Python script that automates this process. It converts a .doc file to .docx using LibreOffice, then sends the .docx file to Azure Document Intelligence for text extraction:

import subprocess
from pathlib import Path
from typing import Optional

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

# Replace these placeholders with your resource's endpoint and key
AZURE_ENDPOINT = "your_azure_endpoint"
AZURE_DOC_INTELLIGENCE_KEY = "your_azure_key"

DOCX_MIME_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


class AzureDocumentIntelligenceExtractor:
    def __init__(self, model_id: str = "prebuilt-read"):
        self.document_intelligence_client = DocumentIntelligenceClient(
            endpoint=AZURE_ENDPOINT,
            credential=AzureKeyCredential(AZURE_DOC_INTELLIGENCE_KEY),
        )
        self.model_id = model_id

    def convert_doc_to_docx(self, doc_path: str, output_dir: str) -> Optional[str]:
        """Convert a .doc file to .docx with LibreOffice; return the new path, or None on failure."""
        try:
            subprocess.run(
                ["libreoffice", "--headless", "--convert-to", "docx", doc_path, "--outdir", output_dir],
                check=True,
            )
        except (subprocess.CalledProcessError, FileNotFoundError) as e:
            print(f"Error converting {doc_path} to .docx: {e}")
            return None
        # LibreOffice keeps the input file's base name and swaps the extension
        docx_path = Path(output_dir) / (Path(doc_path).stem + ".docx")
        print(f"Converted {doc_path} to {docx_path}")
        return str(docx_path)

    def extract_text_from_docx(self, docx_path: str) -> str:
        """Send a .docx file to Azure Document Intelligence and return its text."""
        try:
            with open(docx_path, "rb") as docx_file:
                poller = self.document_intelligence_client.begin_analyze_document(
                    self.model_id,
                    docx_file,
                    content_type=DOCX_MIME_TYPE,
                )
                result = poller.result()
            # Office formats may come back without line-level output,
            # so read the full text from result.content instead of page.lines
            return (result.content or "").strip()
        except Exception as e:
            print(f"Error extracting text from DOCX: {e}")
            return ""


# Example usage
doc_path = "path_to_your_doc_file.doc"
output_dir = "output_directory_path"
extractor = AzureDocumentIntelligenceExtractor()
docx_path = extractor.convert_doc_to_docx(doc_path, output_dir)
if docx_path:
    print(extractor.extract_text_from_docx(docx_path))
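
The code in the question receives the document as a base64 string rather than a file path, while LibreOffice works on files. A minimal sketch of bridging the two, assuming a temporary file is acceptable in your environment, is shown below. The function name is hypothetical, and the caller is responsible for deleting the file afterwards.

```python
import base64
import tempfile
from pathlib import Path


def write_base64_to_temp_doc(base64_content: str, suffix: str = ".doc") -> Path:
    """Decode base64 content and write it to a temporary file for conversion."""
    data = base64.b64decode(base64_content)
    # delete=False keeps the file on disk after the handle is closed,
    # so an external converter can open it by path
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
    return Path(tmp.name)


# Example usage with placeholder bytes standing in for a real .doc payload
sample = base64.b64encode(b"example bytes").decode("ascii")
temp_path = write_base64_to_temp_doc(sample)
print(temp_path.suffix)  # .doc
temp_path.unlink()       # clean up once conversion and extraction are done
```

The returned path can then be passed to the conversion step in place of a file that already exists on disk.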

I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.

Please don't forget to close the thread by upvoting this answer and accepting it if it is helpful.

Ravada Shivaprasad 535 Reputation points Microsoft External Staff Moderator

2025-05-13T21:38:37.5833333+00:00

Hi Vishnu Narayanan

Just checking in to see if the above answer provided by @Sina Salam helped.

If this answers your query, please click "Accept Answer" and select "Yes" for "Was this answer helpful?". If you have any further questions, do let us know.

Thanks
Ravada Shivaprasad 535 Reputation points Microsoft External Staff Moderator

2025-05-15T15:52:56.0833333+00:00

Hi Vishnu Narayanan

Just checking in to see if the above answer helped. If this answers your query, please click "Accept Answer" and select "Yes" for "Was this answer helpful?". If you have any further questions, do let us know.

Thanks
Ravada Shivaprasad 535 Reputation points Microsoft External Staff Moderator

2025-05-16T18:45:35.51+00:00

Hi Vishnu Narayanan

Following up to see if the above answer was helpful. If this answers your query, please click "Accept Answer" and select "Yes" for "Was this answer helpful?". If you have any further questions, do let us know.

Thanks
