When I try to extract text from a .doc file using Azure Document Intelligence, the format is reported as not supported. The same approach works fine with PDFs, and I have tried several different documents. How can I fix this?

Vishnu Narayanan 0 Reputation points
2025-05-12T15:18:01.75+00:00
import base64
import io
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
# from config import AZURE_ENDPOINT, AZURE_DOC_INTELLIGENCE_KEY  # adjust if needed

class AzureDocumentIntelligenceExtracter:
    def __init__(self, model_id: str = "prebuilt-read"):
        self.endpoint = AZURE_ENDPOINT
        self.key = AZURE_DOC_INTELLIGENCE_KEY
        self.document_intelligence_client = DocumentIntelligenceClient(
            endpoint=self.endpoint,
            credential=AzureKeyCredential(self.key)
        )
        self.model_id = model_id

    def extract_text_from_base64_pdf(self, base64_content: str, content_type: str = "application/pdf") -> str:
        try:
            pdf_bytes = base64.b64decode(base64_content)
            stream = io.BytesIO(pdf_bytes)

            poller = self.document_intelligence_client.begin_analyze_document(
                self.model_id,
                stream,
                content_type="application/msword"
            )

            result = poller.result()

            # Extract all text
            extracted_text = ""
            for page in result.pages:
                for line in page.lines:
                    extracted_text += line.content + "\n"

            return extracted_text.strip()

        except Exception as e:
            print("Error extracting text from PDF:", e)
            return ""

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. Sina Salam 22,031 Reputation points Volunteer Moderator
    2025-05-13T17:49:12.22+00:00

    Hello Vishnu Narayanan,

    Welcome to Microsoft Q&A, and thank you for posting your question here.

    I understand that you are having trouble extracting text from .doc files using Azure Document Intelligence. The .doc format is a legacy format that the service does not support. Azure Document Intelligence only supports modern formats such as .docx, PDF, and common image types, so the best approach is to convert .doc files to .docx or PDF before processing.

    To address this, first convert the .doc file with a reliable tool such as LibreOffice in headless mode, which transforms the document into a supported format without manual intervention. Once the file is converted, make sure the correct MIME type is used; for .docx files that is application/vnd.openxmlformats-officedocument.wordprocessingml.document. It is also important to use a compatible model such as prebuilt-read or prebuilt-layout, because not all models support .docx (see https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0). Additionally, validating the file format before uploading helps prevent runtime errors and improves reliability.
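
    As a rough illustration of the validation step mentioned above, the sketch below inspects the first bytes of a file to decide which content type to send, or whether the file needs converting first. The function name detect_content_type and the constants DOCX_MIME and OLE2_MAGIC are hypothetical names chosen for this example, and checking for word/document.xml is only a heuristic based on how Word conventionally lays out a .docx archive, not a full validation.

```python
import io
import zipfile
from typing import Optional

# Hypothetical constant names, chosen for this example only.
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
# Signature of the OLE2 compound file container used by legacy .doc/.xls/.ppt files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def detect_content_type(data: bytes) -> Optional[str]:
    """Return a MIME type to send, or None if the bytes should be converted or rejected."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(OLE2_MAGIC):
        # Legacy Office container (for example .doc): convert it first.
        return None
    if data.startswith(b"PK\x03\x04"):
        # ZIP archive: treat it as .docx only if the conventional main part exists.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                if "word/document.xml" in archive.namelist():
                    return DOCX_MIME
        except zipfile.BadZipFile:
            return None
    return None
```

    A return value of None is the signal to convert the file (for example with LibreOffice) or to reject it before calling the service.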

    The Python script below automates the conversion and extraction. It converts a .doc file to .docx using LibreOffice, then sends the .docx file to Azure Document Intelligence for text extraction:

    import io
    import subprocess
    from pathlib import Path
    from typing import Optional

    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential

    # Define constants for Azure Document Intelligence (placeholders; replace with your own values)
    AZURE_ENDPOINT = "your_azure_endpoint"
    AZURE_DOC_INTELLIGENCE_KEY = "your_azure_key"
    DOCX_MIME_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


    class AzureDocumentIntelligenceExtractor:
        def __init__(self, model_id: str = "prebuilt-read"):
            self.endpoint = AZURE_ENDPOINT
            self.key = AZURE_DOC_INTELLIGENCE_KEY
            self.document_intelligence_client = DocumentIntelligenceClient(
                endpoint=self.endpoint,
                credential=AzureKeyCredential(self.key),
            )
            self.model_id = model_id

        def convert_doc_to_docx(self, doc_path: str, output_dir: str) -> Optional[str]:
            """Convert a .doc file with LibreOffice; return the .docx path, or None on failure."""
            try:
                # The executable may be named "soffice" instead of "libreoffice" on some systems.
                subprocess.run(
                    ["libreoffice", "--headless", "--convert-to", "docx", doc_path, "--outdir", output_dir],
                    check=True,
                )
            except (subprocess.CalledProcessError, FileNotFoundError) as e:
                print(f"Error converting {doc_path} to .docx: {e}")
                return None
            # LibreOffice names the output after the input file, e.g. report.doc -> report.docx.
            docx_path = Path(output_dir) / (Path(doc_path).stem + ".docx")
            print(f"Converted {doc_path} to {docx_path}")
            return str(docx_path)

        def extract_text_from_docx(self, docx_path: str) -> str:
            try:
                with open(docx_path, "rb") as docx_file:
                    stream = io.BytesIO(docx_file.read())
                poller = self.document_intelligence_client.begin_analyze_document(
                    self.model_id,
                    stream,
                    content_type=DOCX_MIME_TYPE,
                )
                result = poller.result()
                # Extract all text line by line; pages or lines may be missing, so guard against None.
                lines = [line.content for page in (result.pages or []) for line in (page.lines or [])]
                # Fall back to the full document content if no lines were returned.
                extracted_text = "\n".join(lines) if lines else (result.content or "")
                return extracted_text.strip()
            except Exception as e:
                print(f"Error extracting text from DOCX: {e}")
                return ""


    # Example usage
    doc_path = "path_to_your_doc_file.doc"
    output_dir = "output_directory_path"
    extractor = AzureDocumentIntelligenceExtractor()
    # Use the path returned by the conversion instead of guessing the output file name.
    docx_path = extractor.convert_doc_to_docx(doc_path, output_dir)
    if docx_path:
        print(extractor.extract_text_from_docx(docx_path))
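
    Since the code in the question receives the document as a base64 string, the same conversion could be applied to the decoded bytes before they are sent to the service. The following is only a sketch under stated assumptions: base64_to_supported_bytes and libreoffice_convert are hypothetical helper names, the LibreOffice executable is assumed to be on the PATH as libreoffice (it may be soffice on some systems), and the converter is passed in as a parameter so that it can be swapped or stubbed.

```python
import base64
import subprocess
import tempfile
from pathlib import Path

# Signature of the OLE2 container used by legacy .doc/.xls/.ppt files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def libreoffice_convert(doc_path: Path, out_dir: Path) -> Path:
    """Run LibreOffice headless and return the path of the .docx it writes."""
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "docx",
         "--outdir", str(out_dir), str(doc_path)],
        check=True,
        timeout=120,  # arbitrary limit so a hung conversion does not block forever
    )
    # LibreOffice names the output after the input file's stem.
    return out_dir / (doc_path.stem + ".docx")


def base64_to_supported_bytes(base64_content: str, convert=libreoffice_convert) -> bytes:
    """Decode base64 input; convert legacy Office bytes to .docx, pass others through."""
    raw = base64.b64decode(base64_content)
    if not raw.startswith(OLE2_MAGIC):
        return raw  # not a legacy container, so leave the bytes unchanged
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        src = tmp_dir / "input.doc"
        src.write_bytes(raw)
        # Read the converted file before the temporary directory is removed.
        return convert(src, tmp_dir).read_bytes()
```

    The returned bytes can then be wrapped in io.BytesIO and passed to begin_analyze_document with the .docx MIME type mentioned earlier in this answer.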
    

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close the thread by upvoting this answer and accepting it if it is helpful.

