Unable to analyze and extract content from pptx, docx files using Azure Form Recognizer

VIPPALA MADHAVA REDDY 50 Reputation points
2023-12-13T16:50:54.57+00:00

I'm trying to extract data from a PPT file with Form Recognizer using the Python SDK, but it throws an unsupported format error.

Sample code used:

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

endpoint = "<endpoint>"
key = "<key>"
file_path = "<ppt path>"
document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
# Submit the file to the prebuilt-document model
with open(file_path, "rb") as f:
    poller = document_analysis_client.begin_analyze_document("prebuilt-document", document=f, locale="en-US")
    # Get the result of the analysis
    result = poller.result()

Error response:

Message: Invalid request.Inner error: {    "code": "InvalidContent",    "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."}

But the same code works for pdf documents.

Does Form Recognizer support these file types (pptx, docx, etc.) at all? I couldn't find it documented anywhere I've looked so far.

Is it possible to analyze and extract data from these file types? If not, which file types are supported?

Are there any workarounds for extracting data from pptx and docx files?

Thanks in advance.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

Accepted answer
  1. VasaviLankipalle-MSFT 18,676 Reputation points Moderator
    2023-12-14T05:40:48.6033333+00:00

    Hello @VIPPALA MADHAVA REDDY, thanks for using the Microsoft Q&A platform.

    Unfortunately, the PPTX and DOCX file formats are not supported by the prebuilt-document model; only a few models support them. The supported file formats are listed here: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-general-document?view=doc-intel-3.1.0#input-requirements
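
As a minimal sketch of checking a file before uploading it, the snippet below compares the file extension against an allow-list. The list and the helper name is_probably_supported are assumptions for illustration, based on the PDF and image formats the linked page lists for prebuilt-document in the v3.1 API. Verify the list against the current documentation, since support varies by model and API version.

```python
from pathlib import Path

# ASSUMPTION: extensions accepted by prebuilt-document on the v3.1 API
# (PDF plus JPEG/JPG, PNG, BMP, TIFF, HEIF). ".tif" is included as a common
# alias for TIFF. Check the linked input requirements before relying on this.
ASSUMED_SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpeg", ".jpg", ".png", ".bmp", ".tiff", ".tif", ".heif",
}


def is_probably_supported(file_path):
    """Return True if the file extension is in the assumed allow-list.

    This only inspects the name; it does not validate the file contents.
    """
    return Path(file_path).suffix.lower() in ASSUMED_SUPPORTED_EXTENSIONS
```

With this check, a call such as is_probably_supported("slides.pptx") returns False, so the file can be skipped or converted before it reaches begin_analyze_document.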

    I hope this helps.

    Regards,
    Vasavi

    - Please accept the answer and vote 'yes' if you found it helpful, to support the community. Thanks.


1 additional answer

  1. Pavan Kumar 10 Reputation points Microsoft Employee
    2024-01-12T13:03:25.94+00:00
    # Assumes document_analysis_client is created as in the question above.
    with open("Open_AI_Studio_Initiative.pptx", "rb") as f:
        poller = document_analysis_client.begin_analyze_document(
            "prebuilt-layout", document=f
        )
        result = poller.result()

    # Join the text of every detected paragraph into one string.
    file_content = " ".join(paragraph.content for paragraph in result.paragraphs)

    print("Extracted text:", file_content)
    

    Even with prebuilt-read and prebuilt-layout, it throws an error. Is there any workaround?

    Inner error: {
        "code": "InvalidContent",
        "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."
    }
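
One possible workaround, sketched here under assumptions and not an official recommendation, is to read the text locally instead of sending the file to the service. A .pptx file is a ZIP archive of XML parts, and slide text sits in DrawingML <a:t> elements. The helper name extract_pptx_text is hypothetical. The sketch uses only the Python standard library, does not preserve layout, ignores images and speaker notes, and orders slides by file number, which may differ from the presentation order.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace that holds text runs (<a:t>) in slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
# Slide parts inside the archive, e.g. ppt/slides/slide3.xml.
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def extract_pptx_text(source):
    """Return one string of text per slide.

    `source` is a path or a binary file-like object for a .pptx file.
    Slides are ordered by the number in their file name (an assumption;
    the true presentation order is defined elsewhere in the package).
    """
    slides = []
    with zipfile.ZipFile(source) as archive:
        numbered = []
        for name in archive.namelist():
            match = SLIDE_RE.match(name)
            if match:
                numbered.append((int(match.group(1)), name))
        for _, name in sorted(numbered):
            root = ET.fromstring(archive.read(name))
            runs = [node.text for node in root.iter(A_NS + "t") if node.text]
            slides.append(" ".join(runs))
    return slides
```

Calling extract_pptx_text("Open_AI_Studio_Initiative.pptx") would then return a list of per-slide strings that can be joined in the same way as the paragraph loop above.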
    
