Unable to analyze and extract content from pptx, docx files using Azure Form Recognizer

VIPPALA MADHAVA REDDY 25 Reputation points
2023-12-13T16:50:54.57+00:00

I'm trying to extract data from a PPT file with Form Recognizer using the Python SDK, but it throws an unsupported-format error.

Sample code used:

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

endpoint = "<endpoint>"
key = "<key>"
file_path = "<ppt path>"
document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
with open(file_path, "rb") as f:
    poller = document_analysis_client.begin_analyze_document("prebuilt-document", document=f, locale="en-US")
# Get the result of the analysis
result = poller.result()

Error response:

Message: Invalid request. Inner error: { "code": "InvalidContent", "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats." }

But the same code works for pdf documents.

Does Form Recognizer support these file types (pptx, docx, etc.) at all? I couldn't find it documented anywhere I've looked so far.

Is it possible to analyze and extract data from these file types? If not, which file types are supported?

And are there any workarounds for extracting data from pptx and docx files?

Thanks in advance.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

2 answers

  1. VasaviLankipalle-MSFT 14,576 Reputation points
    2023-12-14T05:40:48.6033333+00:00

    Hello @VIPPALA MADHAVA REDDY, thanks for using the Microsoft Q&A Platform.

    Unfortunately, the PPTX and DOCX file formats are not supported by the prebuilt-document model; only a few models support them. The supported file formats are listed here, please review: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-general-document?view=doc-intel-3.1.0#input-requirements

    I hope this helps.

    Regards,
    Vasavi

    Please accept the answer and vote 'yes' if you found it helpful, to support the community. Thanks.
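
Since the service only reports the problem after the upload, a local pre-flight check can catch unsupported files earlier. The following is a minimal sketch, not part of the SDK: the function name and the extension set are assumptions for illustration, with the set based on the answer above and the linked input-requirements page. Verify it against that page for the model and API version you use.

```python
from pathlib import Path

# Assumption: extensions accepted by the prebuilt-document model, based on the
# answer above and the linked input-requirements page. Verify before relying on it.
PREBUILT_DOCUMENT_EXTENSIONS = {
    ".pdf", ".jpeg", ".jpg", ".png", ".bmp", ".tif", ".tiff", ".heif",
}


def is_accepted_by_prebuilt_document(file_path: str) -> bool:
    """Return True if the file extension is in the assumed supported set."""
    return Path(file_path).suffix.lower() in PREBUILT_DOCUMENT_EXTENSIONS


# Hypothetical file names, used only to show the check.
for name in ("report.pdf", "slides.pptx", "notes.docx"):
    print(name, is_accepted_by_prebuilt_document(name))
```

A check like this only looks at the file name, so a mislabeled or corrupted file would still be rejected by the service with the same InvalidContent error.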


  2. Pavan Kumar 0 Reputation points Microsoft Employee
    2024-01-12T13:03:25.94+00:00
    with open("Open_AI_Studio_Initiative.pptx", "rb") as f:
        poller = document_analysis_client.begin_analyze_document(
            "prebuilt-layout", document=f
        )
        result = poller.result()
    
    # Join the text of every detected paragraph into a single string
    file_content = " ".join(paragraph.content for paragraph in result.paragraphs)
    
    print("Extracted text:", file_content)
    

    Even with prebuilt-read and prebuilt-layout it throws an error. Any workaround?

    Inner error: {
        "code": "InvalidContent",
        "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."
    }
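
One possible workaround, if only the plain text is needed, is to read the Office file locally instead of sending it to the service. This is a rough sketch under stated assumptions, not a Document Intelligence feature: .docx and .pptx files are ZIP archives of XML parts, so the standard library can pull out the text runs. The function name `extract_ooxml_text` is hypothetical, and the sketch ignores tables, notes, layout, and images.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace-qualified names of the text-run elements in Office Open XML.
W_T = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t"
A_T = "{http://schemas.openxmlformats.org/drawingml/2006/main}t"

SLIDE_PART = re.compile(r"ppt/slides/slide(\d+)\.xml")


def extract_ooxml_text(file_or_path):
    """Return the text runs of a .docx or .pptx file joined by spaces."""
    chunks = []
    with zipfile.ZipFile(file_or_path) as archive:
        names = archive.namelist()
        if "word/document.xml" in names:
            # Word document: the body text lives in a single part.
            parts, tag = ["word/document.xml"], W_T
        else:
            # Presentation: one part per slide. Sorting by the number in the
            # part name approximates slide order; the authoritative order is
            # defined in ppt/presentation.xml, which this sketch does not read.
            slides = [m for m in map(SLIDE_PART.fullmatch, names) if m]
            slides.sort(key=lambda m: int(m.group(1)))
            parts, tag = [m.group(0) for m in slides], A_T
        for part in parts:
            root = ET.fromstring(archive.read(part))
            chunks.extend(el.text for el in root.iter(tag) if el.text)
    return " ".join(chunks)
```

Calling `extract_ooxml_text("Open_AI_Studio_Initiative.pptx")` would return the slide text as one string. Joining runs with spaces can split words that Word stores across several runs, so treat the result as a rough extraction.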