Unable to analyze and extract content from pptx, docx files using Azure Form Recognizer

Question

I'm trying to extract data from a .pptx file with Form Recognizer using the Python SDK, but it throws an unsupported-format error.

Sample code used:

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

endpoint = ""
key = ""
file_path = ""
document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
with open(file_path, "rb") as f:
    poller = document_analysis_client.begin_analyze_document("prebuilt-document", document=f, locale="en-US")
# Get the result of the analysis
result = poller.result()

Error response:

Message: Invalid request.Inner error: {    "code": "InvalidContent",    "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."}

But the same code works for pdf documents.

Does Form Recognizer support these file types (pptx, docx, etc.) at all? I couldn't find it documented anywhere I've looked so far.

Is it possible to analyze and extract data from these file types? If not, which file types are supported?

And are there any workarounds to extract data from pptx and docx files?

References:

Getting Started with Document Intelligence

Thanks in advance.

Answer

Hello @VIPPALA MADHAVA REDDY, thanks for using Microsoft Q&A Platform.

Unfortunately, the pptx and docx file formats are not supported by the prebuilt-document model. Only a few models support these formats. The supported file formats are listed here; please review them: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-general-document?view=doc-intel-3.1.0#input-requirements
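
To fail fast instead of waiting for the service to return InvalidContent, a small pre-flight check on the file extension may help. This is only a sketch: the helper name is_supported_for_prebuilt_document is hypothetical, and the extension set is my assumption based on the input requirements on the page linked above (PDF, JPEG/JPG, PNG, BMP, TIFF, HEIF), so please verify it against the documentation for the model and API version you use.

```python
from pathlib import Path

# Assumption: extensions corresponding to the formats listed under
# "Input requirements" for prebuilt-document (v3.1). Verify before relying on it.
SUPPORTED_EXTENSIONS = {".pdf", ".jpeg", ".jpg", ".png", ".bmp", ".tiff", ".tif", ".heif"}


def is_supported_for_prebuilt_document(file_path):
    """Return True if the file extension is in the assumed supported set."""
    return Path(file_path).suffix.lower() in SUPPORTED_EXTENSIONS
```

Calling this before begin_analyze_document lets you route pptx and docx files to a different path (for example, converting them to PDF first) rather than sending them to prebuilt-document.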

I hope this helps.

Regards,
Vasavi

-Please kindly accept the answer and vote 'yes' if you feel helpful to support the community, thanks.

Answer

with open("Open_AI_Studio_Initiative.pptx", "rb") as f:
    poller = document_analysis_client.begin_analyze_document(
        "prebuilt-layout", document=f
    )
    result = poller.result()

file_content = ""
for paragraph in result.paragraphs:
    file_content = file_content + " " + paragraph.content

print("Extracted text:", file_content)

Even with prebuilt-read and prebuilt-layout it throws an error. Any workaround?

Inner error: {
    "code": "InvalidContent",
    "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."
}
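
One possible workaround, if plain text is all you need, is to read the text locally without calling the service. A .docx or .pptx file is a ZIP archive of XML parts, with text runs stored in <w:t> (Word) and <a:t> (PowerPoint) elements. The sketch below uses only the Python standard library; the function name extract_ooxml_text is hypothetical, and it does not replace Form Recognizer's layout, table, or OCR features (text inside images is not recovered).

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Text-run elements in Office Open XML: <w:t> in Word, <a:t> in PowerPoint.
W_T = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t"
A_T = "{http://schemas.openxmlformats.org/drawingml/2006/main}t"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def extract_ooxml_text(path):
    """Return the text runs of a .docx or .pptx file joined by spaces."""
    chunks = []
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        if "word/document.xml" in names:
            # Word: the main body lives in a single part.
            parts, tag = ["word/document.xml"], W_T
        else:
            # PowerPoint: one part per slide, sorted by slide number.
            slides = [n for n in names if SLIDE_RE.match(n)]
            parts = sorted(slides, key=lambda n: int(SLIDE_RE.match(n).group(1)))
            tag = A_T
        for part in parts:
            root = ET.fromstring(archive.read(part))
            chunks.extend(el.text for el in root.iter(tag) if el.text)
    return " ".join(chunks)
```

For example, file_content = extract_ooxml_text("Open_AI_Studio_Initiative.pptx") would give a single string comparable to the one built from result.paragraphs above. If you need layout or tables, converting the file to PDF first and sending the PDF to the service is another option, since the same code works for PDF documents.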
