Document Intelligence error when trying to output markdown (python, langchain SDK)

Question

Dadfar, Reza 0 Reputation points

Hi Microsoft Support Team,

I've encountered an issue while trying to analyze a publicly available document using both Document Intelligence Studio and the Python SDK, following the example in your GitHub repository (https://github.com/microsoft/Form-Recognizer-Toolkit/blob/main/SampleCode/Python/sample_rag_langchain.ipynb). The document is available at this URL: "https://www.orica.com/ArticleDocuments/301/FY2023%20Annual%20Report.pdf.aspx". When I attempted the analysis, I received the following error message:

Code: InternalServerError
Message: An unexpected error occurred. Exception Details: (FailedToSerializeAnalyzeResult) Failed to serialize analyze results, please contact support.
Code: FailedToSerializeAnalyzeResult
Message: Failed to serialize analyze results, please contact support.

I would greatly appreciate any assistance or guidance you could provide to resolve this issue. Additionally, I have two follow-up questions:

1. Is there a way to directly extract the Markdown file from the Document Intelligence studio without using the SDK?
2. The LangChain API and the example provided seem to work only with individual files. Is there an API available for processing folders containing several files?

Thank you for your help.

VasaviLankipalle-MSFT 18,676 Reputation points Moderator

2024-03-04T22:00:47.1433333+00:00

Hello @Dadfar, Reza, thank you for your response. Yes, it looks like the analysis works for some page ranges and fails for others. I tried page range 6-9 and it worked.

I will share this feedback with the product team and let you know once we have an ETA for the fix.

Answer 1

Hello @Dadfar, Reza, thanks for using the Microsoft Q&A platform.

Yes, this is an ongoing issue with Markdown output when using Document Intelligence Studio or the Python SDK to analyze a PDF document with the prebuilt-layout model for a specific page or page range.

The latest update from the product team today is that it has started working after a fix. I reproduced the scenario with your sample document by specifying a page range and was able to get the Markdown output. Please try the same on your end.
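To narrow down which page ranges still fail, one option is to split the document into page-range strings such as "6-9" and try each one in turn. The following is only a minimal sketch: `page_ranges` and `find_failing_ranges` are hypothetical helper names, and `analyze` stands for whatever call you already use to analyze the document with a given page range (Studio and the SDK both let you specify one). It does not call the service itself.

```python
from typing import Callable, List, Tuple


def page_ranges(total_pages: int, chunk_size: int) -> List[str]:
    """Split pages 1..total_pages into range strings of at most chunk_size pages."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A single trailing page is written as "9" rather than "9-9".
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


def find_failing_ranges(
    analyze: Callable[[str], object], ranges: List[str]
) -> List[Tuple[str, str]]:
    """Call analyze(pages) for each range and collect (range, error message) for failures.

    `analyze` is an assumption: wrap your own SDK or REST call in a function
    that takes the page-range string and raises an exception on error.
    """
    failures = []
    for pages in ranges:
        try:
            analyze(pages)
        except Exception as exc:  # broad on purpose: we only record what failed
            failures.append((pages, str(exc)))
    return failures
```

For example, `page_ranges(30, 10)` yields "1-10", "11-20" and "21-30"; passing those to `find_failing_ranges` would report any range that still returns FailedToSerializeAnalyzeResult.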

User's image

Regarding your question,

Is there a way to directly extract the Markdown file from the Document Intelligence studio without using the SDK?

As shown in the screenshot here, you can either copy the data or download the JSON result from the studio and extract the required data from it. This should help.
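Extracting the text from a downloaded JSON result could look roughly like the sketch below. It assumes the file follows the usual analyze-result shape, with the text in a `content` field either at the top level or nested under `analyzeResult`; check your downloaded file, since the exact layout is an assumption here. The helper names are hypothetical, and the content is only Markdown if the analysis was run with Markdown output selected.

```python
import json
from pathlib import Path


def extract_content(result: dict) -> str:
    """Return the 'content' string from an analyze result dictionary.

    Assumption: the text lives in result["analyzeResult"]["content"], or in
    result["content"] when the file holds the analyze result directly.
    """
    analyze_result = result.get("analyzeResult", result)
    content = analyze_result.get("content")
    if content is None:
        raise KeyError("no 'content' field found in the analyze result")
    return content


def json_to_markdown(json_path: str, md_path: str) -> None:
    """Read a downloaded JSON result and write its content to a .md file."""
    result = json.loads(Path(json_path).read_text(encoding="utf-8"))
    Path(md_path).write_text(extract_content(result), encoding="utf-8")
```

Usage would be something like `json_to_markdown("result.json", "result.md")`, where both file names are placeholders.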

The LangChain API and the example provided seem to work only with individual files. Is there an API available for processing folders containing several files?

We don't have much information about the LangChain API. You could raise this question in the following repository for better assistance: https://github.com/Azure-Samples/function-python-ai-langchain/issues
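In the meantime, a folder can be handled on the client side by looping over its files and calling the single-file loader once per file. This is a minimal sketch, not an official batch API: `load_folder` and `load_one` are hypothetical names, and the commented-out loader call mirrors the linked sample notebook but should be checked against the LangChain version you have installed.

```python
from pathlib import Path
from typing import Callable, Dict, Sequence


def load_folder(
    folder: str,
    load_one: Callable[[Path], object],
    patterns: Sequence[str] = ("*.pdf",),
) -> Dict[str, object]:
    """Apply load_one to every matching file in folder, keyed by file name.

    `load_one` is an assumption: any function that takes a file path and
    returns the loaded result for that one file.
    """
    results: Dict[str, object] = {}
    for pattern in patterns:
        # sorted() keeps the processing order stable between runs
        for path in sorted(Path(folder).glob(pattern)):
            if path.is_file():
                results[path.name] = load_one(path)
    return results


# Hypothetical load_one, following the sample notebook (verify before use):
#
# from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
#
# def load_one(path):
#     loader = AzureAIDocumentIntelligenceLoader(
#         file_path=str(path),
#         api_key="<your-key>",
#         api_endpoint="<your-endpoint>",
#         api_model="prebuilt-layout",
#     )
#     return loader.load()
```

Each file is still analyzed as a separate request, so a failure on one document (such as the serialization error above) can be caught inside `load_one` without stopping the rest of the folder.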

I hope this helps.

Regards,

Vasavi

Please accept the answer and vote 'Yes' if you found it helpful, to support the community. Thanks.

Dadfar, Reza 0 Reputation points

2024-03-04T05:57:10.5966667+00:00

Hi,

Thank you for your reply.

On my end, it seems the problem occurs on pages 21-30, where I encounter the error. Could you please try that and see if you experience the same issue?

Aarti B 0 Reputation points

2024-03-07T12:22:01.7266667+00:00

Could you please provide this fix? We are also getting the same issue.
