Custom Model - how to extract text to have same indentation/ structure as original document

Question

Ionut Dutescu 60

I have trained a Custom Model to extract all the text from images that contain PDF pages. We receive the PDFs as images and need to extract all the text in them. The Document Intelligence model extracts the text row by row, without respecting the original structure. Is there any way to extract all the text and output it with the same format and indentation, so that each piece of information appears in its original position (e.g. the date in the left corner, the email in the right corner, the details, and the line items aligned in the center)? Basically, the output text should have the same structure as the document. Is this possible in some way, or does the model just extract plain text and put it all in one long paragraph?

Thanks!

santoshkc 15,355 Reputation points Microsoft External Staff Moderator

2024-03-07T08:59:06.6233333+00:00

Hi @Ionut Dutescu,

Following up to see if the above response was helpful. Thank you.
santoshkc 15,355 Reputation points Microsoft External Staff Moderator

2024-03-08T09:07:02.0033333+00:00

Hi @Ionut Dutescu,

We haven’t heard back from you on the last response and are just checking whether it was helpful. If you have found a resolution, please share it with the community, as it can be helpful to others. Thank you.
Ionut Dutescu 60 Reputation points

2024-03-08T11:48:22.7933333+00:00

Hello @santoshkc ,

Thanks for your answers. Unfortunately, I have not checked the solution yet. I will try to come back with an answer today or no later than Monday.

Thank you,

Have a nice day!
Ionut Dutescu 60 Reputation points

2024-03-14T14:52:32.09+00:00

Hello @santoshkc ,

The method above did not fully help me because it did not extract the text the way I needed. Despite this, I managed to get the result I wanted by doing something else. I think we can close this post.

Thanks for your help and patience,

Have a nice day!

Answer 1

Hi @Ionut Dutescu,

Thank you for reaching out to the Microsoft Q&A forum!

To maintain the original document structure while extracting text from PDF images, you can use the "Draw region" option in the Custom model. This option allows you to draw regions around the text you want to extract, preserving the original layout and structure of the document.

To use this option, you need to train your Custom model using the Document Intelligence service. While labeling the training documents, select the "Draw region" option and draw a region around each block of text you want to extract. Once the model is trained, you can use it to extract text from PDF images while preserving the original document structure.

It's worth noting that the accuracy of the extracted text will depend on the quality of the PDF images and the accuracy of the regions you draw. You may need to experiment with different region sizes and positions to get the best results.
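If the labeled regions alone do not reproduce the layout, one possible approach is to position each extracted line using its coordinates on the page. The analysis result reports a bounding polygon for each line and word; the sketch below assumes those have already been reduced to the x/y of each line's top-left corner, in inches. The `Line` tuple, the `render_layout` function, and the characters-per-inch and rows-per-inch values are hypothetical and chosen only for illustration, so treat this as a rough starting point rather than a definitive implementation.

```python
from typing import Iterable, NamedTuple


class Line(NamedTuple):
    """One extracted line of text (hypothetical, simplified structure)."""
    text: str
    x: float  # left edge, assumed to be inches from the left page border
    y: float  # top edge, assumed to be inches from the top page border


def render_layout(lines: Iterable[Line],
                  chars_per_inch: float = 10.0,
                  rows_per_inch: float = 6.0) -> str:
    """Place each line on a character grid according to its coordinates."""
    # Convert page coordinates to grid cells, then process the cells
    # top to bottom and left to right.
    placed = sorted(
        (round(line.y * rows_per_inch), round(line.x * chars_per_inch), line.text)
        for line in lines
    )
    rows: dict[int, str] = {}
    for row, col, text in placed:
        current = rows.get(row, "")
        # Never overwrite text already on this row; keep a one-space gap.
        if current and col <= len(current):
            col = len(current) + 1
        rows[row] = current.ljust(col) + text
    if not rows:
        return ""
    # Emit empty strings for rows without text so vertical spacing is kept.
    return "\n".join(rows.get(r, "") for r in range(min(rows), max(rows) + 1))


if __name__ == "__main__":
    sample = [
        Line("2024-03-07", 0.5, 0.5),        # date in the left corner
        Line("user@example.com", 5.0, 0.5),  # email in the right corner
        Line("Item A  10.00", 3.0, 1.0),     # line item near the center
    ]
    print(render_layout(sample))
```

The grid density controls how closely the plain-text output tracks the page: a monospaced font at roughly 10 characters per inch is a common assumption, but scanned documents with mixed font sizes may need different values or a per-page calibration.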

I hope this helps! Thank you.
