Issues with Document Intelligence Character Extraction

Syed Umair Hasan 110 Reputation points

Hello MS QA Community,

I'm encountering an issue with Document Intelligence (DI) where it's not extracting characters accurately. For instance, in the image provided, there's a large heading containing the phrase '2.0M in THF', but DI is extracting 'THF' as 'THE', which is incorrect. The PDF is exceptionally clear, and the letter is clearly an 'F', not an 'E'. This same behavior is occurring consistently across similar files containing this phrase.

User's image

Here are the specifics:

DI API version: 2023-07-31 (GA); the same happens in 2024-02-29 (preview)
Extraction method: running layout for a custom extraction model

This discrepancy is causing significant errors in our data processing, and it's crucial to resolve this issue promptly. Has anyone else encountered similar problems? Any suggestions on how to improve DI's character extraction accuracy?

Your input would be greatly appreciated.

Thank you.

navba-MSFT 17,900 Reputation points Microsoft Employee

2024-05-15T05:10:43.9833333+00:00

@Syed Umair Hasan Welcome to the Microsoft Q&A forum. Thank you for posting your query here!

I just tested both the prebuilt Read and Layout models, and both detected and recognized the text correctly. See below:

Could you please test with the Pre-built models and check if that helps?

If you have any follow-up questions, please let me know. I would be happy to help.
Syed Umair Hasan 110 Reputation points

2024-05-15T18:17:25.5533333+00:00

Hello, I just tested it again. I am training a custom extraction model, and when I click Run layout I still get 'E' instead of 'F'.
navba-MSFT 17,900 Reputation points Microsoft Employee

2024-05-16T09:10:11.88+00:00
@Syed Umair Hasan Thanks for getting back.

Could you please confirm if you have checked the below points for your pdf files:

PDF (text-embedded or scanned). Text-embedded PDFs are best to eliminate the possibility of error in character extraction and location. Scanned PDFs are handled as images.

PDF dimensions are up to 17 x 17 inches, corresponding to Legal or A3 paper size, or smaller.

Awaiting your reply.
Syed Umair Hasan 110 Reputation points

2024-05-16T14:40:53.95+00:00

Hi @navba-MSFT, the PDF is text-embedded, not scanned. It was created from a Word file, and its dimensions correspond to A4 paper. Thanks.
navba-MSFT 17,900 Reputation points Microsoft Employee

2024-05-17T05:16:50.7733333+00:00

@Syed Umair Hasan Could you please collect a browser HAR trace while running the analysis and selecting the text in Document Intelligence Studio? Share the HAR file here.

Awaiting your reply.
Syed Umair Hasan 110 Reputation points

2024-05-17T16:51:17.3533333+00:00

Hi @navba-MSFT, could you please provide an email address where I can send the HAR file? Thanks!
navba-MSFT 17,900 Reputation points Microsoft Employee

2024-05-18T02:01:30.8066667+00:00

@Syed Umair Hasan Please send the trace in a private message. I have also requested a few more details; please provide those in the private message as well.
navba-MSFT 17,900 Reputation points Microsoft Employee

2024-05-20T04:16:38.6+00:00

@Syed Umair Hasan This is a quick follow-up to check if you had a chance to share the HAR trace and a few more details over private message. Awaiting your reply.
Syed Umair Hasan 110 Reputation points

2024-05-20T19:19:24.57+00:00

Hi @navba-MSFT , done!
navba-MSFT 17,900 Reputation points Microsoft Employee

2024-05-21T16:11:26.79+00:00

@Syed Umair Hasan Thanks for sharing the requested details. I have involved the Product Owners and shared these details with them. I will keep you posted about the updates.
navba-MSFT 17,900 Reputation points Microsoft Employee

2024-05-27T05:55:17.94+00:00

@Syed Umair Hasan I am yet to hear back from the Product Owners. I am following up with them for updates. I will get back to you once I hear from them.
navba-MSFT 17,900 Reputation points Microsoft Employee

2024-05-28T06:02:14.9566667+00:00

@Syed Umair Hasan Apologies for the delay in getting back. I appreciate your patience on this.

I have heard from the Product owners. I am sharing their analysis here as received:

We apologize for any inconvenience this may have caused. This is not a bug but a system limitation. The root cause lies in the design of the current recognizer system. The system takes language probability into account when decoding recognition results. For example, "THE" is much more likely in the language model than "THF," and since "E" and "F" only have minor visual differences, the output may not be stable.
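
As a rough illustration of the mechanism described above, here is a toy sketch in Python. It is not the actual Document Intelligence recognizer: the `decode` helper, the two candidate words, and every probability are invented for illustration. It only shows how a strong language prior can outweigh a visual score that slightly favours the correct reading.

```python
# Toy illustration only: the numbers are invented, and this is not how the
# real Document Intelligence recognizer is implemented.

def decode(visual_scores, language_prior):
    """Return the candidate with the highest visual_score * language_prior,
    together with the normalized scores for all candidates."""
    combined = {
        word: visual_scores[word] * language_prior.get(word, 0.0)
        for word in visual_scores
    }
    total = sum(combined.values())
    if total == 0:
        raise ValueError("no candidate has a non-zero combined score")
    posterior = {word: score / total for word, score in combined.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


# Hypothetical scores: the image itself slightly favours "THF" ...
visual_scores = {"THF": 0.60, "THE": 0.40}
# ... but the language prior strongly favours the common word "THE".
language_prior = {"THF": 0.001, "THE": 0.999}

best, posterior = decode(visual_scores, language_prior)
print(best)  # "THE" wins despite the lower visual score
```

With an uninformative prior (equal weight for both words), the same visual scores would select "THF", which is why rare tokens such as chemical abbreviations are the ones most exposed to this effect.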

In the near term, we recommend that customers check the confidence score of the words they are interested in. If the confidence is very low, we suggest using human-in-the-loop to correct the results. In the longer term, we are conducting research to eliminate legacy language-dependent decoding. However, this is still in the research phase and will not be included in the upcoming GA release.
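
The confidence check suggested above could be sketched as follows. This is a minimal sketch, assuming the JSON shape of the Layout REST response, where each entry in `analyzeResult.pages[].words[]` carries `content` and `confidence`. The helper name `flag_low_confidence_words`, the 0.9 threshold, and the sample data are assumptions made for illustration; a suitable threshold has to be tuned on your own documents, and it is worth verifying that the misread word actually receives a low score there.

```python
# Sketch only: the threshold and the sample data below are assumptions.

def flag_low_confidence_words(analyze_result, threshold=0.9):
    """Collect (page_number, content, confidence) for every word whose
    confidence is below the threshold, so a human can review it."""
    flagged = []
    for page in analyze_result.get("pages", []):
        for word in page.get("words", []):
            # Treat a missing confidence as 0.0 so the word gets reviewed.
            confidence = word.get("confidence", 0.0)
            if confidence < threshold:
                flagged.append(
                    (page.get("pageNumber"), word.get("content"), confidence)
                )
    return flagged


# Made-up data shaped like the "analyzeResult" object of a Layout response.
sample = {
    "pages": [
        {
            "pageNumber": 1,
            "words": [
                {"content": "2.0M", "confidence": 0.99},
                {"content": "in", "confidence": 0.99},
                {"content": "THE", "confidence": 0.62},
            ],
        }
    ]
}

for page_number, content, confidence in flag_low_confidence_words(sample):
    print(f"page {page_number}: review {content!r} (confidence {confidence:.2f})")
```

The flagged words can then be routed to a manual review queue, which is the human-in-the-loop step mentioned above.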

Hope this helps. If you have any follow-up questions, please let me know. I would be happy to answer them.
