Issues with Document Intelligence Character Extraction

Syed Umair Hasan 110 Reputation points
2024-05-14T23:09:46.8766667+00:00

Hello MS QA Community,

I'm encountering an issue with Document Intelligence (DI): it is misreading characters. For instance, in the image provided, a large heading contains the phrase '2.0M in THF', but DI extracts 'THF' as 'THE'. The PDF is exceptionally clear, and the letter is plainly an 'F', not an 'E'. The same misread occurs consistently across similar files containing this phrase.

User's image

Here are the specifics:

  • DI API version: 2023-07-31 (GA); the same behavior occurs in 2024-02-29 (preview)
  • Extraction method: running layout analysis for a custom extraction model

This discrepancy is causing significant errors in our data processing, and it's crucial to resolve this issue promptly. Has anyone else encountered similar problems? Any suggestions on how to improve DI's character extraction accuracy?
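
One possible stopgap, sketched here under assumptions, is to correct the known misread in the extracted text after DI returns it. The function name `fix_thf` and the pattern (a molarity such as '2.0M', then 'in', then the misread 'THE') are illustrative assumptions and not part of Document Intelligence. This only patches the one phrase and does not address the underlying recognition behavior.

```python
import re

# Assumed pattern: a molarity such as "2.0M" or "1 M", the word "in",
# then the misread token "THE". Case-sensitive on purpose, so ordinary
# lowercase "the" is never touched.
_MISREAD = re.compile(r"(\b\d+(?:\.\d+)?\s*M\s+in\s+)THE\b")


def fix_thf(text: str) -> str:
    """Replace 'THE' with 'THF' only when it follows '<number>M in '."""
    return _MISREAD.sub(r"\1THF", text)


if __name__ == "__main__":
    # Hypothetical extracted heading text, for illustration only.
    print(fix_thf("2.0M in THE"))
```

Restricting the replacement to the molarity context keeps legitimate uses of 'THE' elsewhere in the document unchanged.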

Your input would be greatly appreciated.

Thank you.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.