Issues with Document Intelligence Character Extraction
Hello MS QA Community,
I'm encountering an issue with Document Intelligence (DI) where it's not extracting characters accurately. For instance, in the image provided, there's a large heading containing the phrase '2.0M in THF', but DI extracts 'THF' as 'THE'. The PDF is exceptionally clear, and the last letter is unmistakably an 'F', not an 'E'. The same misread occurs consistently across similar files containing this phrase.
Here are the specifics:
- DI API version: 2023-07-31 (GA); the same behavior occurs with 2024-02-29 (preview)
- Extraction method: Layout analysis, run for a custom extraction model
This discrepancy is causing significant errors in our data processing, and it's crucial to resolve this issue promptly. Has anyone else encountered similar problems? Any suggestions on how to improve DI's character extraction accuracy?
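In the meantime, one stopgap would be to post-process the layout output. The sketch below is only an illustration under stated assumptions, not a fix for the OCR itself: `fix_thf` and `low_confidence_words` are hypothetical helper names, the regex assumes the phrase always follows a molar concentration such as '2.0M in', and the input dict is assumed to be the `analyzeResult` object from the layout JSON response, with `pages[].words[]` entries carrying `content` and `confidence`.

```python
import re

# Assumption: a molar concentration followed by "in THE" (e.g. "2.0M in THE")
# is a misread of the solvent abbreviation "THF". Ordinary uses of "THE"
# without a preceding concentration are left untouched.
_MISREAD = re.compile(r"(\b\d+(?:\.\d+)?\s?M\s+in\s+)THE\b")


def fix_thf(text: str) -> str:
    """Replace 'THE' with 'THF' only when it follows '<number>M in'."""
    return _MISREAD.sub(r"\1THF", text)


def low_confidence_words(analyze_result: dict, threshold: float = 0.9):
    """Return (page_number, content, confidence) for words below threshold.

    Assumes analyze_result is shaped like the layout response's
    analyzeResult object: {"pages": [{"pageNumber": ..., "words": [...]}]}.
    """
    flagged = []
    for page in analyze_result.get("pages", []):
        for word in page.get("words", []):
            confidence = word.get("confidence", 1.0)
            if confidence < threshold:
                flagged.append(
                    (page.get("pageNumber"), word.get("content"), confidence)
                )
    return flagged
```

Running `low_confidence_words` over an affected file would also show whether DI reports the misread 'THE' with high confidence. If it does, filtering on confidence alone won't catch it, and a context-bound rule like `fix_thf` may be the only practical workaround on our side.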
Your input would be greatly appreciated.
Thank you.