Document Intelligence PDF text data usage

Martin P 0 Reputation points
2023-11-16T15:19:35.69+00:00

Hi,

I have noticed a difference in behavior between API versions 2022-08-31 and 2023-07-31 when sending a PDF document that contains actual text data, not just image scans. 2022-08-31 seems to honor the embedded PDF text, while 2023-07-31 sometimes returns different text.

As a concrete use case, I have invoice PDFs (with text data) containing mostly Cyrillic characters, but also hex-like hashes (a-f, 0-9) that I need to extract and process further, in addition to the fields that the prebuilt-invoice model provides. With the old version, the hashes come out as expected, since the text data is already correct. With the newer version, however, I get misread characters (the digit "1" vs. a lowercase "l", for example), or Cyrillic characters where I would expect Latin ones (Latin "e" U+0065 vs. Cyrillic "е" U+0435). As you can imagine, this badly confuses search patterns.
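To make the problem concrete, here is a minimal Python sketch of the kind of post-processing I would otherwise have to bolt on. Everything in it is an illustrative assumption on my side: the look-alike table, the `find_hashes` helper, and the minimum token length are not part of Document Intelligence, and the table is certainly not exhaustive.

```python
import re

# Assumed look-alike table (illustrative, not exhaustive): each key is a
# character that may show up where a hex digit was expected.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic "а"
    "\u0441": "c",  # Cyrillic "с"
    "\u0435": "e",  # Cyrillic "е"
    "\u043e": "0",  # Cyrillic "о"
    "l": "1",       # lowercase L returned instead of the digit one
    "O": "0",       # capital O returned instead of the digit zero
}
TABLE = str.maketrans(CONFUSABLES)


def find_hashes(text, min_len=16):
    """Return hash-like tokens from text, normalized to lowercase hex.

    A token is a run of at least min_len characters drawn from the hex
    digits plus the look-alikes above. Substitution is applied only
    inside such runs, so ordinary words are left alone.
    """
    extra = "".join(re.escape(ch) for ch in CONFUSABLES)
    candidate = re.compile(f"[0-9a-fA-F{extra}]{{{min_len},}}")
    return [m.group().translate(TABLE).lower() for m in candidate.finditer(text)]


# Sample text: the hash contains a Cyrillic "е", a lowercase L and a capital O.
raw = "\u0425\u0435\u0448: 3f9a0b1c2d4\u04355f6al7O8"

# A plain hex pattern does not match the 20-character token as returned.
plain_match = re.search(r"[0-9a-f]{20}", raw)
hashes = find_hashes(raw)
```

With this sample, `plain_match` is `None` while `hashes` holds the single normalized token. The obvious weakness is that the substitutions are guesses: a long run of Cyrillic "а", "с", "е", "о" would also be treated as a hash, which is why I would prefer the service to return the original PDF text.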

Am I correct in assuming that 2022-08-31 honors PDF text data and 2023-07-31 does not, or is there some other explanation?

Is there a workaround for 2023-07-31? I would like to avoid downgrading, because the newer prebuilt-invoice model supports more cultures.

Are there any plans for future versions regarding this issue (2023-10-31 and onward)?

Azure Document Intelligence in Foundry Tools
