Document Intelligence PDF text data usage

Question

Martin P 0

Hi,

I have noticed a difference in behavior between API versions 2022-08-31 and 2023-07-31 when sending a PDF document that contains actual text data, not just image scans. 2022-08-31 seems to honor the embedded PDF text, while 2023-07-31 sometimes produces something different.

As a concrete use case, I have invoice PDFs (with text data) containing mostly Cyrillic characters, but also hex-like hashes (a-f, 0-9) that I need to extract and process further, in addition to the fields that the prebuilt-invoice model provides. With the old version, the hashes come out as expected, since the text data is already correct. With the newer version, however, I am getting misread characters (the digit "1" vs. lowercase "l", for example), or Cyrillic characters where I would naturally expect Latin ones (Latin "e" U+0065 vs. Cyrillic "е" U+0435). As you can imagine, this breaks my search patterns badly.
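To make the failure mode concrete, here is a minimal Python sketch of how the substitutions described above can be detected and, for hex-like hashes only, repaired after extraction. It is not part of Document Intelligence; the names `HEX_LOOKALIKES`, `find_suspect_chars`, and `normalize_hex` are hypothetical, the lookalike table is an assumption and far from exhaustive, and it assumes the hashes are lowercase.

```python
import unicodedata

HEX_DIGITS = set("0123456789abcdef")

# Assumed lookalike table (illustrative, not exhaustive): characters that
# can be visually confused with lowercase hex digits.
HEX_LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "l": "1",       # lowercase L misread for the digit one
    "O": "0",       # capital O misread for the digit zero
}


def find_suspect_chars(token):
    """Return (index, char, Unicode name) for each char that is not 0-9a-f."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(token)
        if ch not in HEX_DIGITS
    ]


def normalize_hex(token):
    """Map known lookalikes to hex digits; return None if the result
    still contains characters outside 0-9a-f."""
    fixed = "".join(HEX_LOOKALIKES.get(ch, ch) for ch in token)
    return fixed if all(ch in HEX_DIGITS for ch in fixed) else None


if __name__ == "__main__":
    extracted = "3f\u0435a1l"  # contains Cyrillic "е" and a lowercase "l"
    print(find_suspect_chars(extracted))
    print(normalize_hex(extracted))
```

This kind of repair is only safe where the expected alphabet is known in advance (as with hex hashes); applying it to ordinary Cyrillic text would corrupt it.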

Am I correct in assuming that 2022-08-31 honors PDF text data and 2023-07-31 does not, or is there some other explanation?

Is there a workaround for 2023-07-31? I would like to avoid downgrading, due to the prebuilt-invoice model supporting more cultures.

Are there any plans for future versions regarding this issue (2023-10-31 and onward)?

Ramr-msft 17,836 Reputation points

2023-11-17T13:33:14.06+00:00

Martin P, thanks for the question. I will forward this issue to the team, and it will be fixed in the near future.