Document Intelligence PDF text data usage
Hi,
I have noticed a difference in behavior between API versions 2022-08-31 and 2023-07-31 when sending a PDF document that contains actual text data, not just image scans. 2022-08-31 seems to honor the PDF text, while 2023-07-31 sometimes returns different text.
As a concrete use case, I have invoice PDFs (with text data) containing mostly Cyrillic characters, but also hex-like hashes (a-f, 0-9) that I need to extract and process further, in addition to the fields that the prebuilt-invoice model provides. With the old version, the hashes come out as expected, since the text data is already correct. With the newer version, however, I am getting misread characters (the digit "1" vs. a lowercase "l", for example), or Cyrillic characters where I would expect Latin ones (Latin "e" U+0065 vs. Cyrillic "е" U+0435). As you can imagine, this breaks my search patterns.
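To illustrate the kind of mismatch I mean, here is a minimal Python sketch using only the standard library. The sample strings and the helper name `describe_non_ascii` are made up for illustration and are not taken from a real invoice or from the service response; the point is only that a visually identical hash stops matching a hex pattern once a Cyrillic lookalike slips in.

```python
import re
import unicodedata

# Pattern for the hex-like hashes (a-f, 0-9) described above.
HEX_RE = re.compile(r"^[0-9a-f]+$")


def describe_non_ascii(token: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(token)
        if ord(ch) > 127
    ]


# Hypothetical sample values: the two strings look the same on screen.
expected = "3fe9a1"       # all Latin/ASCII, as stored in the PDF text data
misread = "3f\u04359a1"   # Cyrillic "е" (U+0435) in place of Latin "e" (U+0065)

print(bool(HEX_RE.match(expected)))  # True
print(bool(HEX_RE.match(misread)))   # False
print(describe_non_ascii(misread))   # [(2, 'U+0435', 'CYRILLIC SMALL LETTER IE')]
```

A check like this only detects the Cyrillic substitution; it does not help with the "1" vs. "l" case, where both characters are plain ASCII.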
Am I correct in assuming that 2022-08-31 honors PDF text data and 2023-07-31 does not, or is there some other explanation?
Is there a workaround for 2023-07-31? I would like to avoid downgrading, because the prebuilt-invoice model in the newer version supports more cultures.
Are there any plans for future versions regarding this issue (2023-10-31 and onward)?