Document Intelligence PDF text data usage

Martin P 0 Reputation points
2023-11-16T15:19:35.69+00:00

Hi,

I have noticed a difference in behavior between API versions 2022-08-31 and 2023-07-31 when sending a PDF document that contains actual text data, not just image scans. 2022-08-31 seems to honor the embedded PDF text, while 2023-07-31 sometimes returns something different.

As a concrete use case, I have invoice PDFs (with text data) containing mostly Cyrillic characters, but also hex-like hashes (a-f, 0-9) that I need to extract and process further, in addition to the fields that the prebuilt-invoice model provides. With the old version, the hashes come out as expected, since the embedded text data is already correct. With the newer version, however, I am getting misread characters (the digit "1" vs. a lowercase "l", for example), or Cyrillic characters where I would expect Latin ones (Latin "e" U+0065 vs. Cyrillic "е" U+0435). As you can imagine, this breaks my search patterns.
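To illustrate the mismatch, here is a minimal Python sketch. The hash value, the `HEX_RUN` pattern, and the `describe_non_ascii` helper are made up for this example and are not taken from my real documents or from the service's API; the sketch only shows why a hex pattern stops matching once a Latin "e" is replaced by the visually identical Cyrillic "е".

```python
import re
import unicodedata

# Hypothetical pattern for hex-like hashes (lowercase a-f and digits).
HEX_RUN = re.compile(r"[0-9a-f]{8,}")

def describe_non_ascii(token):
    """Return (char, code point, Unicode name) for every non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in token
        if ord(ch) > 127
    ]

# Made-up hash as stored in the PDF text layer (all Latin).
expected = "3fa9e1c07b2d"
# Same hash with Cyrillic "е" (U+0435) in place of the Latin "e".
returned = "3fa9\u04351c07b2d"

print(HEX_RUN.fullmatch(expected) is not None)  # the all-Latin hash matches
print(HEX_RUN.fullmatch(returned) is not None)  # the mixed-script hash does not
print(describe_non_ascii(returned))             # lists the offending character
```

Both strings look the same on screen, which is why the failure is hard to spot without inspecting code points.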

Am I correct in assuming that 2022-08-31 honors PDF text data and 2023-07-31 does not, or is there some other explanation?

Is there a workaround for 2023-07-31? I would like to avoid downgrading, due to the prebuilt-invoice model supporting more cultures.
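The only stopgap I can think of is post-processing on the client side, sketched below in Python. This is an assumption-laden sketch, not a service feature: the `HOMOGLYPHS` table, the `HEX_LIKE` pattern, and `normalize_hashes` are hypothetical names, the mapping covers only the three Cyrillic lowercase letters that look like hex digits, and it does nothing for the "1" vs. "l" case.

```python
import re

# Assumed, non-exhaustive mapping of Cyrillic look-alikes to Latin hex letters.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
})

# Whole tokens of 8+ characters made only of hex digits or the look-alikes above.
# The lookarounds keep the pattern from matching inside ordinary Cyrillic words.
HEX_LIKE = re.compile(r"(?<!\w)[0-9a-f\u0430\u0441\u0435]{8,}(?!\w)")

def normalize_hashes(text):
    """Replace Cyrillic look-alikes with Latin letters inside hex-like tokens only."""
    return HEX_LIKE.sub(lambda m: m.group(0).translate(HOMOGLYPHS), text)
```

Restricting the replacement to hex-like tokens leaves the surrounding Cyrillic invoice text untouched, but it is still a guess about what the original characters were, so I would prefer a fix on the service side.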

Are there any plans for future versions regarding this issue (2023-10-31 and onward)?

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.