Random word order in OCR output
Hi
I make use of the Vision API to perform OCR on documents, and subsequently I perform entity extraction on the extracted text. Word order is crucial for entity extraction.
I recently noticed that the returned word order is sometimes essentially random. I know that Azure OCR can return odd word orders when the page is skewed, so I deskew pages before running OCR on them. However, the word order issues now also occur for pages that look perfectly straight to the human eye. I am not entirely sure this is a recent change, but I had not encountered it before, and it now happens quite regularly.
I use the following code (Python) to obtain OCR results. Note that I pin the model version to avoid unexpected changes in the results. I have been using this model version for more than a year and had not encountered this issue until a couple of weeks ago. I do not specify reading_order, so I get the default basic reading order (line by line, starting at the top left corner and ending at the bottom right corner). I avoid the natural reading order because it produces many word order mistakes, and the entity extraction models are trained on data in basic word order.
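from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

# Pin the model version so results do not change unexpectedly.
credentials = CognitiveServicesCredentials(key)
client = ComputerVisionClient(endpoint, credentials)
with open(image_path, "rb") as raw_image:
    read_response = client.read_in_stream(raw_image, raw=True, model_version="2022-04-30")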
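For completeness, the rest of the flow that produces the word sequence can be sketched roughly as follows. This is a minimal sketch, not my exact production code: `collect_words` is a hypothetical helper name and the one-second polling interval is an assumption. It polls the Read operation started by `read_in_stream` and flattens the result into words in the order the service returns them.

```python
import time


def collect_words(client, read_response, poll_seconds=1.0):
    """Poll a Read operation and return (text, bounding_box) per word, in API order.

    `collect_words` and `poll_seconds` are illustrative names, not part of the SDK.
    """
    # With raw=True, the operation URL is exposed in the response headers;
    # its last path segment is the operation ID.
    operation_id = read_response.headers["Operation-Location"].split("/")[-1]

    # Wait until the service reports a terminal status.
    while True:
        result = client.get_read_result(operation_id)
        if result.status not in ("notStarted", "running"):
            break
        time.sleep(poll_seconds)

    if result.status != "succeeded":
        raise RuntimeError(f"Read operation ended with status {result.status!r}")

    words = []
    for page in result.analyze_result.read_results:
        for line in page.lines:
            for word in line.words:
                # bounding_box holds 8 numbers: x1, y1, ..., x4, y4
                words.append((word.text, word.bounding_box))
    return words
```

The word order problems described in this post are visible directly in the list this returns, before any post-processing on my side.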
If needed, I can send the two documents for which we encountered the word order issue (more examples are available). For the document torfs.jpg, the issue can be observed on the first line item: in the OCR output, the "0.50" is located between the "15291409" and the "JDYCLAUDE" of the second line item. For the document steel.jpg, the issue is also on the first line item: in the OCR output, the words from "384" through "82" of the second line item are located between the "0.10" and the "83" of the first line item.
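To make the expected order concrete, here is how the basic reading order could be rebuilt from the word bounding boxes as a stopgap on my side. This is a minimal sketch, assuming a deskewed, single-column page and the 8-number bounding box that the Read API returns per word. `basic_reading_order` and `row_tolerance` are hypothetical names, and the tolerance value is a guess that I have not tuned.

```python
def basic_reading_order(words, row_tolerance=0.5):
    """Reorder (text, bounding_box) pairs top-to-bottom, then left-to-right.

    bounding_box is 8 numbers: x1, y1, x2, y2, x3, y3, x4, y4.
    row_tolerance (a fraction of the word height) is an assumed heuristic.
    """
    items = []
    for text, box in words:
        xs, ys = box[0::2], box[1::2]
        center_y = sum(ys) / 4.0
        height = max(ys) - min(ys)
        items.append((center_y, min(xs), height, text, box))

    # Top to bottom by vertical center.
    items.sort(key=lambda it: it[0])

    # Group words into rows: a word joins the current row when its vertical
    # center is close enough to the row's average center.
    rows = []
    for item in items:
        if rows:
            row = rows[-1]
            row_center = sum(it[0] for it in row) / len(row)
            row_height = max(it[2] for it in row)
            if abs(item[0] - row_center) <= row_tolerance * row_height:
                row.append(item)
                continue
        rows.append([item])

    ordered = []
    for row in rows:
        row.sort(key=lambda it: it[1])  # left to right within a row
        ordered.extend((text, box) for _, _, _, text, box in row)
    return ordered
```

Comparing the texts from this function with the order returned by the API would also give a simple way to flag affected pages automatically. I would still prefer a fix in the service, since a heuristic like this will break on multi-column or slightly rotated pages.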
Can this be investigated and fixed asap?
Thank you!