How to improve the results for custom extraction models?

OmniRob 1 Reputation point
2024-05-07T11:38:52.27+00:00

Hello there,

we've been using Azure Document Intelligence for the past few weeks to recognize answers in a scanned survey document and are still getting accustomed to it. At the moment we are running into problems that leave us confused. The main problems are:

  1. text fields not being recognized properly
  2. understanding draw regions
  3. checkbox recognition being inconsistent

1 - Text fields

We have this problem for several labels and documents, but I'll explain it with one example.

We have a question for the date of birth which we use two labels for.

dobm for month and doby for year. The custom model recognizes dobm 100% of the time. With doby, however, it sometimes doesn't recognize the field at all and then, for some reason, fills in the value with seemingly random text from three pages later.

Other times it fills in the value with random text from the same page. But it is always text that has never been labeled or been part of a draw region in any of the training data.
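
One mitigation we've been experimenting with (our own post-processing idea, not an Azure feature): every extracted field in the analyze result carries a confidence score, and the spurious doby values tend to come back with low confidence, so thresholding can flag them for review. This is a simplified sketch over a plain dict shaped like the field output - the names and threshold are assumptions, not SDK code:

```python
# Hypothetical post-processing sketch: split extracted fields into
# "accepted" and "needs review" buckets by confidence, so spurious
# matches (like a doby value pulled from three pages later) are
# flagged instead of silently accepted. The dict shape mirrors the
# field JSON returned by the analyze operation; the 0.8 threshold
# is an assumption to tune per document type.

def filter_fields(fields: dict, threshold: float = 0.8):
    """Return (accepted, needs_review) maps of field name -> content."""
    accepted, needs_review = {}, {}
    for name, field in fields.items():
        if field.get("confidence", 0.0) >= threshold:
            accepted[name] = field.get("content")
        else:
            needs_review[name] = field.get("content")
    return accepted, needs_review


# Example result with a confident dobm and a dubious doby:
fields = {
    "dobm": {"content": "05", "confidence": 0.97},
    "doby": {"content": "random text", "confidence": 0.31},
}
accepted, needs_review = filter_fields(fields)
```

This doesn't fix the underlying model, but it keeps the random text out of downstream data.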

2 - Draw regions

We put a draw region down in an area where there is frequently text in the training data, and it either never recognizes the text at all or recognizes only parts of it. For example, we have a big text area where the person taking the survey can write down some remarks. If we put a draw region around the entire field, the model recognizes nothing, just a few words, or only one line of text when the content is multiline.
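
As a fallback we've considered (again our own idea, not a documented workaround): the layout result contains every OCR line with its bounding box, so the full remarks field can be re-assembled by collecting all lines that fall inside the drawn region. This sketch simplifies the polygons to axis-aligned boxes `(x0, y0, x1, y1)`; the coordinates are made up for illustration:

```python
# Fallback sketch: gather every OCR line whose bounding box lies
# inside a region box, to reconstruct a multiline field when the
# custom model only returns one line of it. Boxes are simplified to
# axis-aligned (x0, y0, x1, y1) tuples; real results use polygons.

def lines_in_region(lines, region):
    """Join the text of all lines fully contained in `region`."""
    rx0, ry0, rx1, ry1 = region
    collected = []
    for text, (x0, y0, x1, y1) in lines:
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1:
            collected.append(text)
    return "\n".join(collected)


# Two lines inside the remarks area, one footer line outside it:
lines = [
    ("The survey was", (1.0, 5.0, 3.0, 5.3)),
    ("easy to fill in.", (1.0, 5.4, 3.1, 5.7)),
    ("Unrelated footer", (1.0, 9.5, 3.0, 9.8)),
]
remarks = lines_in_region(lines, (0.8, 4.8, 3.5, 6.0))
```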

3 - Checkbox recognition

Overall the checkbox recognition is pretty good, but the inconsistencies are confusing. You can have eight questions on one page, each with six checkboxes (or radio buttons), and the model recognizes the boxes for seven questions but then misses the boxes of the remaining question in a seemingly random way. It doesn't seem to realize there is supposed to be a box there. And similar to the text fields, it sometimes recognizes a seemingly random part of the document as one or more of these boxes.

Furthermore, sometimes a user seems to have filled out the survey with an almost empty pen, and even for us it is difficult to tell whether they drew an X into a box or not. Other times a user might miss a box completely and draw an X, Y, \ or check mark close to the box. In both situations the trained model correctly recognizes those fields.

In another instance a user might have commendable penmanship, where the written text or checked boxes are very clear and easily recognizable to the human eye, yet the model reports the box as "unselected".
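
Since the selection marks also come back with confidence scores, the rule we've been sketching is to trust a mark only above a threshold and route everything else to manual review. The `"selected"`/`"unselected"` states match what Document Intelligence reports for selection marks, but the threshold and function are our own hypothetical illustration:

```python
# Hypothetical review rule for inconsistent checkboxes: accept the
# model's selection-mark state only when its confidence clears a
# threshold; otherwise mark it for human review (faint pen strokes,
# stray marks next to the box, etc.). The 0.85 threshold is an
# assumption to tune on real survey scans.

def resolve_checkbox(state: str, confidence: float,
                     threshold: float = 0.85) -> str:
    if confidence >= threshold:
        return state          # trust the model's "selected"/"unselected"
    return "needs_review"     # too uncertain to accept automatically


high = resolve_checkbox("selected", 0.95)    # confident -> keep state
low = resolve_checkbox("unselected", 0.40)   # doubtful -> review queue
```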

All of these situations are confusing to us and we're looking for guidance on the matter.

Further information

We gave the documentation a good read, but our experience conflicted with the training data tips.

The following statements are what caught our eyes:

  • Use text-based PDF documents instead of image-based documents.
  • Use examples that have all of their fields filled in for completed forms.
  • Use forms with different values in each field.

We trained a model based on these recommendations and found that the quality of the results was sadly worse than with the models we trained on scanned documents.

Furthermore, the input requirements left us a bit confused.

"For custom extraction model training, the total size of training data is 50 MB for template model and 1G-MB for the neural model."

What is "1G-MB"? Is 50 MB the maximum, minimum, or recommended total size of the training data, or is it always 50 MB no matter the circumstances?

We are unsure whether all our problems can be solved with more training data, as we had an instance in our testing phase where more training data decreased the quality of the results, even though the documents had good margins for image quality, position, and alignment.
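
To illustrate what we mean by quality decreasing: a per-field comparison like the following sketch (made-up field names and values, plain Python, no SDK) shows how a retrain can keep the same overall accuracy while breaking different fields than before, which is why the aggregate numbers alone didn't tell us much:

```python
# Sketch of a per-field regression check between two trained models:
# score each model's extracted fields against a small labeled
# ground-truth set. All field names and values below are invented
# for illustration.

def field_accuracy(predictions: dict, ground_truth: dict) -> float:
    """Fraction of ground-truth fields the model got exactly right."""
    correct = sum(
        1 for name, value in ground_truth.items()
        if predictions.get(name) == value
    )
    return correct / len(ground_truth)


ground_truth = {"dobm": "05", "doby": "1984", "remarks": "none"}
old_model = {"dobm": "05", "doby": "1984", "remarks": "n0ne"}  # remarks wrong
new_model = {"dobm": "05", "doby": "garbage", "remarks": "none"}  # doby wrong

old_acc = field_accuracy(old_model, ground_truth)  # 2/3
new_acc = field_accuracy(new_model, ground_truth)  # 2/3, but a different field broke
```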

We would appreciate any help in this matter. Thank you for your time.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. kxr 0 Reputation points
    2024-05-21T04:37:38.84+00:00

    @OmniRob I don't have answers to your issues, but here is my experience. 1 - Yes, it maps some random text from the page. This is extremely annoying. It used to happen a lot in Form Recognizer v2 and earlier, especially in documents with very close fields. Azure DI has been fairly decent and it happens very rarely, though it seems to be happening more in v3.1 than v3.0 - I don't have data to back that up, it's just how it seems in our recent tests.

    The best option for training is to start with the minimum five forms with all fields filled in - scanned PDFs seem to work fine. From there, based on what you see in testing, add more forms to the training data that are closer to real data - not all fields filled in - then train a new model and see how it performs. Adding more documents to the training data does not necessarily make it better.

    You will have to test properly every time you train a model with additional training data: the new training data might fix a field you are having a problem with, but you have to make sure it is not breaking something else.

    There is no direct solution for the random text that gets mapped to a field.
