How to improve the results for custom extraction models?
Hello there,
we've been using Azure Document Intelligence for the past few weeks to recognize answers in scanned survey documents and are still getting accustomed to it. At the moment we're running into problems that leave us confused. The main problems are:
- text fields not being recognized properly
- understanding draw regions
- checkbox recognition being inconsistent
1 - Text fields
We have this problem with several labels and documents, but I'll explain it using one example.
We have a question asking for the date of birth, which we use two labels for: dobm for the month and doby for the year. The custom model recognizes dobm 100% of the time. With doby, however, it sometimes doesn't recognize the field at all and instead supplements the value with a seemingly random piece of text from three pages later. Other times it supplements the value with random text from the same page. Either way, it is always text that has never been labeled or been part of a draw region anywhere in the training data.
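For clarity, this is the kind of call and result inspection we're talking about. It's a minimal sketch using the azure-ai-formrecognizer Python SDK; the endpoint, key, model ID, and file name are placeholders rather than our real values:

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Placeholder endpoint, key, model ID, and file name.
client = DocumentAnalysisClient(
    endpoint="https://<resource-name>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<key>"),
)

with open("survey.pdf", "rb") as f:
    poller = client.begin_analyze_document(model_id="<custom-model-id>", document=f)
result = poller.result()

# Each labeled field comes back with its extracted content and a confidence
# score; this is where we see doby filled with unrelated text.
for doc in result.documents:
    for name, field in doc.fields.items():
        print(name, repr(field.content), field.confidence)
```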
2 - Draw regions
We put a draw region down in an area where there frequently is text in the training data, and the model either never recognizes the text at all or only recognizes parts of it. For example, we have a big text area where the person doing the survey can write down some remarks. If we put a draw region around the entire field, the model recognizes either nothing, just a few words, or only one line of a multiline answer.
3 - Checkbox recognition
Overall the checkbox recognition is pretty good, but the inconsistencies are confusing. You can have eight questions on one page, each with six checkboxes (or radio buttons), and the model recognizes the boxes for seven of the questions but then misses some or all of the boxes of the remaining question in a seemingly random way. It doesn't seem to realize there is supposed to be a box there. And similar to the text fields, it sometimes recognizes a seemingly random part of the document as one or more of these boxes.
Furthermore, sometimes a user seems to have filled out the survey with an almost-empty pen, and it's difficult even for us to tell whether they drew an X into a box or not. Other times a user might miss a box completely and draw an X, Y, \ or check mark close to the box. In both situations the trained model correctly recognizes those fields.
In other instances a user might have commendable penmanship, where the written text or checked boxes are very clear and easily recognizable to the human eye, yet the model recognizes the box as "unselected".
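If it helps to see what we're checking: a small helper like the sketch below (the name dump_checkboxes is just ours, and the value_type check reflects our understanding of how labeled checkboxes come back in azure-ai-formrecognizer) prints the labeled checkbox fields plus the raw selection marks the service detects on each page. It can be called with the result object from the earlier snippet.

```python
from azure.ai.formrecognizer import AnalyzeResult

def dump_checkboxes(result: AnalyzeResult) -> None:
    # Labeled checkbox fields come back as selection-mark fields whose value
    # is "selected" or "unselected", together with a confidence score.
    for doc in result.documents:
        for name, field in doc.fields.items():
            if field.value_type == "selectionMark":
                print(name, field.value, field.confidence)

    # The raw selection marks found on each page (independent of our labels)
    # show whether a box was detected at all or merely classified as unselected.
    for page in result.pages:
        for mark in page.selection_marks:
            print(page.page_number, mark.state, mark.confidence)
```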
All of these situations are confusing to us and we're looking for guidance on the matter.
Further information
We gave the documentation a good read, but our experience conflicts with some of the training data tips.
The following statements caught our eye:
- Use text-based PDF documents instead of image-based documents.
- Use examples that have all of their fields filled in for completed forms.
- Use forms with different values in each field.
We trained a model based on these recommendations and found that the quality of the results was sadly worse than with the models we trained on scanned documents.
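In case the build mode matters for your answer: a custom model build can be started like this with the azure-ai-formrecognizer Python SDK (a minimal sketch; the endpoint, key, and container SAS URL are placeholders, not our actual values), and the template/neural distinction from the quoted limit below corresponds to the build mode passed here:

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import (
    DocumentModelAdministrationClient,
    ModelBuildMode,
)

# Placeholder endpoint, key, and SAS URL of the blob container holding
# the labeled training documents.
admin_client = DocumentModelAdministrationClient(
    endpoint="https://<resource-name>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<key>"),
)

poller = admin_client.begin_build_document_model(
    ModelBuildMode.NEURAL,  # or ModelBuildMode.TEMPLATE
    blob_container_url="<container-sas-url>",
    description="survey extraction model",
)
model = poller.result()
print(model.model_id)
```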
Furthermore, the input requirements left us a bit confused.
"For custom extraction model training, the total size of training data is 50 MB for template model and 1G-MB for the neural model."
What is "1G-MB"? Is 50 the maximum, minimum or recommended amount of total size for training or is it always 50 MB no matter the circumstances?
We are unsure whether all our problems can be solved with more training data, as we had an instance in our testing phase where adding more training data decreased the quality of the results, even though the documents had good image quality, positioning, and alignment.
We would appreciate any help in this matter. Thank you for your time.