Hi @Estacio, Pedro Vasconcelos
Thank you for using the Microsoft Q&A forum.
To improve the classification confidence for similar document types, you can try the following techniques:
- Ensure that all variations of a document are included in the training dataset. Variations include different formats, for example, digital versus scanned PDFs.
- Separate visually distinct document types to train different models. As a general rule, if you remove all user-entered values and the documents still look similar, you need to add more training data to the existing model. If the documents are dissimilar, split your training data into different folders and train a model for each variation. You can then compose the different variations into a single model (see the sketch after this list).
- Make sure that you don't have any extraneous labels.
- For signature and region labeling, don't include the surrounding text.
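If you do split the training data and train a model per variation, the resulting custom extraction models can be composed under a single model ID. Here is a minimal sketch, assuming the azure-ai-formrecognizer Python SDK (version 3.2 or later); the endpoint, key, and component model IDs are placeholders you would replace with your own:

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentModelAdministrationClient

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"  # placeholder
key = "<your-key>"                                                 # placeholder

admin_client = DocumentModelAdministrationClient(endpoint, AzureKeyCredential(key))

# Hypothetical component models, each trained on one visual variation
# (for example, digital vs. scanned) of the same document type.
poller = admin_client.begin_compose_document_model(
    component_model_ids=["invoice-digital", "invoice-scanned"],
    description="Composed model covering digital and scanned variants",
)
composed_model = poller.result()
print(f"Composed model ID: {composed_model.model_id}")
```

At analysis time you pass the composed model ID, and the service routes each document to the best-matching component model.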
Regarding the number of training examples, going beyond the stated 100 samples per class can still improve model performance and confidence; that figure is not a hard cap, and there is no recommended upper limit on samples per class. Add as many samples as you need to improve the model's performance, but keep in mind that adding too many samples can lead to overfitting.
In addition, you can add an "out of scope" class to your custom classification schema to help the model recognize documents that do not belong to any of the defined classes. Label a few such documents as "out of scope" in your dataset so the model learns to identify irrelevant documents and assign them that label.
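As a sketch of that idea, this is how an "out of scope" class could be included when building a custom classifier programmatically, assuming the azure-ai-formrecognizer Python SDK (version 3.3 or later); the class names, container SAS URL, and folder prefixes are hypothetical, and the same setup can be done in Document Intelligence Studio:

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import (
    DocumentModelAdministrationClient,
    ClassifierDocumentTypeDetails,
    BlobSource,
)

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"  # placeholder
key = "<your-key>"                                                 # placeholder
container_url = "<SAS URL of the blob container with training data>"  # placeholder

admin_client = DocumentModelAdministrationClient(endpoint, AzureKeyCredential(key))

# Each class points at a folder (prefix) of labeled sample documents.
# "out_of_scope" is a hypothetical class holding irrelevant documents.
poller = admin_client.begin_build_document_classifier(
    doc_types={
        "invoice": ClassifierDocumentTypeDetails(
            source=BlobSource(container_url=container_url, prefix="invoice/")
        ),
        "delivery_note": ClassifierDocumentTypeDetails(
            source=BlobSource(container_url=container_url, prefix="delivery_note/")
        ),
        "out_of_scope": ClassifierDocumentTypeDetails(
            source=BlobSource(container_url=container_url, prefix="out_of_scope/")
        ),
    },
    description="Classifier with an explicit out-of-scope class",
)
classifier = poller.result()
print(f"Classifier ID: {classifier.classifier_id}")
```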
Regarding the low confidence level for delivery notes, try acquiring more training data for that class and make sure the samples cover the layout variations you see in production. Additionally, check the quality of the labeled data.
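At prediction time you can also inspect the confidence returned for each classified document and route low-confidence or out-of-scope results to manual review. A minimal sketch, again assuming the azure-ai-formrecognizer Python SDK; the classifier ID, file name, and threshold value are placeholders to tune for your workload:

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"  # placeholder
key = "<your-key>"                                                 # placeholder
CONFIDENCE_THRESHOLD = 0.7  # example value; tune against your own test set

client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))

with open("delivery_note_sample.pdf", "rb") as f:  # placeholder document
    poller = client.begin_classify_document("my-classifier-id", f)
result = poller.result()

for doc in result.documents:
    if doc.doc_type == "out_of_scope" or doc.confidence < CONFIDENCE_THRESHOLD:
        # Ambiguous or irrelevant document: send it to manual review.
        print(f"Needs review: {doc.doc_type} (confidence {doc.confidence:.2f})")
    else:
        print(f"Classified as {doc.doc_type} (confidence {doc.confidence:.2f})")
```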
I hope this helps!
If this answers your query, do click "Accept Answer" and "Yes" for "Was this answer helpful".