Azure AI Document Intelligence - Custom Classification Model Low Confidence

Estacio, Pedro Vasconcelos 20 Reputation points
2024-06-20T09:44:37.44+00:00

I am working on a Custom Classification Model using Azure AI Document Intelligence and I am getting low confidence scores on my classes. Specifically, the model struggles to differentiate between "Non-Structured documents" and "Reports" due to their similar structure and content.

I have reviewed the documentation and implemented the recommended best practices. However, I am seeking additional tips or strategies that could help improve the model's confidence.

Additionally, I noticed that while the documentation states the maximum allowed number of document samples per class is 100, I added 150 examples to one of my classes and the model trained without issues. My questions are:

  1. Are there any other techniques or additional best practices beyond the provided documentation that can help improve the classification confidence for similar document types?
  2. Does increasing the number of training examples beyond the stated limit of 100 per class provide any benefit in terms of improving model performance and confidence? Is there a recommended upper limit for the number of samples per class?

Thanks.


Accepted answer
  1. dupammi 8,535 Reputation points Microsoft Vendor
    2024-06-20T10:05:38.26+00:00

    Hi @Estacio, Pedro Vasconcelos

    Thank you for using the Microsoft Q&A forum.

    To improve the classification confidence for similar document types, you can try the following techniques:

    1. Ensure that all variations of a document are included in the training dataset. Variations include different formats, for example, digital versus scanned PDFs.
    2. Separate visually distinct document types to train different models. As a general rule, if you remove all user-entered values and the documents look similar, you need to add more training data to the existing model. If the documents are dissimilar, split your training data into different folders and train a model for each variation. You can then compose the different variations into a single model.
    3. Make sure that you don't have any extraneous labels.
    4. For signature and region labeling, don't include the surrounding text.
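    As a quick sanity check on point 1, a small script can audit whether every class in your training set includes all the variations you care about (for example digital versus scanned). This is only a minimal sketch: the `(class, variant)` pair list and the variant names are assumptions about how you might tag your own samples, not anything the service requires.

```python
from collections import defaultdict

def missing_variants(samples, required=("digital", "scanned")):
    """Given (class_name, variant) pairs, return a dict mapping each
    class to the list of required variants it has no samples for.
    Classes with full coverage are omitted."""
    seen = defaultdict(set)
    for cls, variant in samples:
        seen[cls].add(variant)
    return {cls: [v for v in required if v not in variants]
            for cls, variants in seen.items()
            if any(v not in variants for v in required)}

# Hypothetical dataset: derive this from your own training-folder layout.
samples = [
    ("Reports", "digital"),
    ("Reports", "scanned"),
    ("Non-Structured documents", "digital"),  # no scanned copies yet
]
print(missing_variants(samples))  # flags the class missing scanned samples
```

    Running an audit like this before each training run makes it easy to see which class needs more variation before you spend time retraining.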

    Regarding the number of training examples: adding samples beyond the documented figure of 100 per class can still improve model performance and confidence, and there is no published recommended upper limit. Keep in mind, though, that the documented limit is what the service is guaranteed to support, and that piling on many near-duplicate samples can lead to overfitting rather than better confidence.

    In addition, you can add an "out of scope" class to your custom classification schema to help the model recognize documents that do not belong to any of the defined classes. Label a few such documents as "out of scope" in your dataset; the model can then learn to recognize irrelevant documents and predict their labels accordingly.
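    Downstream of the classifier, you can combine the "out of scope" class with a confidence cut-off to decide which documents to process automatically and which to send for manual review. The sketch below assumes a `{class: confidence}` mapping like the per-class scores the classifier returns; the 0.7 threshold and the `out_of_scope` class name are illustrative choices you would tune on your own validation set.

```python
CONFIDENCE_THRESHOLD = 0.7  # assumed cut-off; tune on a validation set

def route(doc_types, threshold=CONFIDENCE_THRESHOLD):
    """Pick the top class from a {class: confidence} mapping, or flag
    the document for manual review when the classifier is unsure or
    predicts the hypothetical 'out_of_scope' class."""
    top_class, top_conf = max(doc_types.items(), key=lambda kv: kv[1])
    if top_class == "out_of_scope" or top_conf < threshold:
        return ("manual_review", top_class, top_conf)
    return ("auto", top_class, top_conf)

print(route({"Reports": 0.55, "Non-Structured documents": 0.40}))  # unsure
print(route({"Reports": 0.93, "Non-Structured documents": 0.05}))  # confident
```

    This keeps low-confidence predictions from silently flowing into your pipeline while you work on improving the model itself.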

    Regarding the low confidence between similar classes such as "Non-Structured documents" and "Reports", you can try acquiring more training data and making sure the dataset is diverse enough. Additionally, check the quality of the labeled data.
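    One simple labeled-data quality check is to count samples per class and flag any class that is under-represented relative to the others; a badly imbalanced dataset is a common cause of one class dominating similar-looking documents. The `min_samples=20` floor here is an assumed heuristic for illustration, not a documented service threshold.

```python
from collections import Counter

def class_balance(labels, min_samples=20):
    """Count samples per class and flag classes below a minimum size.
    `labels` is one class name per training sample."""
    counts = Counter(labels)
    return {cls: {"count": n, "underrepresented": n < min_samples}
            for cls, n in counts.items()}

# Hypothetical label list, one entry per labeled training sample.
labels = ["Reports"] * 150 + ["Non-Structured documents"] * 12
print(class_balance(labels))  # flags the small class
```

    If one of the two confusable classes turns out to have far fewer samples, topping it up is usually a cheaper first step than restructuring the schema.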

    I hope this helps!


    If this answers your query, do click Accept Answer and Yes for was this answer helpful.

