Improve Confidence in Custom Classification model and reject Unknown Documents

Christos Sgouros 0 Reputation points
2024-04-22T13:11:08.9533333+00:00

Hello everyone,

I'm encountering an issue while training a custom classification model using API version 2023-07-31 (3.1 General). This model is trained on approximately 100 documents, categorized into two types: Invoices and Delivery Notes. Up until now, the data has been evenly split between these two categories.

Upon testing the model, I noticed that it performs well in recognizing invoice-type documents (with confidence between 62% and 85%), but struggles with delivery notes. Although it correctly identifies them as delivery notes, the confidence level is quite low (ranging from 9% to 18%).

A significant problem arises when attempting to classify unrelated documents, such as personal data or poems. Surprisingly, these are also classified as delivery notes, with a confidence level exceeding 60%.

Referring to the documentation, I found the following guidance: "The classifier attempts to assign each document to one of the classes. If you expect the model to encounter document types not present in the training dataset, it's advisable to set a threshold on the classification score or create an 'other' class to include representative samples of such documents. This ensures that irrelevant documents don't impact classifier performance."

My main concern is how to handle rejecting irrelevant documents. While the documentation suggests creating an "other" class, the challenge lies in predicting the diverse types of documents users may input. With hundreds of potential document types, it's impractical to anticipate them all. Any insights on managing this complexity would be greatly appreciated.

Regarding the low confidence in the second class, I assume acquiring more training data for this class might improve performance. However, I'm puzzled by the discrepancy in performance between the two classes, despite having trained them with a similar amount of data. If anyone has suggestions or best practices for optimizing classifiers, I'd be grateful for any information shared.

Thank you in advance for your assistance.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. santoshkc 4,425 Reputation points Microsoft Vendor
    2024-04-23T04:41:02.5233333+00:00

    Hi @Christos Sgouros,

    Thank you for providing detailed information about the issue you are encountering while training a custom classification model using API version 2023-07-31 (3.1 General).

    Regarding the challenge of rejecting irrelevant documents, creating an "other" class is a good approach. You can add an "out of scope" class to your custom classification schema. This class will help your model recognize documents that do not belong to any of the defined classes. You can then add a few documents to your dataset to be labeled as "out of scope". The model can learn to recognize irrelevant documents and predict their labels accordingly.
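The "other" class and the score threshold mentioned in the documentation can be combined when handling the classifier output. The following is only a minimal sketch: it assumes the predicted class name and confidence have already been read from the classification result, and the label names (`out_of_scope`, `invoice`, `delivery_note`) and cut-off values are hypothetical placeholders that would need tuning on held-out documents.

```python
from typing import Optional

# Hypothetical label for the catch-all class and hypothetical cut-offs;
# neither comes from the service, so tune them on your own validation set.
OTHER_LABEL = "out_of_scope"
THRESHOLDS = {"invoice": 0.60, "delivery_note": 0.60}
DEFAULT_THRESHOLD = 0.50


def route(doc_type: str, confidence: float) -> Optional[str]:
    """Return the accepted class name, or None when the document is rejected."""
    # Documents the model assigns to the catch-all class are always rejected.
    if doc_type == OTHER_LABEL:
        return None
    # Documents assigned to a real class are rejected when the score is too low.
    if confidence < THRESHOLDS.get(doc_type, DEFAULT_THRESHOLD):
        return None
    return doc_type


if __name__ == "__main__":
    print(route("invoice", 0.85))        # accepted as an invoice
    print(route("delivery_note", 0.12))  # rejected: below the cut-off
    print(route("out_of_scope", 0.90))   # rejected: catch-all class
```

A threshold on its own would not help with the scores described in the question, where unrelated documents score higher as delivery notes than real delivery notes do, so the catch-all class is doing most of the work in this sketch.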

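    Regarding the low confidence level for delivery notes, you can try acquiring more training data for that class and making sure the samples are diverse enough to cover the layouts you expect. It is also worth checking the quality of the labeled data, for example by confirming that no document is assigned to the wrong class.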
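One way to check whether additional data has actually helped is to compare the confidence scores of held-out in-scope documents with those of unrelated documents and see whether any cut-off separates them. The sketch below is an illustration only; `pick_threshold` is a hypothetical helper, and the score lists are made-up numbers rather than results from the model in the question.

```python
from typing import Sequence, Tuple


def pick_threshold(in_scope: Sequence[float],
                   out_of_scope: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, accuracy) for the cut-off with the most correct decisions.

    A decision is correct when an in-scope score is >= the threshold (accepted)
    or an out-of-scope score is < the threshold (rejected).
    """
    total = len(in_scope) + len(out_of_scope)
    if total == 0:
        raise ValueError("at least one score is required")
    best_t, best_correct = 0.0, -1
    # Only observed scores need to be tried as candidate cut-offs.
    for t in sorted(set(in_scope) | set(out_of_scope)):
        correct = (sum(c >= t for c in in_scope)
                   + sum(c < t for c in out_of_scope))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t, best_correct / total


if __name__ == "__main__":
    # Made-up validation scores for one class and for unrelated documents.
    threshold, accuracy = pick_threshold([0.62, 0.70, 0.85], [0.10, 0.30, 0.65])
    print(threshold, accuracy)
```

If the best achievable accuracy stays low, the two score distributions overlap, which suggests that the class needs more or more varied training samples rather than a different threshold.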

    I hope this helps. Thank you.
