Hello everyone,
I'm encountering an issue while training a custom classification model using API version 2023-07-31 (v3.1 GA). The model is trained on approximately 100 documents, categorized into two types: Invoices and Delivery Notes. So far, the data has been evenly split between these two categories.
Upon testing the model, I noticed that it performs well on invoice-type documents (confidence between 62% and 85%), but struggles with delivery notes. Although it correctly identifies them as delivery notes, the confidence is quite low (ranging from 9% to 18%).
A significant problem arises when attempting to classify unrelated documents, such as personal data or poems. Surprisingly, these are also classified as delivery notes, with a confidence level exceeding 60%.
Referring to the documentation, I found the following guidance: "The classifier attempts to assign each document to one of the classes. If you expect the model to encounter document types not present in the training dataset, it's advisable to set a threshold on the classification score or create an 'other' class to include representative samples of such documents. This ensures that irrelevant documents don't impact classifier performance."
My main concern is how to handle rejecting irrelevant documents. While the documentation suggests creating an "other" class, the challenge lies in predicting the diverse types of documents users may input. With hundreds of potential document types, it's impractical to anticipate them all. Any insights on managing this complexity would be greatly appreciated.
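For what it's worth, here is a minimal sketch of the thresholding half of that guidance: post-processing each prediction and routing anything below a confidence cutoff to a reject bucket. The function name `filter_classification` and the threshold value are my own illustrations, not part of the Document Intelligence SDK, and the cutoff would need tuning against a held-out mix of in-scope and out-of-scope documents.

```python
# Illustrative post-processing for classifier output; the names and
# threshold here are assumptions, not part of any Azure SDK.

REJECT_THRESHOLD = 0.5  # tune on a held-out set of relevant + irrelevant docs

def filter_classification(doc_type: str, confidence: float,
                          threshold: float = REJECT_THRESHOLD) -> str:
    """Return the predicted class, or 'other' when confidence is too low."""
    return doc_type if confidence >= threshold else "other"

print(filter_classification("invoice", 0.82))        # invoice
print(filter_classification("delivery_note", 0.12))  # other
```

Note the limitation this exposes for my case: an unrelated poem scored as a delivery note at 60%+ confidence would still pass a 0.5 threshold. That suggests a threshold alone may not be enough, and a small but *varied* "other" class (it likely doesn't need to enumerate every possible document type, just representative diversity) may be needed to push irrelevant documents' scores down.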
Regarding the low confidence in the second class, I assume acquiring more training data for this class might improve performance. However, I'm puzzled by the discrepancy in performance between the two classes, despite having trained them with a similar amount of data. If anyone has suggestions or best practices for optimizing classifiers, I'd be grateful for any information shared.
Thank you in advance for your assistance.