Hello everyone,
I'm encountering an issue while training a custom classification model using API version 2023-07-31 (v3.1 GA). The model is trained on approximately 100 documents, categorized into two types: Invoices and Delivery Notes. So far, the data has been evenly split between these two categories.
Upon testing the model, I noticed that it performs well on invoice-type documents (confidence between 62% and 85%), but struggles with delivery notes. Although it correctly identifies them as delivery notes, the confidence is quite low (ranging from 9% to 18%).
A significant problem arises when attempting to classify unrelated documents, such as personal data or poems. Surprisingly, these are also classified as delivery notes, with a confidence level exceeding 60%.
Referring to the documentation, I found the following guidance: "The classifier attempts to assign each document to one of the classes. If you expect the model to encounter document types not present in the training dataset, it's advisable to set a threshold on the classification score or create an 'other' class to include representative samples of such documents. This ensures that irrelevant documents don't impact classifier performance."
My main concern is how to handle rejecting irrelevant documents. While the documentation suggests creating an "other" class, the challenge lies in predicting the diverse types of documents users may input. With hundreds of potential document types, it's impractical to anticipate them all. Any insights on managing this complexity would be greatly appreciated.
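For what it's worth, here is a minimal sketch of the thresholding half of that guidance: post-processing each prediction and routing anything below a confidence cutoff to a reject bucket. The function name `filter_classification` and the threshold value are my own illustrations, not part of the Document Intelligence SDK, and the cutoff would need tuning against a held-out mix of in-scope and out-of-scope documents.

```python
# Illustrative post-processing for classifier output; the names and
# threshold here are assumptions, not part of any Azure SDK.

REJECT_THRESHOLD = 0.5  # tune on a held-out set of relevant + irrelevant docs

def filter_classification(doc_type: str, confidence: float,
                          threshold: float = REJECT_THRESHOLD) -> str:
    """Return the predicted class, or 'other' when confidence is too low."""
    return doc_type if confidence >= threshold else "other"

print(filter_classification("invoice", 0.82))        # invoice
print(filter_classification("delivery_note", 0.12))  # other
```

Note the limitation this exposes for my case: an unrelated poem scored as a delivery note at 60%+ confidence would still pass a 0.5 threshold. That suggests a threshold alone may not be enough, and a small but *varied* "other" class (it likely doesn't need to enumerate every possible document type, just representative diversity) may be needed to push irrelevant documents' scores down.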
Regarding the low confidence in the second class, I assume acquiring more training data for this class might improve performance. However, I'm puzzled by the discrepancy in performance between the two classes, despite having trained them with a similar amount of data. If anyone has suggestions or best practices for optimizing classifiers, I'd be grateful for any information shared.
Thank you in advance for your assistance.