Azure Document Intelligence custom classification model behaves differently with extra pages

Question

Sarah Cummings 45 Reputation points

I've trained a custom classification model in Azure Document Intelligence that recognizes whether pages in a PDF are one of three form types. I've found that if a user submits a file with additional pages at the end of the PDF, the predictions for the form pages change.

If the user submits only the form page, we positively identify that page as our form type with 80% confidence. If the user submits a 23-page document in which one page is one of the three forms and all the other pages are junk, we no longer confidently predict the form page.

Do I need to train my model differently to possibly identify this junk? Or should I create an intermediary step where I split documents and post each page to the classification model?

I'm worried that calling the model one page at a time would take much longer. I'm also wondering if cost is computed differently for 23 1-page requests vs. 1 23-page request.

VasaviLankipalle-MSFT 18,676 Reputation points Moderator

2023-10-16T22:29:58.5233333+00:00

@Sarah Cummings, did you get a chance to check my response?

VasaviLankipalle-MSFT 18,676 Reputation points Moderator

2023-10-18T03:17:19.78+00:00

Hello @Sarah Cummings, thank you for the feedback; I will share it with the product team. To my knowledge, it's better to create a separate class for the "junk" pages if you're training with files that contain them. This can help the model learn to distinguish the relevant pages from the irrelevant ones and improve its accuracy.

I have converted my comment to an answer; please take a moment to accept it.

Answer 1

Hello @Sarah Cummings, thanks for using the Microsoft Q&A platform.

Custom classification models in Azure Document Intelligence are designed to process each page of the input file separately and make a prediction for each page based on its content and layout.

This is known behavior: the model classifies each page of the input document into one of the classes in the labeled dataset, and additional pages may introduce noise or irrelevant information that can affect the model's predictions.

In this scenario, one possible workaround is to retrain the model with additional data that includes the extra pages, which should improve its accuracy. Alternatively, you can split the documents and post only the required pages to the classification model.
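Whichever workaround is used, the per-page predictions still need a filtering step that keeps only the pages classified as one of the wanted forms. The following is a minimal sketch of that step, not an official sample. It assumes the classifier result has already been parsed into a dict with a `documents` list whose entries carry `docType`, `confidence`, and `boundingRegions` with a `pageNumber`; the class names (`form_a`, `form_b`, `form_c`, `junk`) and the 0.5 threshold are hypothetical and should be replaced with your own.

```python
from typing import Any, Dict, Set, Tuple


def confident_form_pages(
    result: Dict[str, Any],
    form_types: Set[str],
    min_confidence: float = 0.5,  # hypothetical threshold, tune on your own data
) -> Dict[int, Tuple[str, float]]:
    """Map page number -> (doc type, confidence) for pages predicted as a wanted form.

    Predictions for other classes (for example a "junk" class) and predictions
    below the threshold are dropped. If several predictions cover the same page,
    the most confident one is kept.
    """
    best: Dict[int, Tuple[str, float]] = {}
    for doc in result.get("documents", []):
        doc_type = doc.get("docType")
        confidence = doc.get("confidence", 0.0)
        if doc_type not in form_types or confidence < min_confidence:
            continue
        for region in doc.get("boundingRegions", []):
            page = region["pageNumber"]
            if page not in best or confidence > best[page][1]:
                best[page] = (doc_type, confidence)
    return best


# Hand-written sample in the assumed result shape (not real service output).
sample_result = {
    "documents": [
        {"docType": "junk", "confidence": 0.91, "boundingRegions": [{"pageNumber": 1}]},
        {"docType": "form_a", "confidence": 0.83, "boundingRegions": [{"pageNumber": 2}]},
        {"docType": "form_b", "confidence": 0.31, "boundingRegions": [{"pageNumber": 3}]},
    ]
}

wanted = {"form_a", "form_b", "form_c"}
print(confident_form_pages(sample_result, wanted))  # {2: ('form_a', 0.83)}
```

With a trained "junk" class, the junk pages fall out of the result because their class is not in `wanted`; without one, the confidence threshold is the only guard, so it is worth checking it against real multi-page files.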

Also, please note that training custom models is always free with Document Intelligence. You are charged only when a model is used to analyze a document, which means billing is by the number of pages analyzed. Please see the pricing page for details: https://azure.microsoft.com/en-us/pricing/details/ai-document-intelligence/

Here are the service limit details for custom model usage: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/service-limits?view=doc-intel-3.1.0#custom-model-usage
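On the cost question (23 one-page requests vs. one 23-page request): if billing depends only on the number of pages analyzed, as described above, the two approaches should come out the same. Here is a small arithmetic sketch under that assumption. The price per page is a made-up placeholder, not the real rate, and the sketch does not model latency or request-rate limits, which may well differ between the two approaches.

```python
from decimal import Decimal
from typing import List


def billed_pages(request_page_counts: List[int]) -> int:
    """Total pages analyzed across requests, assuming purely per-page billing."""
    return sum(request_page_counts)


def estimated_cost(request_page_counts: List[int], price_per_page: Decimal) -> Decimal:
    """Estimated charge = pages analyzed x price per page (no per-request fee assumed)."""
    return billed_pages(request_page_counts) * price_per_page


# Placeholder value for illustration only; look up the actual rate on the pricing page.
PLACEHOLDER_PRICE_PER_PAGE = Decimal("0.003")

one_big_request = [23]          # one request containing 23 pages
per_page_requests = [1] * 23    # 23 requests containing 1 page each

print(billed_pages(one_big_request), billed_pages(per_page_requests))
print(estimated_cost(one_big_request, PLACEHOLDER_PRICE_PER_PAGE)
      == estimated_cost(per_page_requests, PLACEHOLDER_PRICE_PER_PAGE))
```

Posting only the form page after a preprocessing step would reduce the billed page count, since the junk pages are never analyzed.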

I hope this helps.

Regards,
Vasavi

Please accept the answer and vote 'yes' if you found it helpful, to support the community. Thanks.

Sarah Cummings 45 Reputation points

2023-10-16T22:34:24.54+00:00

Hi @VasaviLankipalle-MSFT, yes, thank you for your response. It is very helpful. Is it recommended that I create a class for these "junk" pages if I'm training with files that have them? Or is it better to assign the entire file to one of the classes we're looking for? The junk pages don't always look the same.

Alternatively, I'm experimenting with some preprocessing to limit pages before calling my classification model, but it would be nice for the model to handle the longer files without that step.
