Best practices for training Neural Custom Extraction model

Jul-3831 20 Reputation points
2024-04-16T17:15:41.1133333+00:00

Hello,

I could not find information on best practices for training custom neural extraction models. A few areas where I need the most guidance:

  1. If training on invoices in, let's say, the Czech language, is it best to use only that language's invoices for training, or preferably a mix? (I understand that it is not recommended for languages other than English, but the question may apply to more than just languages.)
  2. Furthermore, should it be a mix of different languages (the more the better), or English + Czech, since English is the base of this neural extraction model?
  3. How easy is it to overfit the neural model? Or does the approach of "more of the same" not have a negative impact?
  4. Let's say we have a dataset of 10,000 invoices from 1,000 different vendors. If we cannot train on all 10,000, would you rather suggest a) training on 1 invoice from each of 1,000 vendors, b) training on 2 invoices from each of 500 vendors, c) training on 5 invoices from each of 200 vendors, or d) some different approach?
  5. Is the base Neural Template model biased towards invoices, receipts, or some other documents, or is its base mostly just read/layout, so that you need quite a lot of data (let's say a few hundred example documents) for actually good results?
  6. In my case, with a training set of 50 invoices, I see the General Invoice Model sometimes performing better and, in other cases, my custom model performing better. I wonder how much more work I should put in so that it becomes more robust for my custom case and consistently outperforms the General Invoice model. Is there a number I could aim for?

Hopefully my questions are clear enough; it would be super helpful to better understand how the models work internally and how they react to different training approaches.

Thank you

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

Accepted answer
  YutongTie-MSFT 46,801 Reputation points
    2024-04-17T00:14:38.0233333+00:00

    @Jul-3831

    Thanks for reaching out to us. Since you already have a detailed scenario, I am happy to enable a support ticket for you so we can evaluate the best solution for your project.

    To answer your question generally, training a custom neural extraction model on a specific language, such as Czech, would ideally require a training dataset primarily composed of documents in that language. The model needs to understand the semantics and structure of the Czech language to accurately extract the required information. If your primary target is Czech invoices, then the majority of your training data should be in Czech.
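    For reference, here is a minimal sketch of how a custom neural build is typically kicked off with the Python SDK (azure-ai-formrecognizer 3.3+). The endpoint, key, model ID, and SAS URL are placeholders, and the exact signature may differ slightly between SDK versions:

```python
# Minimal sketch: build a custom neural extraction model from labeled Czech invoices.
# Assumes azure-ai-formrecognizer >= 3.3 and a blob container (referenced by SAS URL)
# that holds the labeled training documents; all names/URLs are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentModelAdministrationClient, ModelBuildMode

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"
key = "<your-key>"
training_sas_url = "<sas-url-of-blob-container-with-labeled-czech-invoices>"

admin_client = DocumentModelAdministrationClient(endpoint, AzureKeyCredential(key))

poller = admin_client.begin_build_document_model(
    ModelBuildMode.NEURAL,                 # neural (not template) build mode
    blob_container_url=training_sas_url,
    model_id="czech-invoices-v1",
    description="Custom neural model trained on Czech invoices",
)
model = poller.result()
print(model.model_id)
```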

    However, incorporating a mix of languages in your dataset can potentially increase the model's robustness, especially if you expect to process invoices in multiple languages. The key is to ensure that the training data represents the distribution of languages you expect in your actual data. For example, if 70% of your invoices are in Czech, 20% are in English, and 10% are in German, your training data should ideally reflect this distribution. Remember, neural network models learn from the data they are provided with. Therefore, the more representative the training data is of the actual data the model will encounter, the better the model is likely to perform.
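    As an illustration of that point, below is a small, hypothetical sketch of assembling a training set whose language mix matches the distribution you expect in production. The folder layout and the 70/20/10 split are assumptions taken from the example above:

```python
# Hypothetical sketch: sample labeled invoices so the training set mirrors the
# expected production language distribution (70% Czech, 20% English, 10% German).
import random
from pathlib import Path

target_mix = {"cs": 0.70, "en": 0.20, "de": 0.10}  # assumed production distribution
training_set_size = 100                            # e.g. 100 labeled documents total

selected = []
for lang, share in target_mix.items():
    # Assumes labeled documents are grouped in per-language folders: data/cs, data/en, data/de
    candidates = sorted(Path("data", lang).glob("*.pdf"))
    n = round(training_set_size * share)
    selected.extend(random.sample(candidates, min(n, len(candidates))))

print(f"Selected {len(selected)} documents for labeling/training")
```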

    Please also be aware that training a multilingual model can be more challenging and may require additional steps, such as language detection, handling different date formats, and so on.
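    As a small illustration of one such extra step, the sketch below normalizes date strings that may come back in different locale-specific formats; the list of known formats is illustrative, not exhaustive:

```python
# Illustrative sketch: normalize date strings that extraction may return in
# different locale-specific formats (e.g. Czech "16.04.2024" vs. US "04/16/2024").
from datetime import datetime

KNOWN_FORMATS = ["%d.%m.%Y", "%d. %m. %Y", "%m/%d/%Y", "%Y-%m-%d"]

def normalize_date(raw: str) -> str:
    """Return an ISO date (YYYY-MM-DD) for the first matching known format."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(normalize_date("16.04.2024"))   # -> 2024-04-16
print(normalize_date("04/16/2024"))   # -> 2024-04-16
```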

    Secondly, the composition of your training data should ideally reflect the real-world distribution of languages you expect your model to process. If your invoices are primarily in Czech and English, then these two languages should constitute the bulk of your training data.

    Adding more languages to the mix could potentially improve the model's generalization capability, but only if you expect to process invoices in those languages in the real-world application. Training the model on languages that it won't encounter can increase complexity without providing tangible benefits and could even negatively impact the model's performance on your target languages.

    Regarding overfitting: it is a common concern when training neural networks. It occurs when a model learns to perform very well on its training data but struggles to generalize to unseen data. In other words, an overfitted model has learned the training data too well, to the point of memorizing it, including its noise and outliers, rather than learning the underlying patterns. Whether a model will overfit or not depends on several factors:

    • Dataset Size: Smaller datasets are more prone to overfitting since the model can easily memorize them. Larger datasets typically provide more variability and help the model generalize better.
    • Model Complexity: More complex models (those with more parameters) are more likely to overfit than simpler ones, especially when the amount of training data is limited.
    • Training Duration: Overfitting can occur if the model is trained for too many epochs. After a certain point, the model starts to fit the noise in the training data rather than the signal.
    • Diversity of Data: If your data is not diverse and representative of the real-world situations the model will encounter, the model will likely overfit to the specific cases it has seen in training.

    The approach of "more of the same" can potentially lead to overfitting if the additional data is not adding new information or variability. For example, simply duplicating existing training examples won't help the model generalize better. To prevent overfitting, you can:

    • Use a validation set to monitor the model's performance during training. If the performance on the validation set starts to degrade while the training performance continues improving, it's a sign of overfitting (see the sketch after this list).
    • Implement regularization techniques such as dropout, weight decay, or early stopping.
    • Collect more diverse and representative data.
    • Reduce the complexity of your model if you have a small dataset.
    • Use techniques like data augmentation to artificially increase the size and diversity of your dataset.
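    Note that the managed custom neural build in Document Intelligence does not expose training knobs such as epochs or dropout directly, so the points above mostly apply to how you curate and hold out data. For completeness, here is a generic, framework-agnostic sketch of the validation-based early-stopping pattern mentioned in the list; the training and evaluation callables are hypothetical placeholders:

```python
# Generic sketch of validation-based early stopping: stop when the validation
# metric has not improved for `patience` consecutive epochs. Not specific to
# Document Intelligence; the callables below are hypothetical placeholders.
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=100, patience=3):
    """train_one_epoch() runs one pass over the training data;
    evaluate() returns a validation score where higher is better."""
    best_score, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        train_one_epoch()
        score = evaluate()
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            print(f"Stopping early at epoch {epoch}; best validation score "
                  f"{best_score:.3f} was reached at epoch {best_epoch}")
            break
    return best_score
```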

    Please let us know if you need more details; we are also happy to enable a free ticket for you to discuss further with a support engineer.

    I hope this helps!

    Regards,

    Yutong

    -Please kindly accept the answer if you found it helpful, to support the community. Thanks a lot.

