Issue with Accurately Labeling Data for Custom Extraction Model

Marques Chacon 40 Reputation points
2024-12-26T20:37:59.98+00:00

Hello,

I am building a custom extraction model to retrieve invoice numbers and amounts off of check stubs. I am running Form Recognizer Studio via a Docker container, which is currently supported up to API version 3.0.

To build the custom extraction model, I need to label my training data. I am using dynamic table layouts to extract the invoices and amounts off each page. However, there are two main issues:

  1. Sometimes the OCR misreads the text
  2. The invoice number is often part of a larger token

For scenario 1, I might need an invoice number labeled as "456123", but instead the OCR read it as "458123". Is there a way for me to edit the label so that it reflects the true value rather than what the OCR read? I am not sure if the model is training on the actual OCR read or just the layout itself. If it's just the layout, then this may not be possible.

For scenario 2, sometimes the invoice may include a date right next to it and the OCR will detect it as one token. For instance: "456123(10/29/2024)". In this case I only want to extract the "456123" part. Would I need to use region labeling for the invoice part only or does region labeling only support multi-token labeling?

If I can't exclude these in training would I need to post-process the results after extraction? Any guidance would be appreciated. Thanks.
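As a rough illustration of the post-processing fallback mentioned above, here is a minimal Python sketch that splits a fused token such as "456123(10/29/2024)" into an invoice number and a date. The function name `split_invoice_token` and the regular expression are assumptions made for this example; they are not part of Form Recognizer or its SDK, and the pattern would need adjusting to the real invoice and date formats on the check stubs.

```python
import re

# Hypothetical pattern (an assumption for this sketch): a run of digits,
# optionally followed by a parenthesized date such as "(10/29/2024)".
INVOICE_WITH_DATE = re.compile(
    r"^\s*(?P<invoice>\d+)\s*(?:\((?P<date>\d{1,2}/\d{1,2}/\d{4})\))?\s*$"
)


def split_invoice_token(token: str):
    """Return (invoice_number, date_or_None), or (None, None) if no match."""
    match = INVOICE_WITH_DATE.match(token)
    if match is None:
        return None, None
    return match.group("invoice"), match.group("date")


# Example: apply this to each extracted field value after the model runs.
print(split_invoice_token("456123(10/29/2024)"))
print(split_invoice_token("456123"))
```

Note that this only separates fused tokens. It cannot repair an OCR misread such as "458123" for "456123", since nothing in the string itself reveals the error.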

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

2 answers

  1. kothapally Snigdha 870 Reputation points Microsoft Vendor
    2024-12-26T23:37:37.0533333+00:00

    Hi Marques Chacon,

    Greetings & Welcome to the Microsoft Q&A forum! Thank you for sharing your query.

    I understand that you are having trouble accurately labeling data for a custom extraction model.

    For your first scenario regarding OCR inaccuracies, you can edit the labels in the Form Recognizer Studio. The model trains on the labeled data you provide, which means you can adjust the labels to match the correct values, even if the OCR output is incorrect. This allows you to ensure that the training data reflects the accurate values you want to extract.

    In your second scenario, where the invoice number is part of a larger token (like "456123(10/29/2024)"), you would indeed use region labeling to select just the "456123" part. Region labeling allows you to specify the exact area of the document you want to label, which can be useful for extracting specific tokens from a larger string. Please refer to this document: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/v21/label-tool?view=doc-intel-2.1.0

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".


  2. D688588 0 Reputation points
    2024-12-27T22:11:44.82+00:00

    All I have been through, by noting the opinion on Outlook or Hotmail, to entering the SAP agreement to change the password, and stop receiving messages in the Gmail account created by the SAP account. They have any Internet or communications service provider, and they own them. We have not gone through this before, but the matter has changed for the worse, from 2022 until now. And every day it gets worse, even if we do not confirm that a party aims to confirm, it is clear that we know that like us, we encourage and support them for development work, and we cooperate with everything that benefits countries and peoples, and we fight every rebel, and we fight every rebellion, and we work to cooperate with you systematically. Our commitment is enough.... All our accounts are not free of breaches, and it is reported from our accounts that we have stopped the project or postponed the project.. There is no benefit in me providing you with our username, and the fox has the master key!!!!

