Manually Fix Word Boundaries after OCR?

Cyril Carraz 31 Reputation points
2023-03-02T08:30:11.1133333+00:00

Context :

  • Using Form Recognizer v3
  • Training a Custom Neural Model

Problem:

After analysing a file with handwritten text, the OCR sometimes bundles several words together into a single clickable entity, which prevents me from labeling the fields correctly.

Example :

User's image

Here, I should have a field called 'Version' that captures 'v25.1', or at least 'v25.1)', but I feel that labeling the field as 'v25.1)/outcome :' would invite errors later on.

Sometimes it's worse: the OCRed entity covers two separate fields, and I'm forced to choose which one of them to label.

Question:

Is there a way to manually fix the word boundaries? Maybe through the generated JSON files?

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. VasaviLankipalle-MSFT 18,676 Reputation points Moderator
    2023-03-13T20:42:09.5333333+00:00

    Hi @Cyril Carraz, in a Custom Neural Model, manually adjusting the boundary will not help, because the region is only used to map OCR words. The model still takes the OCR words as input in both training and analysis, so if the OCR doesn't split the words correctly, the result may still contain extra characters or be missing some.

    If you still want to correct the bounding box for the Custom Neural Model, you must first update the ocr.json file (Azure portal -> Blob storage); after that, you will be able to select the corrected words.

    Additionally, region labeling can be used to correct the region for a Custom Template Model.
    I have already shared this feedback with the PG team.

    I hope it helps.

    Please accept the answer and vote 'Yes' if you found it helpful, to support the community. Thanks.

    Regards,
    Vasavi
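
For readers who want to try the ocr.json edit suggested in the answer, the following is a minimal Python sketch of splitting one OCR word into two. It rests on assumptions the thread does not confirm: that each page in the file has a `words` list whose entries carry `content`, `polygon` (eight numbers ordered top-left, top-right, bottom-right, bottom-left), and `span` (`offset`, `length`), as in the v3 layout result. The helper names `split_word` and `split_word_in_page` are hypothetical, and the polygon is cut in proportion to the character count, which is only a rough approximation for handwriting.

```python
import copy


def split_word(word, split_at):
    """Split one OCR word dict into two at character index `split_at`.

    Assumes `word` has "content", "polygon" and "span" keys (an assumption
    about the ocr.json layout, not something confirmed in the thread).
    """
    text = word["content"]
    n = len(text)
    if not 0 < split_at < n:
        raise ValueError("split_at must fall strictly inside the word")

    # Assumed point order: top-left, top-right, bottom-right, bottom-left.
    x1, y1, x2, y2, x3, y3, x4, y4 = word["polygon"]

    def lerp(a, b):
        # Linear interpolation, proportional to the number of characters.
        return a + (b - a) * split_at / n

    top = (lerp(x1, x2), lerp(y1, y2))        # cut point on the top edge
    bottom = (lerp(x4, x3), lerp(y4, y3))     # cut point on the bottom edge
    offset = word["span"]["offset"]

    # Copy the original so fields such as "confidence" carry over unchanged.
    left = copy.deepcopy(word)
    left["content"] = text[:split_at]
    left["polygon"] = [x1, y1, top[0], top[1], bottom[0], bottom[1], x4, y4]
    left["span"] = {"offset": offset, "length": split_at}

    right = copy.deepcopy(word)
    right["content"] = text[split_at:]
    right["polygon"] = [top[0], top[1], x2, y2, x3, y3, bottom[0], bottom[1]]
    right["span"] = {"offset": offset + split_at, "length": n - split_at}

    return left, right


def split_word_in_page(page, word_index, split_at):
    """Replace page["words"][word_index] with its two halves, in place."""
    left, right = split_word(page["words"][word_index], split_at)
    page["words"][word_index:word_index + 1] = [left, right]
    return page
```

Calling `split_word_in_page(page, i, 5)` on a word whose content is `v25.1)/outcome` would leave `v25.1` and `)/outcome` as two separate words. Because the two spans together cover exactly the original character range, the page-level text does not need to change. The sketch does not touch any other part of the file, so keep a backup of the original ocr.json and check the result in the labeling tool before training.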

