Layout Form Recognizer model issue with empty columnHeader content

Rony Avivi 20 Reputation points
2023-06-10T11:21:20.9766667+00:00

Hi,

I am using the Layout Form Recognizer model in Azure in order to extract tables from scanned documents (I've installed this version: azure-ai-formrecognizer==3.2.0). The table columns in the documents are in Hebrew. In the Form Recognizer studio, it works perfectly. However, when I try to create my own Python application to interact with the Form Recognizer service, I get empty column header content. I saw in the documentation that the Form Recognizer studio supports Hebrew only in "Print text in preview" and not "print text". I like the model's performance and wish to use it, and I want to know if there is a way to get the Hebrew column headers to work in my Python application or alternatively, if there is an efficient way to extract the scanned table data from the form recognizer studio preview.

Thanks!

Azure Document Intelligence in Foundry Tools

Answer accepted by question author

Arash Hoseinpoor 315 Reputation points
2023-06-10T11:42:27.33+00:00
  1. Custom Model: You can create a custom model using the Form Recognizer service. By creating a custom model, you can train it with your specific Hebrew documents and define the table structure, including column headers, for better extraction accuracy. Custom models allow you to fine-tune the recognition for your specific use case. You can refer to the Azure Form Recognizer documentation for details on creating and training custom models.
  2. Preprocess the documents: If creating a custom model is not feasible, you can run a separate OCR pass over the documents. OCR engines such as Tesseract or Google Cloud Vision OCR support Hebrew text recognition and can extract the text, including the column headers. Note that the Form Recognizer service analyzes document files (images or PDFs) rather than pre-extracted text, so the OCR output is combined with the Layout model's table results in your own code instead of being sent back to the service.
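
Before choosing between the two options, it can help to confirm exactly which header cells come back empty from your Python application. The following is a minimal sketch, assuming azure-ai-formrecognizer 3.2.0 and its `prebuilt-layout` model; the helper names `analyze_layout` and `empty_column_headers`, and the `path`, `endpoint`, and `key` parameters, are illustrative and not part of the SDK.

```python
def analyze_layout(path, endpoint, key):
    """Run the prebuilt layout model on a local file and return the result."""
    # Imported lazily so the helper below also works without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-layout", document=f)
        return poller.result()


def empty_column_headers(tables):
    """Return (table_index, column_index) for every header cell with no text."""
    missing = []
    for t_idx, table in enumerate(tables):
        for cell in table.cells:
            # Layout marks header cells with kind == "columnHeader".
            if cell.kind == "columnHeader" and not (cell.content or "").strip():
                missing.append((t_idx, cell.column_index))
    return missing
```

Calling `empty_column_headers(analyze_layout(...).tables)` lists the affected columns, which shows whether every Hebrew header is lost or only some of them.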

Here's a sample workflow for the preprocessing approach:

  • Use an OCR library like Tesseract or Google Cloud Vision OCR to extract the text from the scanned documents.
  • Once you have the extracted text, you can process it to identify the table structure, including column headers, using techniques like pattern matching or regular expressions.
  • The Form Recognizer service does not accept pre-extracted text or a table structure as input, so run the Layout model on the original document as before and merge the two results in your own code: use the table cells it returns for the structure, and fill the empty column headers with the OCR text found at the same positions on the page.
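
A minimal sketch of combining OCR output with Layout table cells, under several assumptions: the input is an image, so Tesseract and the Layout model both report pixel coordinates on the same page; each header fits on one line; and Hebrew words are joined right to left. The function names are illustrative. `ocr_words` relies on pytesseract, Pillow, and the Tesseract `heb` language data being installed.

```python
def box_from_polygon(polygon):
    """Convert points with .x/.y attributes into (left, top, right, bottom)."""
    xs = [p.x for p in polygon]
    ys = [p.y for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def words_in_box(words, box):
    """Select OCR words whose center lies inside box, ordered right to left."""
    left, top, right, bottom = box
    hits = []
    for w in words:
        cx = w["left"] + w["width"] / 2
        cy = w["top"] + w["height"] / 2
        if left <= cx <= right and top <= cy <= bottom:
            hits.append(w)
    # Assumption: single-line Hebrew header, so reading order is descending x.
    hits.sort(key=lambda w: w["left"], reverse=True)
    return hits


def fill_empty_headers(cells, words):
    """Map column_index to header text, using OCR words where a cell is empty."""
    headers = {}
    for cell in cells:
        if cell.kind != "columnHeader":
            continue
        text = (cell.content or "").strip()
        if not text and cell.bounding_regions:
            box = box_from_polygon(cell.bounding_regions[0].polygon)
            text = " ".join(w["text"] for w in words_in_box(words, box))
        headers[cell.column_index] = text
    return headers


def ocr_words(image_path, lang="heb"):
    """Run Tesseract on an image and return word boxes as dicts."""
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), lang=lang, output_type=pytesseract.Output.DICT
    )
    fields = zip(data["text"], data["left"], data["top"],
                 data["width"], data["height"])
    return [
        {"text": t, "left": l, "top": tp, "width": w, "height": h}
        for t, l, tp, w, h in fields
        if t.strip()
    ]
```

For a table returned by the Layout model, `fill_empty_headers(table.cells, ocr_words("scan.png"))` would give one header string per column; `scan.png` is a placeholder file name. PDFs would need a unit conversion first, since the Layout model reports PDF coordinates in inches rather than pixels.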

Remember that preprocessing the documents requires additional development effort, but it gives you more control over the extraction process, including handling Hebrew text recognition.

Evaluate both approaches based on your specific requirements and resources available, and choose the one that best suits your needs.
