- Custom model: You can train a custom model in Form Recognizer on a labeled set of your own Hebrew documents. During labeling you mark the tables and their column headers, so extraction is tuned to your layout and your specific use case. The Azure Form Recognizer documentation covers how to create and train custom models.
- Preprocess the documents: If creating a custom model is not feasible, you can run the documents through a separate OCR (Optical Character Recognition) step first. Tesseract (an open-source engine) and Google Cloud Vision OCR (a cloud service) both support Hebrew text recognition. They extract the text from the documents, including the column headers, and you can then use that text alongside Form Recognizer when extracting the tables. Keep in mind that Form Recognizer analyzes document files such as PDFs and images, not raw text.
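For the custom model option, calling a trained model from Python might look roughly like the sketch below. It assumes the `azure-ai-formrecognizer` package (version 3.2 or later); the endpoint, key, and model ID are placeholders you would supply yourself, and `table_to_rows` is a hypothetical helper name, not part of the SDK.

```python
from types import SimpleNamespace


def table_to_rows(table):
    """Arrange a table's cells (row_index, column_index, content) into a list of rows."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


def analyze_with_custom_model(path, endpoint, key, model_id):
    """Analyze one document with a trained custom model and return its tables as rows.

    endpoint, key and model_id are placeholders for your own resource and model.
    """
    # Imported here so table_to_rows can be used without the Azure SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(path, "rb") as document:
        poller = client.begin_analyze_document(model_id, document=document)
        result = poller.result()
    return [table_to_rows(table) for table in (result.tables or [])]


# Stand-in table with the same attribute names as the SDK's table objects,
# used only to show what table_to_rows produces.
demo_table = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="תאריך"),
        SimpleNamespace(row_index=0, column_index=1, content="סכום"),
        SimpleNamespace(row_index=1, column_index=0, content="01/02/2023"),
        SimpleNamespace(row_index=1, column_index=1, content="1200"),
    ],
)
demo_rows = table_to_rows(demo_table)
```

The first row of each result normally holds the column headers, so you can check there whether the Hebrew headers were recognized as expected.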
Here's a sample workflow for the preprocessing approach:
- Use an OCR tool such as Tesseract or Google Cloud Vision OCR to extract the text from the scanned documents.
- Process the extracted text to identify the table structure, including the column headers, using techniques such as pattern matching or regular expressions.
- Use the identified structure (including column headers) to organize the table data. If you also send the original document to Form Recognizer, you can use the same structure to check or relabel the tables it returns. The service takes the document file itself as input, so the extracted text and structure are applied on your side; they are not passed in as extraction settings.
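The preprocessing steps above can be sketched in Python as follows. This is only an illustration under some assumptions: the OCR step uses the `pytesseract` and `Pillow` packages with Tesseract's Hebrew language data (`heb`) installed, columns in the OCR output are separated by tabs or at least two spaces, and the header names are hypothetical examples ("date", "description", "amount"). Real OCR output for right-to-left tables may need extra handling of column order.

```python
import re

# Hypothetical column headers expected in the table: date, description, amount.
EXPECTED_HEADERS = ("תאריך", "תיאור", "סכום")

# Assumption: OCR output separates columns with tabs or at least two spaces.
COLUMN_SPLIT = re.compile(r"\t+|\s{2,}")


def ocr_hebrew(image_path):
    """Run Tesseract on a scanned page and return the recognized Hebrew text."""
    # Imported here so the parsing helpers work without the OCR packages installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang="heb")


def split_columns(line):
    """Split one line of OCR text into trimmed, non-empty cells."""
    return [cell.strip() for cell in COLUMN_SPLIT.split(line.strip()) if cell.strip()]


def extract_table(ocr_text, expected_headers=EXPECTED_HEADERS):
    """Find the header row, then map each following row to a header -> value dict.

    Reading stops at the first line whose cell count differs from the header row.
    Returns (headers, rows), or ([], []) when no header row is found.
    """
    lines = ocr_text.splitlines()
    for index, line in enumerate(lines):
        cells = split_columns(line)
        if set(expected_headers) <= set(cells):
            rows = []
            for row_line in lines[index + 1:]:
                row_cells = split_columns(row_line)
                if len(row_cells) != len(cells):
                    break
                rows.append(dict(zip(cells, row_cells)))
            return cells, rows
    return [], []


# Made-up OCR text standing in for the output of ocr_hebrew("scan.png").
sample_text = "\n".join([
    "חשבונית מס",
    "תאריך  תיאור  סכום",
    "01/02/2023  ייעוץ  1200",
    "15/02/2023  פיתוח  3400",
    "סך הכל 4600",
])
headers, rows = extract_table(sample_text)
```

Matching on a known set of headers keeps the sketch simple. If your documents vary more, you would likely need looser matching, for example tolerating OCR misreads in the header names.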
Remember that preprocessing the documents requires additional development effort, but it gives you more control over the extraction process, including handling Hebrew text recognition.
Evaluate both approaches against your specific requirements and available resources, and choose the one that best suits your needs.