What improvements has Microsoft made to the OCR engine in Document Intelligence?
I've read the document at this link: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-custom?view=doc-intel-4.0.0#build-mode
I'm particularly curious about the improvements to the OCR engine. Could someone provide a detailed explanation of how Microsoft has enhanced the OCR engine in this Document Intelligence product?
Azure AI Document Intelligence
-
navba-MSFT 24,795 Reputation points • Microsoft Employee
2024-09-09T02:36:02.14+00:00 @Jessie Chen Welcome to the Microsoft Q&A forum, and thank you for posting your query here!
.
Microsoft has made several significant improvements to the OCR engine in Azure AI Document Intelligence:
- Higher Resolution Scanning: The OCR model now runs at a higher resolution, which enhances the handling of smaller and denser text. More info here.
- Enhanced Text Extraction: The OCR engine has been optimized for better text extraction from various document types, including dense forms and lower-resolution scanned documents. More info here.
- Support for More File Types: The updated OCR model can now extract text from a wider range of file types, including Microsoft Word, Excel, PowerPoint, and HTML documents, in addition to PDFs and images. More info here.
- Improved Paragraph Detection: The OCR capabilities now include advanced paragraph detection, which helps in accurately extracting and organizing text. More info here.
- Advanced Scenarios: The OCR engine supports advanced scenarios like single character boxes and accurate extraction of key fields commonly found in invoices, receipts, and other prebuilt scenarios. More info here.
.
These enhancements make the OCR engine more robust and versatile, improving its performance across a variety of document types and use cases.
.
Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.
Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.
-
Jessie Chen 50 Reputation points
2024-09-09T09:12:12.33+00:00 @navba-MSFT Could you please provide more details about the OCR model? Is it a fine-tuned LLM specifically for document ingestion, or is it a library built on top of Tesseract?
-
navba-MSFT 24,795 Reputation points • Microsoft Employee
2024-09-10T04:46:29.61+00:00
- Could you please share more details about your use case?
- What are you trying to accomplish with OCR?
- Is it not working in your scenario?
- Are you asking about the internals of OCR, that is, how it is built, developed, and trained?
- Could you please be more specific about which details you need about OCR?
-
Jessie Chen 50 Reputation points
2024-09-22T15:35:07.87+00:00 @navba-MSFT I would like to understand the technology Microsoft is using in detail to help me decide whether it is suitable for my use case. My goal is to extract all field names and values from invoices, which come in different formats. After extracting the data, I plan to convert it into JSON format. Could you please explain how the OCR is trained internally? Additionally, does Microsoft provide the internals of their OCR system as open source? Lastly, could you clarify which OCR package Microsoft Document Intelligence uses, such as pytesseract or another solution?
-
navba-MSFT 24,795 Reputation points • Microsoft Employee
2024-09-23T06:24:16.2566667+00:00 @Jessie Chen Thanks for getting back. The OCR (Optical Character Recognition) capabilities in Azure AI Document Intelligence are built on advanced machine learning models. The Read model is the core OCR engine used for extracting text from documents. This model is trained on a vast dataset of documents to recognize and extract text accurately, including handwritten and printed text.
.
The training process involves:
- Data Collection: Gathering a diverse set of documents, including various formats and languages.
- Preprocessing: Enhancing the quality of the documents, such as noise reduction and binarization.
- Model Training: Using deep learning techniques to train the model on labeled data, where the text regions are annotated.
- Evaluation and Tuning: Continuously evaluating the model’s performance and fine-tuning it to improve accuracy.
.
Microsoft does not provide the internals of their OCR system as open source. Azure AI Document Intelligence does not use open-source OCR packages like pytesseract. Instead, it utilizes Microsoft’s proprietary OCR technology, which is part of the broader Azure AI services. This technology is optimized for high accuracy and performance, capable of handling complex document layouts and various languages.
.
Suitability for Your Use Case
Given your goal to extract field names and values from invoices and convert them into JSON format, Azure AI Document Intelligence is well-suited for this task. It offers prebuilt models specifically designed for invoices, which can automatically identify and extract key fields such as invoice number, date, total amount, and more. You can also create custom models tailored to your specific document formats if needed.
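As a rough illustration of the "fields to JSON" step, here is a minimal Python sketch. It assumes you already have the analysis result of the prebuilt invoice model as a plain dictionary, with a `documents` list whose entries carry a `fields` mapping, and that each field exposes `content` and `confidence`. The helper name `flatten_invoice_fields` and the sample data are hypothetical and hand-written, not real service output, so please check the shape against the response you actually receive.

```python
import json


def flatten_invoice_fields(analyze_result: dict, min_confidence: float = 0.0) -> dict:
    """Collapse the fields of the first analyzed document into {field_name: text}.

    Assumes a result shaped like {"documents": [{"fields": {...}}]}, where each
    field is a dict with "content" and "confidence" keys (an assumption to verify).
    """
    documents = analyze_result.get("documents") or []
    if not documents:
        return {}
    flat = {}
    for name, field in documents[0].get("fields", {}).items():
        # Skip fields the model was not confident about.
        if field.get("confidence", 0.0) < min_confidence:
            continue
        flat[name] = field.get("content")
    return flat


# Hypothetical sample in the assumed response shape (not real service output).
sample_result = {
    "documents": [
        {
            "docType": "invoice",
            "fields": {
                "InvoiceId": {"type": "string", "content": "INV-100", "confidence": 0.97},
                "InvoiceDate": {"type": "date", "content": "2024-09-01", "confidence": 0.95},
                "InvoiceTotal": {"type": "currency", "content": "$110.00", "confidence": 0.42},
            },
        }
    ]
}

# Keep only fields with confidence of at least 0.5 and serialize them as JSON.
print(json.dumps(flatten_invoice_fields(sample_result, min_confidence=0.5), indent=2))
```

The confidence threshold is only one possible design choice; you may prefer to keep every field and store the confidence next to the value so that low-confidence entries can be reviewed later.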
-
Jessie Chen 50 Reputation points
2024-09-23T14:48:57.07+00:00 @navba-MSFT I've tried the prebuilt invoices model provided in Document Intelligence Studio, but the data fields it extracted aren't what I need. How can I configure the invoice model to extract the specific fields I want?
-
navba-MSFT 24,795 Reputation points • Microsoft Employee
2024-09-24T03:58:54.6733333+00:00 @Jessie Chen Thanks for your reply. To extract the specific custom fields, you need to rely on the custom extraction model.
To create a custom extraction model, label a dataset of documents with the values you want extracted and train the model on the labeled dataset. You only need five examples of the same form or document type to get started.
Please refer to the "Build and train a custom extraction model" article.
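Before starting a training run, it can help to confirm that the dataset meets the five-example minimum mentioned above. The following is a small sketch, not an official tool: it assumes each labeled document sits next to a companion file named `<document>.labels.json` (the naming used by labeling exports as far as I know; please verify it against your own training folder). The function names and file names are hypothetical.

```python
def labeled_documents(filenames, label_suffix=".labels.json"):
    """Return the documents that have a companion label file.

    Assumes a label file is named "<document name>" + label_suffix.
    """
    names = set(filenames)
    return sorted(
        name
        for name in names
        if not name.endswith(".json") and name + label_suffix in names
    )


def ready_to_train(filenames, minimum=5):
    """True when at least `minimum` documents are labeled."""
    return len(labeled_documents(filenames)) >= minimum


# Hypothetical folder listing: five labeled invoices and one unlabeled invoice.
sample_files = [f"invoice{i}.pdf" for i in range(1, 7)]
sample_files += [f"invoice{i}.pdf.labels.json" for i in range(1, 6)]

print(labeled_documents(sample_files))
print(ready_to_train(sample_files))
```

In practice you would pass in the file names from your training container or local folder instead of `sample_files`.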
.
Hope this helps.