Can you explain to me about Azure AI Doc Intel Layout accuracy when evaluating PDFs with actual digital characters?

Question

Hello,

So I know that Azure AI Doc Intel uses OCR in conjunction with machine learning. When it is evaluating a PDF with actual ASCII or Unicode characters in the file how could it possibly get those values (characters) wrong? Why is there a given accuracy for those values? Does it actually use Optical Character Recognition on chunks of text/numbers which are not images or handwriting? Please explain to me how this works in detail, as it is important for us to understand how these accuracy numbers might affect our results.

By the way, all of our inputs are digitally created PDFs, not scanned pages or from photos.

Thanks.

Accepted Answer

@Javier Cordova Welcome to the Microsoft Q&A forum, and thank you for posting your query here!


Please find the answers to your questions inline below:

Question: So I know that Azure AI Doc Intel uses OCR in conjunction with machine learning. When it is evaluating a PDF with actual ASCII or Unicode characters in the file how could it possibly get those values (characters) wrong? Why is there a given accuracy for those values?

Answer: Even when evaluating PDFs with actual ASCII or Unicode characters, errors can occur due to several factors:

Font and Formatting Variations: Different fonts, sizes, and formatting can affect OCR accuracy.
Document Quality: Low resolution or poor quality of the digital document can lead to misinterpretation.
Character Similarity: Characters that look similar (e.g., ‘1’ and ‘l’, ‘0’ and ‘O’) can be misrecognized.

These factors contribute to the accuracy score, which represents the model’s confidence in its predictions. With custom models, you can label your data and train the model to improve the accuracy and confidence scores for your requirements.
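
As an illustration of the character-similarity point, here is a minimal Python sketch that flags tokens mixing digits with the look-alike letters mentioned above ('l' for '1', 'O' for '0'). The function names and the sample string are hypothetical and not part of any Azure SDK; this is only a post-processing idea you could apply to extracted text, not how the service itself works.

```python
# Look-alike pairs named in the answer: 'l' vs '1' and 'O' vs '0'.
# Mapping goes from the letter to the digit it is easily confused with.
LOOKALIKE_LETTERS = {"l": "1", "O": "0"}


def suspicious_numeric_tokens(text):
    """Return whitespace-separated tokens that contain at least one digit
    and at least one letter that resembles a digit."""
    flagged = []
    for token in text.split():
        has_digit = any(ch.isdigit() for ch in token)
        has_lookalike = any(ch in LOOKALIKE_LETTERS for ch in token)
        if has_digit and has_lookalike:
            flagged.append(token)
    return flagged


def normalize_numeric_token(token):
    """Replace look-alike letters with the digits they resemble.
    Only sensible for fields you already know should be numeric."""
    return "".join(LOOKALIKE_LETTERS.get(ch, ch) for ch in token)


# Hypothetical extracted text, made up for this example.
extracted = "Invoice 1O45 total l20.50 paid"
for token in suspicious_numeric_tokens(extracted):
    print(token, "->", normalize_numeric_token(token))
```

A check like this is best used to route values to manual review rather than to silently rewrite them, since a token such as a product code may legitimately mix letters and digits.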

For more information on this, please see here.


Question: Does it actually use Optical Character Recognition on chunks of text/numbers which are not images or handwriting? Please explain to me how this works in detail, as it is important for us to understand how these accuracy numbers might affect our results.

Answer: Yes, Azure AI Document Intelligence uses OCR on all text, including digital text in PDFs. Here’s how it works:

Text Extraction: The OCR engine scans the document and extracts text, regardless of whether it is an image or digital text.
Text Recognition: It recognizes and converts the text into machine-readable format.
Machine Learning: The extracted text is then processed using machine learning models to understand the context and structure.

This process ensures that all text, whether digital or from images, is accurately recognized and processed. However, the accuracy can still be affected by the factors mentioned earlier.
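
To see how these confidence values might affect your results, you can inspect the per-word confidence in the analysis output. The sketch below assumes the result has been parsed into a Python dict with `pages`, `words`, `content`, `confidence`, and `pageNumber` keys, following the JSON shape of the service's analyze result; the helper name, the 0.9 threshold, and the sample data are hypothetical choices for illustration.

```python
def low_confidence_words(analyze_result, threshold=0.9):
    """Collect (page number, word text, confidence) for every word whose
    confidence is below the threshold.

    `analyze_result` is assumed to be the parsed JSON analyze result,
    i.e. a dict with a "pages" list whose entries hold a "words" list.
    """
    flagged = []
    for page in analyze_result.get("pages", []):
        for word in page.get("words", []):
            confidence = word.get("confidence")
            if confidence is not None and confidence < threshold:
                flagged.append(
                    (page.get("pageNumber"), word.get("content"), confidence)
                )
    return flagged


# Hypothetical sample data, invented for this example (not real service output).
sample_result = {
    "pages": [
        {
            "pageNumber": 1,
            "words": [
                {"content": "Total", "confidence": 0.99},
                {"content": "l20.50", "confidence": 0.62},
            ],
        }
    ]
}

for page_number, text, confidence in low_confidence_words(sample_result):
    print(f"page {page_number}: {text!r} (confidence {confidence})")
```

Running a check like this over a representative batch of your digitally created PDFs is one way to find out how often low-confidence words actually occur in your inputs before deciding on a review threshold.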

For more information on this, please see here.


Question: Can you explain to me about Azure AI Doc Intel Layout accuracy when evaluating PDFs with actual digital characters?

Answer: The layout accuracy in Azure AI Document Intelligence refers to how well the model can understand and preserve the structure of the document, including text, tables, and other elements. For PDFs with digital characters, the layout accuracy is generally high because:

Consistent Formatting: Digital PDFs usually have consistent formatting, making it easier for the model to interpret.
Clear Structure: The clear structure of digital documents helps in accurately identifying and extracting elements.

However, complex layouts or unusual formatting can still pose challenges, and the accuracy score reflects the model’s confidence in correctly interpreting the document’s layout.
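
One way to sanity-check layout extraction on your own documents is to rebuild each extracted table as a grid and look for gaps. This is a minimal sketch that assumes each table in the parsed result is a dict with `rowCount`, `columnCount`, and a `cells` list whose entries carry `rowIndex`, `columnIndex`, and `content`; the helper names and the sample table are hypothetical.

```python
def table_to_grid(table):
    """Rebuild a rows x columns grid from a table's flat cell list.
    Positions with no reported cell stay None."""
    grid = [[None] * table["columnCount"] for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell.get("content", "")
    return grid


def missing_cells(grid):
    """Return (row, column) positions that no cell filled.
    Note: cells that span several rows or columns are reported once,
    so the positions they cover will also show up here."""
    return [
        (r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value is None
    ]


# Hypothetical 2x2 table with one cell missing, invented for this example.
sample_table = {
    "rowCount": 2,
    "columnCount": 2,
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "content": "Item"},
        {"rowIndex": 0, "columnIndex": 1, "content": "Price"},
        {"rowIndex": 1, "columnIndex": 0, "content": "Widget"},
    ],
}

grid = table_to_grid(sample_table)
print(grid)
print("unfilled positions:", missing_cells(grid))
```

Comparing the rebuilt grid against a few tables you have checked by hand gives a concrete picture of how complex layouts in your PDFs are being interpreted.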


Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.


Please do not forget to “Accept the answer” and “up-vote” wherever the information provided helps you; this can be beneficial to other community members.
