How to read PDF, JPEG, and GIF documents using Tesseract OCR with C# ASP.NET MVC

Question

coder rock 436

Currently we have one page that contains a file-upload control and a submit button. The user selects a PDF, JPEG, or GIF file and submits it.

Our requirement is to read the content of the uploaded PDF, JPEG, or GIF file with maximum accuracy and insert the data into a SQL table. The file content and our requirement are the same as in the link below:

https://nanonets.com/blog/how-to-ocr-purchase-orders-for-automation/#digitising-purchase-orders

I don't know how to start or how to achieve this. Please suggest an approach and give us some sample code.

Answer 1

Bruce (SqlWork.com) 78,006 Volunteer Moderator

You need to pick an OCR vendor or API. You send the documents to them and they return the formatted data.

Your sample vendor has a REST API. You would probably use HttpClient to call it.
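As a rough sketch of what such an HttpClient call could look like, the snippet below builds a multipart/form-data POST that carries the uploaded file. The endpoint URL, the form field name "file", and the Bearer authentication scheme are all assumptions made for illustration; the real values have to come from the chosen vendor's API documentation.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;

public static class OcrRequestBuilder
{
    // Hypothetical endpoint; replace it with the URL from your vendor's documentation.
    public const string EndpointUrl = "https://ocr.example.com/v1/extract";

    // Builds a multipart/form-data POST request that carries the uploaded file.
    // The field name "file" and the Bearer scheme are assumptions, not a real vendor contract.
    public static HttpRequestMessage Build(byte[] fileBytes, string fileName, string apiKey)
    {
        var fileContent = new ByteArrayContent(fileBytes);
        fileContent.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");

        var form = new MultipartFormDataContent();
        form.Add(fileContent, "file", fileName);

        var request = new HttpRequestMessage(HttpMethod.Post, EndpointUrl) { Content = form };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);
        return request;
    }
}
```

The request would then be sent with a shared HttpClient instance, for example `var response = await client.SendAsync(request);`, and the body returned by the vendor (usually JSON) read with `await response.Content.ReadAsStringAsync()`.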

coder rock 436

We know about the third-party OCR services, and we are already able to extract the text. Please check the code below:

    public ActionResult Index(HttpPostedFileBase postedFile)
    {
        if (postedFile != null)
        {
            // Save the upload under ~/Uploads, keeping only the file name part.
            string filePath = Server.MapPath("~/Uploads/" + Path.GetFileName(postedFile.FileName));
            postedFile.SaveAs(filePath);
            string extractText = this.ExtractTextFromImage(filePath);
            ViewBag.Message = extractText.Replace(Environment.NewLine, "<br />");
        }

        return View();
    }

    private string ExtractTextFromImage(string filePath)
    {
        // Folder that holds the Tesseract language data (eng.traineddata).
        string path = Server.MapPath("~/tessdata");
        using (TesseractEngine engine = new TesseractEngine(path, "eng", EngineMode.Default))
        {
            using (Pix pix = Pix.LoadFromFile(filePath))
            {
                using (Tesseract.Page page = engine.Process(pix))
                {
                    return page.GetText();
                }
            }
        }
    }

How can we achieve the requirement below and insert the data into a SQL table? I am now able to extract the text. There is an order form that contains an order number and other details, which need to be inserted as accurately as possible into a SQL table as order details. The order number is the unique key.

https://nanonets.com/blog/how-to-ocr-purchase-orders-for-automation/#digitising-purchase-orders

Bruce (SqlWork.com) 78,006 Reputation points Volunteer Moderator

2023-11-12T18:39:05.68+00:00

The result should be key-value pairs with a score. You map each pair to a database column name based on the key name. The score tells you how reliable the match was. Only you can decide what score you will accept.

Typically you will need to augment the OCR with an app that displays the image and allows the user to correct the imported data.
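A minimal sketch of that mapping step, under stated assumptions: the `ExtractedField` and `FieldMapper` types, the column names, and a score scale from 0.0 to 1.0 are hypothetical and only illustrate the idea. Fields with an unknown key or a score below the chosen threshold are set aside for manual review instead of being saved.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of one key-value pair returned by an OCR service.
public class ExtractedField
{
    public string Key { get; set; }
    public string Value { get; set; }
    public double Score { get; set; } // assumed to range from 0.0 to 1.0
}

public static class FieldMapper
{
    // Hypothetical mapping from OCR key names to database column names.
    private static readonly Dictionary<string, string> ColumnByKey =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "order number", "OrderNumber" },
            { "name", "CustomerName" },
            { "mobile", "Mobile" },
            { "address", "Address" }
        };

    // Puts fields with a known key and a high enough score into 'accepted'
    // (keyed by column name); everything else goes to 'needsReview'.
    public static void Map(IEnumerable<ExtractedField> fields, double minScore,
        IDictionary<string, string> accepted, IList<ExtractedField> needsReview)
    {
        foreach (var field in fields)
        {
            string column;
            if (ColumnByKey.TryGetValue(field.Key, out column) && field.Score >= minScore)
            {
                accepted[column] = field.Value;
            }
            else
            {
                needsReview.Add(field);
            }
        }
    }
}
```

The `needsReview` list is what a correction screen would show next to the scanned image, so the user can fix those values before they are written to the table.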

coder rock 436


The line below assigns the text extracted by OCR to the ViewBag:

    ViewBag.Message = extractText.Replace(Environment.NewLine, "<br />");

The text assigned to the ViewBag looks like this:

    ViewBag.Message = "Name raj mobile 90000000 address 5-848-7 india";
				
I don't understand how to turn the text paragraph in the ViewBag into key-value pairs, so that the three fields below can be sent to an edit form where the user checks, validates, and saves the data:

Name : raj
Mobile: 90000000
address: 5-848-7 india

If you could provide some code, it would help a lot in achieving my key-value pair requirement.

Answer 2

Hi @coder rock,

It would be best to provide your test pictures (without any private information) so that we can test the code against your data.

I see you've already solved it elsewhere, but I've added comments and some new code based on your needs. You can refer to the code below:

    @using (Html.BeginForm("Index", "Home", FormMethod.Post, new { enctype = "multipart/form-data" }))
    {
        <span>Select File:</span>
        <input type="file" name="postedFile" />
        <input type="submit" value="Upload" />
        <hr />
        <span>@ViewBag.Message</span>
    }

    // Keys to extract, and keys that only mark where a value ends.
    // Matching is case-insensitive, so OCR output such as "mobile" still matches "Mobile".
    private static readonly HashSet<string> _extractKeys = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Name", "Mobile", "Address" };
    private static readonly HashSet<string> _ignoredKeys = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Bill" };

    public ActionResult Index(HttpPostedFileBase postedFile)
    {
        if (postedFile != null)
        {
            string filePath = Server.MapPath("~/Uploads/" + Path.GetFileName(postedFile.FileName));
            postedFile.SaveAs(filePath);
            string extractText = this.ExtractTextFromImage(filePath);
            // Sample text: "logo Name raj mobile 9038874774 address 6-98 india bill auto generated"
            // Split on any whitespace (spaces, tabs, line breaks) and drop empty entries.
            var splitLine = extractText.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
            // Dictionary<TKey, TValue> is a generic collection that stores key-value pairs.
            var pairs = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

            // Traverse the words to find the keys.
            for (var i = 0; i < splitLine.Length; i++)
            {
                var candidateKey = splitLine[i];
                // Skip words that are not one of the keys to extract.
                if (!_extractKeys.Contains(candidateKey))
                {
                    continue;
                }
                // Collect the words after the key until the next key or ignored key.
                var value = "";
                for (var v = i + 1; v < splitLine.Length; v++)
                {
                    var candidateValuePart = splitLine[v];
                    if (_ignoredKeys.Contains(candidateValuePart) || _extractKeys.Contains(candidateValuePart))
                    {
                        // Resume the outer loop at the word that ended this value.
                        i = v - 1;
                        break;
                    }

                    value = value + candidateValuePart + " ";
                }
                // Use the indexer so that a repeated key overwrites instead of throwing.
                pairs[candidateKey] = value.Trim();
            }
            // Save each pair as a row (DB is the application's data context).
            foreach (var kv in pairs)
            {
                DB.Customers.Add(new Test()
                {
                    Key = kv.Key,
                    Value = kv.Value
                });
            }
            DB.SaveChanges();
        }

        return View();
    }

    private string ExtractTextFromImage(string filePath)
    {
        // Folder that holds the Tesseract language data (eng.traineddata).
        string path = Server.MapPath("~/tessdata");
        using (TesseractEngine engine = new TesseractEngine(path, "eng", EngineMode.Default))
        {
            using (Pix pix = Pix.LoadFromFile(filePath))
            {
                using (Tesseract.Page page = engine.Process(pix))
                {
                    return page.GetText();
                }
            }
        }
    }

Best regards,
Lan Huang


coder rock 436 Reputation points

2023-11-17T10:11:31.0166667+00:00

Can you suggest more sample data? What happens if the name is "John W. Smith"? I also need to handle more address samples, since the address format is not fixed.

coder rock 436 Reputation points

2024-06-14T20:03:04.34+00:00

Sorry for the late reply. I am looking for Named Entity Recognition (NER); the answer above does not work for all scenarios.
