How to read pdf,jpeg,gif document using tesseract OCR with C# asp.net MVC

Question

Currently we have one page and this page contains one file-upload and one submit button, then the user will select a file from file-upload and submit the pdf,jpeg,gif file.

Now my requirement is the user will upload pdf,jpeg,gif read the content of the file with maximum Accuracy and data need to be insert into sql table like below how the file content and our requirement is same like below link

https://nanonets.com/blog/how-to-ocr-purchase-orders-for-automation/#digitising-purchase-orders

I don't know how to start and achieve this. Please suggest and give us a sampling code.

Answer

You need to pick an OCR vendor or api. You send the documents to them and they return the formatted data.

you sample vendor has a rest api. You would probably use HttpClient to call.

Answer

Hi @coder rock,

It is best to provide your test pictures (privacy is not involved) so that we can test the code that satisfies you based on your data.

I see you've already solved it elsewhere, but I've added comments and some new code based on your needs. You can refer to the code below:

 @using (Html.BeginForm("Index", "Home", FormMethod.Post, new { enctype = "multipart/form-data" }))
    {
        Select File:
        
        
        
        @ViewBag.Message
    }

 private static readonly HashSet _extractKeys = new HashSet { "Name", "Mobile", "Address" };
        private static readonly HashSet _ignoredKeys = new HashSet { "Bill" };
        public ActionResult Index(HttpPostedFileBase postedFile)
        {
            if (postedFile != null)
            {
                string filePath = Server.MapPath("~/Uploads/" + Path.GetFileName(postedFile.FileName));
                postedFile.SaveAs(filePath);
                string extractText = this.ExtractTextFromImage(filePath);
                var text = extractText.Replace(Environment.NewLine, "
");
                //var a = "logo Name raj mobile 9038874774 address 6-98 india bill auto generated";
                //String.Split method uses space character as separator to separate strings
                var splitLine = text.Split(' ');
                //The Dictionary is a generic collection that stores key-value pairs in no particular order.
                //You can create the Dictionary < TKey, TValue > object by passing the type of keys and values it can store.
                var pairs = new Dictionary();

                //Traverse the array to get the Key
                for (var i = 0; i < splitLine.Length; i++)
                {
                    //Locate characters based on the length of the resulting string array
                    var candidateKey = splitLine[i];
                    //Check if _extractKeys contains candidateKey
                    if (!_extractKeys.Contains(candidateKey))
                    {
                        //If it is not included, continue execution.
                        continue;
                    }
                    //Traverse the array to get the Value
                    var value = "";      
                    for (var v = i + 1; v < splitLine.Length; v++)
                    {
                        var candidateValuePart = splitLine[v];
                        //Check if next field contains _ignoredKeys or _extractKeys
                        if (_ignoredKeys.Contains(candidateValuePart) || _extractKeys.Contains(candidateValuePart))
                        {
                            
                            i = v - 1;
                            break;
                        }

                        value = value + candidateValuePart + " ";
                    }
                    //Gets the dictionary key if the value contains a string in _extractKeys
                    pairs.Add(candidateKey, value.Trim());
                }
                foreach (var kv in pairs)
                {
                    DB.Customers.Add(new Test()
                    {
                        Key = kv.Key,
                        Value = kv.Value
                    });
                }
                DB.SaveChanges();

            }

            return View();
        }

        private string ExtractTextFromImage(string filePath)
        {
            string path = Server.MapPath("~/") + Path.DirectorySeparatorChar + "tessdata";
            using (TesseractEngine engine = new TesseractEngine(path, "eng", EngineMode.Default))
            {
                using (Pix pix = Pix.LoadFromFile(filePath))
                {
                    using (Tesseract.Page page = engine.Process(pix))
                    {
                        return page.GetText();
                    }
                }
            }
        }

Best regards,
Lan Huang

If the answer is the right solution, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment".
Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

Share via

How to read pdf,jpeg,gif document using tesseract OCR with C# asp.net MVC

2 answers

Your answer