How to read PDF, JPEG, and GIF documents using Tesseract OCR with C# ASP.NET MVC

coder rock 296 Reputation points
2023-11-11T21:00:28.0866667+00:00

We currently have a page with one file-upload control and one submit button. The user selects a PDF, JPEG, or GIF file and submits it.

My requirement is that when the user uploads a PDF, JPEG, or GIF file, we read its content with maximum accuracy and insert the extracted data into a SQL table. The file content and our requirement are the same as in the link below:

https://nanonets.com/blog/how-to-ocr-purchase-orders-for-automation/#digitising-purchase-orders

I don't know how to start or how to achieve this. Please suggest an approach and give us some sample code.

ASP.NET
SQL Server
C#

2 answers

  1. Bruce (SqlWork.com) 64,396 Reputation points
    2023-11-12T01:35:51.16+00:00

    You need to pick an OCR vendor or API. You send the documents to them, and they return the formatted data.

    Your sample vendor has a REST API. You would probably use HttpClient to call it.
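
    As a rough illustration of that HttpClient call, here is a minimal sketch that uploads a file as multipart/form-data. The endpoint URL, the form field name `file`, and the `OcrApiClient` class are all hypothetical placeholders, not any vendor's actual API; authentication headers are also left out, so check your vendor's documentation for the real values.

    ```csharp
    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Threading.Tasks;

    public static class OcrApiClient
    {
        // Hypothetical endpoint: replace with the URL from your vendor's documentation.
        private const string OcrEndpoint = "https://ocr.example.com/v1/extract";

        // Reuse one HttpClient for the lifetime of the application.
        private static readonly HttpClient Http = new HttpClient();

        // Builds a multipart/form-data POST carrying the uploaded file.
        // The form field name "file" is an assumption; vendors differ.
        public static HttpRequestMessage BuildRequest(byte[] fileBytes, string fileName, string mediaType)
        {
            var filePart = new ByteArrayContent(fileBytes);
            filePart.Headers.ContentType = new MediaTypeHeaderValue(mediaType);

            var form = new MultipartFormDataContent();
            form.Add(filePart, "file", fileName);

            return new HttpRequestMessage(HttpMethod.Post, OcrEndpoint) { Content = form };
        }

        // Sends the request and returns the raw response body (typically JSON).
        public static async Task<string> ExtractAsync(byte[] fileBytes, string fileName, string mediaType)
        {
            using (var request = BuildRequest(fileBytes, fileName, mediaType))
            using (var response = await Http.SendAsync(request))
            {
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
        }
    }
    ```

    Building the request separately from sending it keeps the upload logic easy to check without a network call. The response body would still need to be parsed into your own model before inserting rows into SQL Server.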


  2. Lan Huang-MSFT 29,246 Reputation points Microsoft Vendor
    2023-11-17T09:01:15.5433333+00:00

    Hi @coder rock,

    It would help if you shared some test images (without any private data) so that we can test the code against your actual data.

    I see you've already solved it elsewhere, but I've added comments and some new code based on your needs. You can refer to the code below:

        @using (Html.BeginForm("Index", "Home", FormMethod.Post, new { enctype = "multipart/form-data" }))
        {
            <span>Select File:</span>
            <input type="file" name="postedFile" />
            <input type="submit" value="Upload" />
            <hr />
            <span>@ViewBag.Message</span>
        }
    
        // Requires: using System; using System.Collections.Generic; using System.IO;
        //           using System.Web; using System.Web.Mvc; using Tesseract;

        // Field labels whose values should be captured.
        // Matching ignores case because OCR output often varies in capitalization.
        private static readonly HashSet<string> _extractKeys =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Name", "Mobile", "Address" };
        // Labels that end the current value but are not stored themselves.
        private static readonly HashSet<string> _ignoredKeys =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Bill" };

        public ActionResult Index(HttpPostedFileBase postedFile)
        {
            if (postedFile != null && postedFile.ContentLength > 0)
            {
                // Path.GetFileName strips any directory part sent by the browser.
                string filePath = Server.MapPath("~/Uploads/" + Path.GetFileName(postedFile.FileName));
                postedFile.SaveAs(filePath);
                string extractText = this.ExtractTextFromImage(filePath);
                // Sample input: "logo Name raj mobile 9038874774 address 6-98 india bill auto generated"
                // Split on any whitespace (spaces, tabs, line breaks) and drop empty tokens,
                // so a label at the start of a new line is still recognized.
                var splitLine = extractText.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
                // Maps each recognized label to the text that follows it.
                var pairs = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

                // Walk the tokens looking for a label listed in _extractKeys.
                for (var i = 0; i < splitLine.Length; i++)
                {
                    var candidateKey = splitLine[i];
                    if (!_extractKeys.Contains(candidateKey))
                    {
                        // Not a label we care about; move on to the next token.
                        continue;
                    }
                    // Collect the tokens after the label until the next label or ignored key.
                    var value = "";
                    for (var v = i + 1; v < splitLine.Length; v++)
                    {
                        var candidateValuePart = splitLine[v];
                        if (_ignoredKeys.Contains(candidateValuePart) || _extractKeys.Contains(candidateValuePart))
                        {
                            // Resume the outer loop at this token (i++ lands on index v).
                            i = v - 1;
                            break;
                        }

                        value = value + candidateValuePart + " ";
                    }
                    // Use the indexer rather than Add so a repeated label overwrites
                    // the earlier value instead of throwing ArgumentException.
                    pairs[candidateKey] = value.Trim();
                }
                // Store one row per extracted field; DB is your Entity Framework context.
                foreach (var kv in pairs)
                {
                    DB.Customers.Add(new Test()
                    {
                        Key = kv.Key,
                        Value = kv.Value
                    });
                }
                DB.SaveChanges();
            }

            return View();
        }
    
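        private string ExtractTextFromImage(string filePath)
        {
            // The language data (eng.traineddata) is expected in the site's ~/tessdata folder.
            string path = Server.MapPath("~/tessdata");
            // Pix.LoadFromFile reads image formats only. A PDF has to be rendered
            // to page images with a separate library before it can be processed here.
            using (TesseractEngine engine = new TesseractEngine(path, "eng", EngineMode.Default))
            using (Pix pix = Pix.LoadFromFile(filePath))
            using (Tesseract.Page page = engine.Process(pix))
            {
                return page.GetText();
            }
        }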
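
    The label/value parsing in the action above can also be pulled out into a pure function, which makes it testable without an uploaded file or a database. The sketch below is one possible way to do that; the `OcrFieldParser` class and its `Parse` method are hypothetical names introduced for illustration, and it assumes the OCR text is a flat sequence of whitespace-separated tokens in which each label is a single word.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class OcrFieldParser
    {
        // Returns label -> value pairs. A value runs from its label up to the
        // next label in extractKeys or the next word in stopKeys.
        public static Dictionary<string, string> Parse(
            string ocrText, ISet<string> extractKeys, ISet<string> stopKeys)
        {
            // Passing a null separator array splits on any whitespace.
            var tokens = ocrText.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            var pairs = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            string currentKey = null;
            var valueParts = new List<string>();

            foreach (var token in tokens)
            {
                bool isKey = extractKeys.Contains(token);
                if (isKey || stopKeys.Contains(token))
                {
                    // A boundary token closes the value collected so far.
                    if (currentKey != null)
                    {
                        pairs[currentKey] = string.Join(" ", valueParts);
                    }
                    currentKey = isKey ? token : null;
                    valueParts.Clear();
                }
                else if (currentKey != null)
                {
                    valueParts.Add(token);
                }
            }

            // Flush the last value if the text ended while one was being collected.
            if (currentKey != null)
            {
                pairs[currentKey] = string.Join(" ", valueParts);
            }
            return pairs;
        }
    }
    ```

    Called with case-insensitive sets for { "Name", "Mobile", "Address" } and { "Bill" } and the sample text "logo Name raj mobile 9038874774 address 6-98 india bill auto generated", this returns Name = "raj", Mobile = "9038874774", and Address = "6-98 india". Real purchase orders usually have multi-word labels and tables, so treat this only as a starting point.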
    
    

    Best regards,
    Lan Huang



