How to read PDF, JPEG, and GIF documents using Tesseract OCR with C# MVC

coder rock 156 Reputation points

Currently we have a page with one file-upload control and one submit button. The user selects a PDF, JPEG, or GIF file and submits it.

My requirement is to read the content of the uploaded PDF, JPEG, or GIF file with maximum accuracy and insert the extracted data into a SQL table. The file content and our requirement are the same as in the link below.

I don't know how to start or how to achieve this. Please suggest an approach and share some sample code.


2 answers

  1. Bruce 48,881 Reputation points

    You need to pick an OCR vendor or API. You send the documents to it, and it returns the formatted data.

    Your sample vendor has a REST API. You would probably use HttpClient to call it.
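
    A minimal sketch of the HttpClient approach, assuming the vendor accepts the document as a multipart/form-data upload. The `OcrApiClient` class, the `"file"` field name, and whatever endpoint URL you pass in are hypothetical; check the vendor's documentation for the real values and for any authentication header it needs.

    ```csharp
    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Threading.Tasks;

    public static class OcrApiClient
    {
        // One shared HttpClient for the lifetime of the application.
        private static readonly HttpClient Http = new HttpClient();

        // Builds a multipart/form-data POST that carries the uploaded file.
        // Assumption: the vendor expects the document in a form field named "file".
        public static HttpRequestMessage BuildRequest(Uri endpoint, byte[] fileBytes, string fileName, string contentType)
        {
            var filePart = new ByteArrayContent(fileBytes);
            filePart.Headers.ContentType = MediaTypeHeaderValue.Parse(contentType);

            var form = new MultipartFormDataContent();
            form.Add(filePart, "file", fileName);

            return new HttpRequestMessage(HttpMethod.Post, endpoint) { Content = form };
        }

        // Sends the file and returns the raw response body (usually JSON) for the caller to parse.
        public static async Task<string> RecognizeAsync(Uri endpoint, byte[] fileBytes, string fileName, string contentType)
        {
            using (var request = BuildRequest(endpoint, fileBytes, fileName, contentType))
            using (var response = await Http.SendAsync(request).ConfigureAwait(false))
            {
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync().ConfigureAwait(false);
            }
        }
    }
    ```

    In an MVC action you would read the bytes from `postedFile.InputStream`, call `RecognizeAsync`, and then map the vendor's response to your table columns.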

  2. Lan Huang-MSFT 19,911 Reputation points Microsoft Vendor

    Hi @coder rock,

    It would help if you shared some sample images (without any private data) so that we can test the code against your actual data.

    I see you've already solved it elsewhere, but I've added comments and some new code based on your needs. You can refer to the code below:

        @* View (Razor): the upload form posts the file to HomeController.Index *@
        @using (Html.BeginForm("Index", "Home", FormMethod.Post, new { enctype = "multipart/form-data" }))
        {
            <span>Select File:</span>
            <input type="file" name="postedFile" />
            <input type="submit" value="Upload" />
            <hr />
        }

        // Controller: HomeController (System.Web.Mvc).
        // Requires: using System; using System.Collections.Generic; using System.IO;
        //           using System.Web; using System.Web.Mvc; using Tesseract;
        private static readonly HashSet<string> _extractKeys =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Name", "Mobile", "Address" };
        private static readonly HashSet<string> _ignoredKeys =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Bill" };

        public ActionResult Index(HttpPostedFileBase postedFile)
        {
            if (postedFile != null)
            {
                string filePath = Server.MapPath("~/Uploads/" + Path.GetFileName(postedFile.FileName));
                // Save the upload to disk first; the OCR step reads it from filePath.
                postedFile.SaveAs(filePath);
                string extractText = this.ExtractTextFromImage(filePath);

                // Sample OCR text: "logo Name raj mobile 9038874774 address 6-98 india bill auto generated"
                // Split on spaces and line breaks so a key at the start of a line is still its own token.
                var splitLine = extractText.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

                // Dictionary<TKey, TValue> stores the extracted key-value pairs.
                var pairs = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

                // Walk the tokens looking for keys.
                for (var i = 0; i < splitLine.Length; i++)
                {
                    var candidateKey = splitLine[i];
                    // Skip tokens that are not one of the keys we want to extract.
                    if (!_extractKeys.Contains(candidateKey))
                    {
                        continue;
                    }

                    // Collect the tokens after the key as its value.
                    var value = "";
                    for (var v = i + 1; v < splitLine.Length; v++)
                    {
                        var candidateValuePart = splitLine[v];
                        // Stop when the next token is another key or an ignored key.
                        if (_ignoredKeys.Contains(candidateValuePart) || _extractKeys.Contains(candidateValuePart))
                        {
                            i = v - 1;
                            break;
                        }
                        value = value + candidateValuePart + " ";
                    }

                    // The indexer overwrites a repeated key instead of throwing like Add would.
                    pairs[candidateKey] = value.Trim();
                }

                foreach (var kv in pairs)
                {
                    DB.Customers.Add(new Test()
                    {
                        Key = kv.Key,
                        Value = kv.Value
                    });
                }
                // Assumption: DB is an Entity Framework DbContext, so the rows are written on SaveChanges().
                DB.SaveChanges();
            }
            return View();
        }

        private string ExtractTextFromImage(string filePath)
        {
            // Folder holding the Tesseract language data (eng.traineddata).
            string path = Server.MapPath("~/tessdata");
            // Pix.LoadFromFile reads image files such as JPEG or GIF; a PDF has to be converted to images first.
            using (TesseractEngine engine = new TesseractEngine(path, "eng", EngineMode.Default))
            using (Pix pix = Pix.LoadFromFile(filePath))
            using (Tesseract.Page page = engine.Process(pix))
            {
                return page.GetText();
            }
        }
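
    If you want to check the key/value parsing without running OCR or touching the database, the loop can be pulled out into a standalone method. This is only a sketch: the `KeyValueExtractor` class name is hypothetical, the key lists are the same ones used in the controller, and the case-insensitive matching is an assumption about how your OCR text looks.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class KeyValueExtractor
    {
        private static readonly HashSet<string> ExtractKeys =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Name", "Mobile", "Address" };
        private static readonly HashSet<string> IgnoredKeys =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Bill" };

        // Returns each known key together with the words that follow it,
        // up to the next known key or ignored key.
        public static Dictionary<string, string> Extract(string ocrText)
        {
            var pairs = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            var tokens = (ocrText ?? "").Split(
                new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

            for (var i = 0; i < tokens.Length; i++)
            {
                var key = tokens[i];
                if (!ExtractKeys.Contains(key))
                {
                    continue;
                }

                // Gather value words until another keyword or the end of the text.
                var parts = new List<string>();
                var v = i + 1;
                while (v < tokens.Length
                       && !ExtractKeys.Contains(tokens[v])
                       && !IgnoredKeys.Contains(tokens[v]))
                {
                    parts.Add(tokens[v]);
                    v++;
                }

                pairs[key] = string.Join(" ", parts);
                i = v - 1; // resume scanning at the keyword that ended this value
            }

            return pairs;
        }
    }
    ```

    For example, `KeyValueExtractor.Extract("logo Name raj mobile 9038874774 address 6-98 india bill auto generated")` gives three entries: Name = "raj", mobile = "9038874774", and address = "6-98 india".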

    Best regards,
    Lan Huang
