How to get/read specific word from pdf or image document in ASP.NET Application

Question

I have millions of image and PDF documents. I want to extract specific words from an uploaded image or PDF document and then use them in my project. This needs to run in an ASP.NET application, and any third-party tools involved must be open source. How can I read this data from an image or PDF document in an ASP.NET application?

Accepted Answer

Hi @Gani_tpt,

You can use the Tesseract OCR library to read or extract text from images, and the iTextSharp library to extract text from PDFs.

iTextSharp

https://www.nuget.org/packages/iTextSharp

Tesseract

https://www.nuget.org/packages/Tesseract

Downloading and configuring Tesseract Data Files

You will need to download the Tesseract Data files from the following link.

https://github.com/tesseract-ocr/tessdata

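Then copy the downloaded folder to the project root and rename it to tessdata. The example below loads the English model ("eng"), so the folder must contain eng.traineddata.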
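
A missing or misplaced data file is a common reason the OCR engine fails to start, so it can help to check for the file before creating the engine. The following is a minimal sketch, not part of the original answer: the class name TessdataCheck and the method HasLanguageData are hypothetical helpers, and it only assumes the usual layout of one <language>.traineddata file per language inside the tessdata folder.

```csharp
using System;
using System.IO;

public static class TessdataCheck
{
    // Hypothetical helper: returns true when <tessdataDir>/<language>.traineddata exists.
    // "language" defaults to "eng", the language code used in this answer.
    public static bool HasLanguageData(string tessdataDir, string language = "eng")
    {
        if (string.IsNullOrEmpty(tessdataDir) || string.IsNullOrEmpty(language))
        {
            return false;
        }

        string modelPath = Path.Combine(tessdataDir, language + ".traineddata");
        return File.Exists(modelPath);
    }
}
```

In a Web Forms page this could be called as TessdataCheck.HasLanguageData(Server.MapPath("~/tessdata")) before the engine is constructed, so a clear message can be shown instead of an engine initialization error.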

Simple example:

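The page contains a FileUpload control (FileUpload1) labelled "Select File:", a button whose click handler is OnUpload, and a label (lblText) that displays the extracted text. The code-behind: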

using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;
using System;
using System.Text;
using Tesseract;
namespace WebFormDemo
{
    public partial class WebForm21 : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
        }
        protected void OnUpload(object sender, EventArgs e)
        {
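            // Nothing to do if the user did not pick a file.
            if (!FileUpload1.HasFile)
            {
                lblText.Text = "Please select a file.";
                return;
            }
            // GetFileName drops any client-supplied directory part before saving.
            string filePath = Server.MapPath("~/Uploads/" + System.IO.Path.GetFileName(FileUpload1.PostedFile.FileName));
            string ext = System.IO.Path.GetExtension(filePath);
            FileUpload1.SaveAs(filePath);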
            string extractText = "";
            if (ext.ToLower() == ".pdf")
            {
                extractText = this.ExtractTextFromPdf(filePath);
            }
            else if (ext.ToLower() == ".jpg" || ext.ToLower() == ".png" || ext.ToLower() == ".jpeg")
            {
                extractText = this.ExtractTextFromImage(filePath);
            }
            else
            {
                extractText = "Only images and pdf are allowed";
            }
            lblText.Text = extractText;
        }
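        private string ExtractTextFromPdf(string path)
        {
            using (PdfReader reader = new PdfReader(path))
            {
                StringBuilder text = new StringBuilder();
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    // Use a fresh strategy per page; a reused LocationTextExtractionStrategy
                    // keeps the chunks of earlier pages and would return them again.
                    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                    text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
                }
                // Return the text of all pages, not only the last one read.
                return text.ToString();
            }
        }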
        private string ExtractTextFromImage(string filePath)
        {
            // Map the tessdata folder directly; this avoids a doubled path separator.
            string path = Server.MapPath("~/tessdata");
            using (TesseractEngine engine = new TesseractEngine(path, "eng", EngineMode.Default))
            {
                using (Pix pix = Pix.LoadFromFile(filePath))
                {
                    using (Tesseract.Page page = engine.Process(pix))
                    {
                        return page.GetText();
                    }
                }
            }
        }
    }
}
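
The methods above return the full text of a document; the question asks for specific words. One way to pick a word out of the extracted text is sketched below. It is not part of the original answer: WordFinder and FindWithContext are hypothetical names, the matching is case-insensitive and whole-word only, and it assumes the search word starts and ends with a letter or digit so that the \b word boundaries behave as expected.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordFinder
{
    // Hypothetical helper: returns a short snippet around every whole-word,
    // case-insensitive occurrence of "word" in "text".
    // "context" is the number of characters kept on each side of the match.
    public static List<string> FindWithContext(string text, string word, int context = 20)
    {
        var results = new List<string>();
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(word))
        {
            return results;
        }

        // Regex.Escape treats the word literally; \b restricts matches to word
        // boundaries, so searching for "art" does not match inside "party".
        string pattern = @"\b" + Regex.Escape(word) + @"\b";
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
        {
            int start = Math.Max(0, m.Index - context);
            int end = Math.Min(text.Length, m.Index + m.Length + context);
            results.Add(text.Substring(start, end - start));
        }
        return results;
    }
}
```

For example, WordFinder.FindWithContext(extractText, "invoice") could be called in OnUpload after the extraction step. OCR output often contains misread characters, so exact matching like this may miss words that were recognized incorrectly.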

Best regards,
Lan Huang

Answer

Hi,

Whatever technology stack you use, a simple solution looks like this:

1. Pre-process your documents: extract the text from each image or PDF file and store it (id, image URL, extracted text) somewhere that supports searching for a given word.
2. Build your search API or application on top of the extracted text.

The most popular open-source OCR library you can use in .NET is Tesseract.
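
The two steps described in this answer can be sketched as a small in-memory word index. This is only an illustration under stated assumptions: DocumentRecord and WordIndex are hypothetical types, the text is split on \w+ tokens, and everything lives in memory. For millions of documents the same idea would more likely sit in a database full-text index or a dedicated search engine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical record mirroring the (id, image URL, extracted text) tuple from the answer.
public sealed class DocumentRecord
{
    public int Id { get; set; }
    public string Url { get; set; }
    public string Text { get; set; }
}

public sealed class WordIndex
{
    // Maps each word (case-insensitive) to the ids of the documents containing it.
    private readonly Dictionary<string, HashSet<int>> _postings =
        new Dictionary<string, HashSet<int>>(StringComparer.OrdinalIgnoreCase);

    private readonly Dictionary<int, DocumentRecord> _documents =
        new Dictionary<int, DocumentRecord>();

    // Step 1: store the record and index every word of its extracted text.
    // Re-adding an existing id does not remove its old words from the index.
    public void Add(DocumentRecord doc)
    {
        _documents[doc.Id] = doc;
        foreach (Match m in Regex.Matches(doc.Text ?? string.Empty, @"\w+"))
        {
            HashSet<int> ids;
            if (!_postings.TryGetValue(m.Value, out ids))
            {
                ids = new HashSet<int>();
                _postings[m.Value] = ids;
            }
            ids.Add(doc.Id);
        }
    }

    // Step 2: return every stored document that contains the given word.
    public IEnumerable<DocumentRecord> Search(string word)
    {
        HashSet<int> ids;
        if (string.IsNullOrEmpty(word) || !_postings.TryGetValue(word, out ids))
        {
            return Enumerable.Empty<DocumentRecord>();
        }
        return ids.Select(id => _documents[id]);
    }
}
```

The extraction step runs once per document when it is uploaded, so a search only touches the stored text and never has to repeat OCR.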
