How to read a specific word from a PDF or image document in an ASP.NET application

BeUnique 2,112 Reputation points
2024-05-26T03:09:00.3533333+00:00

I have millions of image and PDF documents. I want to extract specific words from an uploaded image or PDF document and then use them in my project. This needs to work in an ASP.NET application, and any third-party tools must be open source. How can I extract this data from an image or PDF document in an ASP.NET application?

Tags: ASP.NET, C#

Accepted answer
  1. Lan Huang-MSFT 29,321 Reputation points Microsoft Vendor
    2024-05-27T06:55:52.56+00:00

    Hi @Gani_tpt,

    You can use the Tesseract OCR library to read or extract text from images, and the iTextSharp library to extract text from PDFs.

    iTextSharp

    https://www.nuget.org/packages/iTextSharp

    Tesseract

    https://www.nuget.org/packages/Tesseract

    Downloading and configuring Tesseract Data Files

    Download the Tesseract data files from the following link:

    https://github.com/tesseract-ocr/tessdata

    Then copy the downloaded folder into the project root folder and rename it to tessdata.
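
    Before constructing the engine, it can help to confirm that the data folder really contains the file for the language you request. The sketch below is only an illustration: the class and method names (TessdataCheck, HasLanguageData) are hypothetical, and it assumes the language "eng" maps to a file named eng.traineddata inside the tessdata folder.

    ```csharp
    using System;
    using System.IO;

    public static class TessdataCheck
    {
        // Returns true when <tessdataDir>/<language>.traineddata exists.
        // Hypothetical helper, not part of the Tesseract package.
        public static bool HasLanguageData(string tessdataDir, string language)
        {
            if (string.IsNullOrEmpty(tessdataDir) || string.IsNullOrEmpty(language))
            {
                return false;
            }
            string file = Path.Combine(tessdataDir, language + ".traineddata");
            return File.Exists(file);
        }
    }
    ```

    In a Web Forms page this could be called as TessdataCheck.HasLanguageData(Server.MapPath("~/tessdata"), "eng") to show a clear error message instead of letting the engine fail on a missing file.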

    Simple example:

    <html xmlns="http://www.w3.org/1999/xhtml">
    <head runat="server">
        <title></title>
    </head>
    <body>
        <form id="form1" runat="server">
                Select File:
                <asp:FileUpload ID="FileUpload1" runat="server" />
                <asp:Button Text="Upload" runat="server" OnClick="OnUpload" />
                <hr />
                <asp:Label ID="lblText" runat="server" />
        </form>
    </body>
    </html>
    
    using iTextSharp.text.pdf.parser;
    using iTextSharp.text.pdf;
    using System;
    using System.Text;
    using Tesseract;
    namespace WebFormDemo
    {
        public partial class WebForm21 : System.Web.UI.Page
        {
            protected void Page_Load(object sender, EventArgs e)
            {
            }
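            protected void OnUpload(object sender, EventArgs e)
            {
                // Nothing to process if the form was posted without a file.
                if (!FileUpload1.HasFile)
                {
                    lblText.Text = "Please select a file.";
                    return;
                }
                // GetFileName strips any client-side directory from the name.
                // The ~/Uploads folder is assumed to exist already.
                string filePath = Server.MapPath("~/Uploads/" + System.IO.Path.GetFileName(FileUpload1.PostedFile.FileName));
                string ext = System.IO.Path.GetExtension(filePath).ToLowerInvariant();
                bool isPdf = ext == ".pdf";
                bool isImage = ext == ".jpg" || ext == ".jpeg" || ext == ".png";
                if (!isPdf && !isImage)
                {
                    lblText.Text = "Only images and pdf are allowed";
                    return;
                }
                // Save only after the extension check so other file types never reach disk.
                FileUpload1.SaveAs(filePath);
                string extractText = isPdf
                    ? this.ExtractTextFromPdf(filePath)
                    : this.ExtractTextFromImage(filePath);
                // Encode the result so extracted text is displayed literally, not as markup.
                lblText.Text = Server.HtmlEncode(extractText);
            }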
            private string ExtractTextFromPdf(string path)
            {
                using (PdfReader reader = new PdfReader(path))
                {
                    StringBuilder text = new StringBuilder();
                    // Page numbers are 1-based.
                    for (int i = 1; i <= reader.NumberOfPages; i++)
                    {
                        // Use a fresh strategy per page so text from earlier pages is not carried over.
                        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                        text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
                    }
                    // Return the text of all pages, not only the last one read.
                    return text.ToString();
                }
            }
            private string ExtractTextFromImage(string filePath)
            {
                // Physical path of the tessdata folder in the project root.
                string path = Server.MapPath("~/tessdata");
                using (TesseractEngine engine = new TesseractEngine(path, "eng", EngineMode.Default))
                {
                    using (Pix pix = Pix.LoadFromFile(filePath))
                    {
                        using (Tesseract.Page page = engine.Process(pix))
                        {
                            return page.GetText();
                        }
                    }
                }
            }
        }
    }
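
    The example above returns all of the extracted text, while the question asks for a specific word. A minimal sketch of that last step follows; the WordFinder class and FindWord method are hypothetical names introduced here for illustration, and it assumes a case-insensitive whole-word match is what is wanted.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public static class WordFinder
    {
        // Returns the zero-based character offsets of every whole-word,
        // case-insensitive occurrence of `word` in `text`.
        public static List<int> FindWord(string text, string word)
        {
            var positions = new List<int>();
            if (string.IsNullOrEmpty(text) || string.IsNullOrWhiteSpace(word))
            {
                return positions;
            }

            // Regex.Escape keeps characters such as '.' or '+' literal;
            // the lookarounds stop "art" from matching inside "cart".
            string pattern = @"(?<!\w)" + Regex.Escape(word) + @"(?!\w)";
            foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
            {
                positions.Add(m.Index);
            }
            return positions;
        }
    }
    ```

    For instance, WordFinder.FindWord(extractText, "invoice") could be called on the text returned by either extraction method, and an empty list would mean the word was not found. OCR output often contains misread characters, so an exact match may miss some words in scanned images.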
    

    Best regards,
    Lan Huang




1 additional answer

  1. Azizkhon Ishankhonov 435 Reputation points
    2024-05-26T15:37:44.27+00:00

    Hi

    Whatever technology stack you use, a simple solution looks like this:

    • Pre-process your documents: extract the text from each image or PDF file and store it (id, image URL, extracted text) somewhere you can search it by word.
    • Build your search API or application on top of the extracted text.
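
    The two steps above can be sketched as a small in-memory index. This is only an illustration: TextIndex, Add and Search are hypothetical names, the text is assumed to have been extracted already, and for millions of documents a database or search engine would replace the dictionary.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public sealed class TextIndex
    {
        // word (case-insensitive) -> ids of the documents that contain it
        private readonly Dictionary<string, HashSet<string>> _index =
            new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

        // Step 1: store the words of one document's extracted text.
        public void Add(string documentId, string extractedText)
        {
            foreach (Match m in Regex.Matches(extractedText ?? "", @"\w+"))
            {
                HashSet<string> ids;
                if (!_index.TryGetValue(m.Value, out ids))
                {
                    ids = new HashSet<string>();
                    _index[m.Value] = ids;
                }
                ids.Add(documentId);
            }
        }

        // Step 2: return the ids of the documents that contain the word.
        public List<string> Search(string word)
        {
            HashSet<string> ids;
            return _index.TryGetValue(word ?? "", out ids)
                ? new List<string>(ids)
                : new List<string>();
        }
    }
    ```

    Extraction runs once per document when it is uploaded, and each later search is then a dictionary lookup rather than a new OCR pass.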

    The most popular open-source OCR library that you can use in .NET is Tesseract.

