Is there a good NuGet package for extracting text from PDF files in ASP.NET Core?

chandra dev 1 Reputation point
2022-02-02T13:28:40.573+00:00

Hi All,

I am currently using the PDF Clown NuGet package to extract text from PDF files in an ASP.NET Core project.
https://pdfclown.org/

My requirement is to read the text from a PDF and write it to an Excel file. PDF Clown does almost everything, but it does not pick up the blank spaces in the PDF file.

170574-image.png

Could you please suggest an alternative NuGet package that meets this requirement?

ASP.NET Core
Blazor

2 answers

  1. Bruce (SqlWork.com) 56,021 Reputation points
    2022-02-15T18:29:25.627+00:00

    A PDF page is described by a PostScript-like language that draws text and images; the language is a simple stack machine. To help with parsing PDF, tag support was added to define the document's structure. In PostScript, % is the comment character, and %% is used to identify a structure tag.

    Sample PostScript hello world:

    %!PS
    /Palatino-Roman 20 selectfont
    300 400 moveto
    (Hello, World!) show
    showpage
    

    How well a PDF file can be parsed depends on how well the PostScript program was written, and in particular whether it followed the tag conventions the parser expects. Most likely, the table in your sample is a text array with only 2 rows of data.

    Note: PostScript supports arrays of arrays, so a text table should follow this structure. The data and the code that draws the borders are separate.
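
    Because text is drawn at coordinates, separately from any borders, one way a table could be rebuilt is to group text fragments into rows by their Y position and into columns by their X position, leaving a cell empty when nothing was drawn there. The following is a minimal sketch in C#, assuming a PDF library can report each piece of text together with its position. `TextFragment`, `TableRebuilder`, the column positions, and the row tolerance are hypothetical names and values, not part of any particular library:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical shape of one piece of positioned text, as a PDF library might report it.
    public record TextFragment(double X, double Y, string Text);

    public static class TableRebuilder
    {
        // Groups fragments into rows by Y, then places each fragment in the column
        // whose left edge is nearest to its X. Cells with no fragment stay empty.
        public static List<string[]> ToGrid(
            IEnumerable<TextFragment> fragments, double[] columnLefts, double rowTolerance = 2.0)
        {
            var rows = new List<(double Y, string[] Cells)>();

            // PDF Y coordinates grow upward, so sort descending to read top to bottom.
            foreach (var f in fragments.OrderByDescending(f => f.Y).ThenBy(f => f.X))
            {
                int r = rows.FindIndex(row => Math.Abs(row.Y - f.Y) <= rowTolerance);
                if (r < 0)
                {
                    // Start a new row with every cell blank.
                    rows.Add((f.Y, Enumerable.Repeat("", columnLefts.Length).ToArray()));
                    r = rows.Count - 1;
                }

                int c = NearestColumn(columnLefts, f.X);
                string[] cells = rows[r].Cells;
                cells[c] = cells[c].Length == 0 ? f.Text : cells[c] + " " + f.Text;
            }

            return rows.Select(row => row.Cells).ToList();
        }

        // Returns the index of the column whose left edge is closest to x.
        private static int NearestColumn(double[] columnLefts, double x)
        {
            int best = 0;
            for (int i = 1; i < columnLefts.Length; i++)
            {
                if (Math.Abs(columnLefts[i] - x) < Math.Abs(columnLefts[best] - x))
                {
                    best = i;
                }
            }
            return best;
        }
    }
    ```

    For example, with column left edges at 50, 200, and 350, a row that has fragments only at x = 50 and x = 350 comes back with an empty string in the middle cell rather than being collapsed to two values. Real documents would likely need the column positions to be detected or configured per layout.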


  2. winironteam 6 Reputation points
    2023-02-17T04:15:58.5833333+00:00

    You can use IronPDF to extract text from a PDF.

    
    

    Or extract text from a PDF file page by page:

    
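    using IronPdf;

    // Load the PDF; the using declaration disposes the document when it goes out of scope.
    using PdfDocument PDF = PdfDocument.FromFile("result.pdf");
    for (var index = 0; index < PDF.PageCount; index++)
    {
        int PageNumber = index + 1;                     // 1-based page number for display
        string Text = PDF.ExtractTextFromPage(index);   // page indexes are 0-based
    }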
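
    Since the goal in the question is to get the text into Excel, the extracted values could then be written out as CSV, which Excel opens directly. This is only a sketch under that assumption (CSV rather than a native .xlsx file); `CsvExport` is a hypothetical helper that uses only the .NET base library, and the rows passed to it would come from whatever extraction step is used:

    ```csharp
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    public static class CsvExport
    {
        // Quotes a field when it contains a comma, a double quote, or a line break,
        // and doubles any embedded double quotes.
        public static string Escape(string field)
        {
            bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
            return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
        }

        // Writes one CSV line per row; empty cells are kept as empty fields,
        // so blank values still occupy their column.
        public static void Write(TextWriter writer, IEnumerable<string[]> rows)
        {
            foreach (string[] row in rows)
            {
                writer.WriteLine(string.Join(",", row.Select(Escape)));
            }
        }
    }
    ```

    A caller could open a `StreamWriter` on a `.csv` path and pass it to `CsvExport.Write` with one row per page, for example the page number and that page's text. Multi-line page text is quoted, so it stays inside a single cell.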
    
    