U-SQL + Python PDF file parsing in Azure Data Lake Analytics

Question

Ibrahim S Iqbal 1

I need to extract data from PDF files and store the values in a table using Azure Data Lake Analytics. Can anyone help with some examples or a procedure for achieving this?

1 answer

Answer 1

Hi there,

Here are some resources for getting started with U-SQL in Azure Data Lake Analytics:

https://learn.microsoft.com/en-us/u-sql/

https://www.purplefrogsystems.com/paul/category/u-sql/

https://www.mssqltips.com/sqlservertip/5890/azure-data-lake-analytics-using-usql-queries/

For the scenario you described, you would have to write a custom extractor to read the PDF. Here is a C# example that uses iTextSharp and emits one row per page (page number, page text):

using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using Microsoft.Analytics.Interfaces;

namespace PDFExtractor
{
    // AtomicFileProcessing = true: each PDF is handed to the extractor whole,
    // not split into chunks, which a binary format like PDF requires.
    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
    public class PDFExtractor : IExtractor
    {
        // Emits one row per page: column 0 = page number (int), column 1 = page text (string).
        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            var reader = new PdfReader(input.BaseStream);
            try
            {
                // iTextSharp page numbers start at 1.
                for (var page = 1; page <= reader.NumberOfPages; page++)
                {
                    output.Set(0, page);
                    output.Set(1, ExtractText(reader, page));
                    yield return output.AsReadOnly();
                }
            }
            finally
            {
                // Release the reader once all pages have been emitted.
                reader.Close();
            }
        }

        public string ExtractText(PdfReader pdfReader, int pageNum)
        {
            var text = PdfTextExtractor.GetTextFromPage(pdfReader, pageNum, new LocationTextExtractionStrategy());
            // Escape CR/LF so that each page's text stays on a single line
            // when the rows are written to a text output.
            return text.Replace("\r", "\\r").Replace("\n", "\\n");
        }
    }
}

You can write something similar in Python.

Reference for the C# extractor: https://devblog.xyz/simple-pdf-text-extractor-adla/
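
For the Python route, here is a minimal sketch of the same per-page logic as the C# extractor: number the pages from 1 and escape line breaks so each page fits on one row. The function names (escape_newlines, pages_to_rows, read_pdf_pages) are made up for illustration. read_pdf_pages assumes the pypdf package is installed; whether a PDF library like that can be loaded inside the Azure Data Lake Analytics Python extension is something you would need to check, since nothing in this thread confirms it.

```python
from typing import Iterable, Iterator, Tuple


def escape_newlines(text: str) -> str:
    """Escape CR/LF so a page's text stays on one output line (same idea as the C# ExtractText)."""
    return text.replace("\r", "\\r").replace("\n", "\\n")


def pages_to_rows(page_texts: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, escaped_text) rows, numbering pages from 1."""
    for page_number, text in enumerate(page_texts, start=1):
        yield page_number, escape_newlines(text)


def read_pdf_pages(path: str) -> Iterator[str]:
    """Yield the raw text of each page of a PDF file.

    Assumption: pypdf is available. It is imported here, inside the function,
    so the two helpers above work without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(path)
    for page in reader.pages:
        # extract_text() can come back empty for image-only pages.
        yield page.extract_text() or ""


if __name__ == "__main__":
    # Stand-in page texts; with a real file you would pass read_pdf_pages("some.pdf").
    for row in pages_to_rows(["Invoice 42\nTotal: 10", "Page two\r\nmore text"]):
        print(row)
```

The rows are plain (int, str) tuples, so they can be loaded into a pandas DataFrame or written out as a delimited file for U-SQL to pick up and insert into a table.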

Hope this helps.

ChiragMishra-MSFT 956 Reputation points

2020-06-09T06:00:35.76+00:00

@Ibrahim S Iqbal Just checking in to see if the above answer helped. If it answers your query, please click "Accept Answer" and upvote it.

ChiragMishra-MSFT 956 Reputation points

2020-06-18T07:06:27.943+00:00

Hi @Ibrahim S Iqbal,

Was the above answer helpful to you? If yes, please consider accepting it as the answer, as that would help other community members who come across this thread with the same or a similar issue.
