U-SQL + Python PDF file parsing in Azure Data Lake Analytics

Question

I need to extract data from PDF files and store the values in a table using Azure Data Lake Analytics. Can anyone help with examples or a procedure for achieving this?

Answer

Hi there,

Here are some resources for getting started with U-SQL in Azure Data Lake Analytics:

https://learn.microsoft.com/en-us/u-sql/

https://www.purplefrogsystems.com/paul/category/u-sql/

https://www.mssqltips.com/sqlservertip/5890/azure-data-lake-analytics-using-usql-queries/

For the scenario you describe, you would have to write a custom extractor to read the PDF. Here is a C# example that uses iTextSharp and emits one row per page (the page number and that page's text):

using System.Collections.Generic;  
using iTextSharp.text.pdf;  
using iTextSharp.text.pdf.parser;  
using Microsoft.Analytics.Interfaces;  
  
namespace PDFExtractor  
{  
    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]  
    public class PDFExtractor : IExtractor  
    {  
        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {  
            var reader = new PdfReader(input.BaseStream);  
            for (var page = 1; page <= reader.NumberOfPages; page++)  
            {  
                output.Set(0, page);  
                output.Set(1, ExtractText(reader, page));  
                yield return output.AsReadOnly();  
            }  
        }  
  
        public string ExtractText(PdfReader pdfReader, int pageNum)  
        {  
            var text = PdfTextExtractor.GetTextFromPage(pdfReader, pageNum, new LocationTextExtractionStrategy());  
            // Encode CR and LF as the literal sequences \r and \n so that each
            // page's text stays on a single line in the output file.
            return text.Replace("\r", "\\r").Replace("\n", "\\n");
        }  
    }  
}

You can write something similar in Python.
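
As a rough Python counterpart to the C# extractor above, here is a minimal sketch that produces the same (page number, text) rows. It assumes the third-party pypdf package is installed; whether that package can be loaded inside an ADLA job is not confirmed anywhere in this thread, so treat it as standalone Python rather than a drop-in extractor. The function names (encode_newlines, rows_from_pages, read_page_texts, extract_rows) are made up for illustration.

```python
import io
from typing import Iterable, Iterator, List, Tuple


def encode_newlines(text: str) -> str:
    """Replace real CR/LF characters with the literal sequences \\r and \\n."""
    return text.replace("\r", "\\r").replace("\n", "\\n")


def rows_from_pages(page_texts: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, encoded_text) rows, numbering pages from 1."""
    for page_number, text in enumerate(page_texts, start=1):
        yield page_number, encode_newlines(text)


def read_page_texts(pdf_bytes: bytes) -> List[str]:
    """Return the raw text of each page.

    Assumption: the third-party pypdf package is available. It is imported
    lazily so the helpers above can be used without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() may return an empty result for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def extract_rows(pdf_bytes: bytes) -> List[Tuple[int, str]]:
    """Read a PDF from bytes and return one (page_number, text) row per page."""
    return list(rows_from_pages(read_page_texts(pdf_bytes)))
```

Calling extract_rows on the bytes of a PDF (for example, the contents of a placeholder file such as sample.pdf opened in "rb" mode) would return a list with one tuple per page, which you could then write out as delimited text for U-SQL to load into a table.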

Ref - https://devblog.xyz/simple-pdf-text-extractor-adla/

Hope this helps.
