U-SQL + Python PDF file parsing in Azure Data Lake Analytics

Ibrahim S Iqbal 1 Reputation point
2020-05-26T12:20:19.85+00:00

I need to extract data from PDF files and store the values in a table using Data Lake Analytics. Can anyone help with some examples or a procedure for achieving this?

Azure Data Lake Analytics

1 answer

  1. ChiragMishra-MSFT 951 Reputation points
    2020-05-27T06:53:43.763+00:00

    Hi there,

    Here are some resources for getting started with U-SQL in Azure Data Lake Analytics:

    https://learn.microsoft.com/en-us/u-sql/

    https://www.purplefrogsystems.com/paul/category/u-sql/

    https://www.mssqltips.com/sqlservertip/5890/azure-data-lake-analytics-using-usql-queries/

    For the scenario you describe, you would have to write a custom extractor to read the PDF. Here is a C# example:

    using System.Collections.Generic;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;
    using Microsoft.Analytics.Interfaces;

    namespace PDFExtractor
    {
        // AtomicFileProcessing = true: the whole PDF is handed to a single extractor instance.
        [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
        public class PDFExtractor : IExtractor
        {
            // Emits one row per page: column 0 = page number, column 1 = page text.
            public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
            {
                var reader = new PdfReader(input.BaseStream);
                try
                {
                    for (var page = 1; page <= reader.NumberOfPages; page++)
                    {
                        output.Set(0, page);
                        output.Set(1, ExtractText(reader, page));
                        yield return output.AsReadOnly();
                    }
                }
                finally
                {
                    // Release the reader even if enumeration stops early.
                    reader.Close();
                }
            }

            public string ExtractText(PdfReader pdfReader, int pageNum)
            {
                var text = PdfTextExtractor.GetTextFromPage(pdfReader, pageNum, new LocationTextExtractionStrategy());
                // Escape CR and LF so each page's text stays on a single line in the output file.
                return text.Replace("\r", "\\r").Replace("\n", "\\n");
            }
        }
    }
    

    You can write something similar in Python.
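
As a rough illustration, the row shape and newline escaping used by the C# extractor above could look like this in Python. This is only a sketch of the logic: it assumes the text of each page has already been obtained from a PDF library of your choice, it does not show how the code would be wired into a U-SQL job, and the names `escape_newlines`, `pages_to_rows`, and `sample_pages` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def escape_newlines(text: str) -> str:
    """Encode CR and LF as literal \\r and \\n so a page stays on one line."""
    return text.replace("\r", "\\r").replace("\n", "\\n")


def pages_to_rows(page_texts: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, escaped_text) rows, numbering pages from 1."""
    for page_number, text in enumerate(page_texts, start=1):
        yield page_number, escape_newlines(text)


# Hypothetical sample standing in for text pulled from a two-page PDF.
sample_pages = ["Invoice 42\r\nTotal: 10", "Page two"]
rows = list(pages_to_rows(sample_pages))
```

Each tuple in `rows` corresponds to one output row of the C# extractor: the page number followed by that page's text on a single line.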

    Ref - https://devblog.xyz/simple-pdf-text-extractor-adla/

    Hope this helps.