Share via

Querying partitioned data with SQL on-demand?

Veli-Jussi Raitila 461 Reputation points
2020-08-18T11:25:07.39+00:00

It is common to partition data within data lakes and this is reflected in the form of a folder hierarchy.

What is the proper way to query partitioned data with SQL on-demand within Azure Synapse Analytics, as according to my understanding e.g. OPENROWSET does not process subfolders at all?

Although not strictly an example of Hadoop-style partitioning, Azure Event Hubs Capture outputs the Avro files in a deeply nested structure. How would you process this with SQL on-demand (ignoring for a moment that Avro is not supported as file format either)?

Azure Synapse Analytics
Azure Synapse Analytics

An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.


1 answer

Sort by: Most helpful
  1. MartinJaffer-MSFT 26,161 Reputation points
    2020-08-19T22:20:52.603+00:00

    Hello VeliJussiRaitila-8710 and thank you for your question.

    To my understanding OPENROWSET can process multiple files from multiple folders.

    Suppose we have files stored in path like mydata/year/month/day/file :

    mydata/2020/02/03/file1.csv
    mydata/2020/02/03/file2.csv
    mydata/2020/02/15/file1.csv
    mydata/2020/02/15/file2.csv
    mydata/2020/03/20/file1.csv

    If you want to read all the above files, it would look like

       SELECT *  
       FROM OPENROWSET(  
            BULK 'mydata/2020/*/*/file*.csv' ,  
            DATA_SOURCE = 'myExternalDataSource' ,  
            FORMAT = 'CSV', PARSER_VERSION = '2.0',  
            FIRSTROW = 2  
       )  
    

    If I wanted to select data from all months, but only days after 14, it could look something like

       SELECT r.filepath(1) as [MONTH], r.filepath(2) as [DAY], data  
       FROM OPENROWSET(  
            BULK 'mydata/2020/*/*/file*.csv' ,  
            DATA_SOURCE = 'myExternalDataSource' ,  
            FORMAT = 'CSV', PARSER_VERSION = '2.0',  
            FIRSTROW = 2  
       )  
       AS [r]  
       WHERE r.filepath(2) > '14'  
    

    References:
    Use filename and filepath functions to target specific partitions
    Use file metadata in queries
    Query folders and multiple files

    Please let me know if this helps, or if I misunderstood your ask.
    Thanks,
    Martin

    Was this answer helpful?

    1 person found this answer helpful.
    0 comments No comments

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.