creating an External table over partitioned parquet files

Erfan Mahmoodnejad 41 Reputation points
2021-05-13T02:06:04.43+00:00

I have a serverless SQL in Synapse and partitioned Parquet files in ADLS gen 2.
I am able to create a view over it using the OPENROWSET and derive the partitioned keys using the FILEPATH syntax like this:

create view dbo.testView as
select ,r.filepath(1) as Year, r.filepath(2) as Month from
OPENROWSET (
BULK 'Year=
/Month=*/',
DATA_SOURCE = 'mysample',
FORMAT ='PARQUET'
) as [r]

Although a partitioned parquet file can be used to create an external table, I only have access to the columns that have been stored in the parquet files. The partitioned keys of Parquet files have been dropped and stored in the folder hierarchy names, but I was unable to determine how to retrieve them.

Azure Data Lake Storage
Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
1,403 questions
Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,597 questions
{count} votes

Accepted answer
  1. Samara Soucy - MSFT 5,051 Reputation points
    2021-05-19T03:08:43.043+00:00

    I verified this does work with CETAS, even if there isn't an explicit example in the docs.

    You need to create an external datasource where the new table will be located:

    CREATE EXTERNAL DATA SOURCE [ExternalDataSource] WITH
    (
        LOCATION = 'https://<STORAGEACCOUNT>.blob.core.windows.net/<CONTAINER>'
    )
    

    And a file format, which is simply parquet + the data compression style:

    CREATE EXTERNAL FILE FORMAT snappy
    WITH
    (  
        FORMAT_TYPE = PARQUET,
        DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
    )
    

    Once that is done, just feed your OPENROWSET into the external table command just like it is for your view:

    CREATE EXTERNAL TABLE table_name
    WITH (
        LOCATION = 'external_tables/',
        DATA_SOURCE = ExternalDataSource,  
        FILE_FORMAT = snappy
    )  
    AS
    select *, r.filepath(1) as Year, r.filepath(2) as Month from
    OPENROWSET (
    BULK 'Year=*/Month=*/*.parquet',
    DATA_SOURCE = 'mysample',
    FORMAT ='PARQUET'
    ) as [r]
    

0 additional answers

Sort by: Most helpful