creating an External table over partitioned parquet files

Question

I have a serverless SQL in Synapse and partitioned Parquet files in ADLS gen 2.
I am able to create a view over it using the OPENROWSET and derive the partitioned keys using the FILEPATH syntax like this:

create view dbo.testView as
select ,r.filepath(1) as Year, r.filepath(2) as Month from
OPENROWSET (
BULK 'Year=/Month=*/',
DATA_SOURCE = 'mysample',
FORMAT ='PARQUET'
) as [r]

Although a partitioned parquet file can be used to create an external table, I only have access to the columns that have been stored in the parquet files. The partitioned keys of Parquet files have been dropped and stored in the folder hierarchy names, but I was unable to determine how to retrieve them.

Accepted Answer

I verified this does work with CETAS, even if there isn't an explicit example in the docs.

You need to create an external datasource where the new table will be located:

CREATE EXTERNAL DATA SOURCE [ExternalDataSource] WITH
(
    LOCATION = 'https://.blob.core.windows.net/'
)

And a file format, which is simply parquet + the data compression style:

CREATE EXTERNAL FILE FORMAT snappy
WITH
(  
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
)

Once that is done, just feed your OPENROWSET into the external table command just like it is for your view:

CREATE EXTERNAL TABLE table_name
WITH (
    LOCATION = 'external_tables/',
    DATA_SOURCE = ExternalDataSource,  
    FILE_FORMAT = snappy
)  
AS
select *, r.filepath(1) as Year, r.filepath(2) as Month from
OPENROWSET (
BULK 'Year=*/Month=*/*.parquet',
DATA_SOURCE = 'mysample',
FORMAT ='PARQUET'
) as [r]

Share via

creating an External table over partitioned parquet files

0 additional answers