Synapse Pyspark: Load Parquet files created by CREATE EXTERNAL TABLE CETAS command

Question

Hi everybody,

One of my colleagues created an external table on the serverless SQL pool using a CETAS statement.
The Parquet files are located in subfolders defined by the LOCATION option.

CREATE EXTERNAL TABLE dbo.TEST
WITH (
LOCATION = 'Test/2020/01/01',
DATA_SOURCE = SOURCE,
FILE_FORMAT = PARQUET
)
Other files are then created in different date folders, following the pattern YYYY/MM/DD.

I would like to read the Parquet file for a specific date using PySpark,
with something like parDF = spark.read.parquet("/file_path/YYYY/MM/DD/156224SSQKDQHKDH.parquet")

The files are located on an ADLSv2 account.

Is there a way to achieve this?

Thanks for your help

Pete

Accepted Answer

Hello anonymous user,
Thanks for the question and using MS Q&A platform.

As we understand it, you are trying to read a Parquet file. Please let us know if that is not accurate.

You can do that by adding the storage account as a linked service in Synapse Studio.

Once that is done, navigate to the Parquet file and select "Load to dataframe".

This will create a script like

%%pyspark
df = spark.read.load('abfss://himanshu@Piepel.dfs.core.windows.net/NYCTaxi/PassengerCountStats.parquet/part-00000-21161a2b-1c65-4a76-9999-0b2403785f46-c000.snappy.parquet', format='parquet')
display(df.limit(10))
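
Since CETAS generates the file names, it may be easier to point Spark at the date folder rather than at an individual file. The following is only a sketch: the container name, storage account name, and base folder are placeholders, the helper functions are hypothetical, and it assumes the LOCATION path sits directly under the container root (this depends on how the external data source is defined).

```python
from datetime import date


def cetas_folder_path(container, account, base_folder, day):
    """Build the abfss URI of the YYYY/MM/DD folder for a given date.

    All arguments are illustrative placeholders, not values from the question.
    """
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{base_folder}/{day.strftime('%Y/%m/%d')}/"
    )


def read_cetas_day(spark, container, account, base_folder, day):
    """Read every Parquet file in that date folder into one DataFrame.

    `spark` is the SparkSession that a Synapse notebook already provides.
    """
    return spark.read.parquet(cetas_folder_path(container, account, base_folder, day))


# Path for the folder used in the question's LOCATION clause.
print(cetas_folder_path("mycontainer", "mystorageaccount", "Test", date(2020, 1, 1)))
```

In a Synapse notebook this could be used as df = read_cetas_day(spark, "mycontainer", "mystorageaccount", "Test", date(2020, 1, 1)), followed by display(df.limit(10)). Reading the folder avoids having to know the generated file name.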

Please let me know if you have any questions.
Thanks
Himanshu
