Reading multiple Parquet files via wildcard in Synapse?

Abhinesh Kumar Lal Karn 0 Reputation points
2024-02-13T09:50:30.14+00:00

Unable to read multiple Parquet files via integrated datasets.

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Amira Bedhiafi 34,101 Reputation points Volunteer Moderator
    2024-02-13T10:01:24.2233333+00:00

    If you are using Spark pools in Azure Synapse, you can easily read multiple Parquet files by specifying the directory path or using a wildcard pattern in the path. The Spark DataFrame API allows you to read all Parquet files in a specified directory or match a pattern.

    # Read multiple Parquet files from a directory
    df = spark.read.parquet("/path/to/your/directory/")
    # Or using a wildcard pattern to match specific files
    df = spark.read.parquet("/path/to/your/directory/prefix*.parquet")
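Spark resolves these wildcards with Hadoop's glob syntax, where * matches within a path segment and ? matches a single character, much like shell globbing. As a local illustration of which file names a pattern such as prefix*.parquet would select (the file names below are made-up examples, not from the question), Python's fnmatch mimics the matching:

```python
from fnmatch import fnmatch

# Hypothetical directory listing, standing in for files in ADLS Gen2.
files = [
    "sales_2024-01.parquet",
    "sales_2024-02.parquet",
    "inventory_2024-01.parquet",
    "sales_2024-01.csv",
]

# Hadoop glob `*` behaves like shell globbing within a path segment;
# fnmatch is used here only as a local stand-in for that matching.
pattern = "sales_*.parquet"
matched = [f for f in files if fnmatch(f, pattern)]
print(matched)  # → ['sales_2024-01.parquet', 'sales_2024-02.parquet']
```

Note that spark.read.parquet also accepts several explicit paths at once, e.g. spark.read.parquet(path1, path2), which helps when the files do not share a common prefix.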
    

    Another approach is reading multiple Parquet files with a serverless SQL pool (formerly SQL on-demand). You generally specify the folder path, and Azure Synapse automatically reads all Parquet files within that folder. Wildcards in the OPENROWSET function work differently than Spark's glob patterns: the BULK path accepts URL wildcards such as * (within a path segment) and ** (recursive traversal into subfolders) rather than Hadoop glob syntax.

    SELECT *
    FROM OPENROWSET(
        BULK 'https://yourstorageaccount.dfs.core.windows.net/yourfilesystem/path/to/your/directory/',
        FORMAT='PARQUET'
    ) AS [result]
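
If you only want files matching a prefix rather than the whole folder, the BULK path itself can carry a wildcard. A sketch, reusing the same placeholder storage account, filesystem, and path as above (all of which you would replace with your own):

```sql
-- Wildcard in the BULK path: only Parquet files whose names match
-- the prefix pattern are read; ** instead would recurse into subfolders.
SELECT *
FROM OPENROWSET(
    BULK 'https://yourstorageaccount.dfs.core.windows.net/yourfilesystem/path/to/your/directory/prefix*.parquet',
    FORMAT = 'PARQUET'
) AS [result];
```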
    
    1 person found this answer helpful.
