Reading multiple Parquet files via wildcard in Synapse?

Abhinesh Kumar Lal Karn 0 Reputation points
2024-02-13T09:50:30.14+00:00

Unable to read multiple Parquet files via integrated datasets.

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Amira Bedhiafi 34,101 Reputation points Volunteer Moderator
    2024-02-13T10:01:24.2233333+00:00

    If you are using Spark pools in Azure Synapse, you can easily read multiple Parquet files by specifying the directory path or using a wildcard pattern in the path. The Spark DataFrame API allows you to read all Parquet files in a specified directory or match a pattern.

    # Read multiple Parquet files from a directory
    df = spark.read.parquet("/path/to/your/directory/")
    # Or using a wildcard pattern to match specific files
    df = spark.read.parquet("/path/to/your/directory/prefix*.parquet")
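Spark resolves these wildcards with Hadoop's glob syntax, where * matches within a path segment and ? matches a single character, much like shell globbing. As a local illustration of which file names a pattern such as prefix*.parquet would select (the file names below are made-up examples, not from the question), Python's fnmatch mimics the matching:

```python
from fnmatch import fnmatch

# Hypothetical directory listing, standing in for files in ADLS Gen2.
files = [
    "sales_2024-01.parquet",
    "sales_2024-02.parquet",
    "inventory_2024-01.parquet",
    "sales_2024-01.csv",
]

# Hadoop glob `*` behaves like shell globbing within a path segment;
# fnmatch is used here only as a local stand-in for that matching.
pattern = "sales_*.parquet"
matched = [f for f in files if fnmatch(f, pattern)]
print(matched)  # → ['sales_2024-01.parquet', 'sales_2024-02.parquet']
```

Note that spark.read.parquet also accepts several explicit paths at once, e.g. spark.read.parquet(path1, path2), which helps when the files do not share a common prefix.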
    

    Another approach is reading multiple Parquet files with a serverless SQL pool (formerly SQL on-demand). You generally specify the folder path, and Azure Synapse automatically reads all Parquet files within that folder. Wildcards in the OPENROWSET function work differently than Spark's glob patterns: the BULK path accepts URL wildcards such as * (within a path segment) and ** (recursive traversal into subfolders) rather than Hadoop glob syntax.

    SELECT *
    FROM OPENROWSET(
        BULK 'https://yourstorageaccount.dfs.core.windows.net/yourfilesystem/path/to/your/directory/',
        FORMAT='PARQUET'
    ) AS [result]
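
If you only want files matching a prefix rather than the whole folder, the BULK path itself can carry a wildcard. A sketch, reusing the same placeholder storage account, filesystem, and path as above (all of which you would replace with your own):

```sql
-- Wildcard in the BULK path: only Parquet files whose names match
-- the prefix pattern are read; ** instead would recurse into subfolders.
SELECT *
FROM OPENROWSET(
    BULK 'https://yourstorageaccount.dfs.core.windows.net/yourfilesystem/path/to/your/directory/prefix*.parquet',
    FORMAT = 'PARQUET'
) AS [result];
```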
    
    1 person found this answer helpful.
