Partition pruning in Synapse serverless on Deltalake

Shlomi Lanton 76 Reputation points
2023-11-09T07:34:34.99+00:00

Hey,

I have a storage account with 2 containers:

  1. Big data with many Parquet files without any special format
  2. Big data in Parquet files with Delta format (with "_delta_log" folder)

I'm trying to query the data using Synapse, I created an EXTERNAL DATA SOURCE that pints to container 2 (see above) and a VIEW with OPENROWSET that uses:

  • The new external data source
  • BULK '/'
  • FORMAT = 'delta'

I'm able to query the data and also see my partitions, but partition pruning is not working. Even when I filter (with WHERE clause) on the partition column Synapse is scanning the entire table.

What I'm missing here?

Thanks.

Azure Data Lake Storage
Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
1,559 questions
Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
5,373 questions
0 comments No comments
{count} votes

Accepted answer
  1. KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator
    2023-11-17T01:53:36.0533333+00:00

    @Shlomi Lanton Yes, that could be the issue. Please ensure partition column is of a supported data type, such as integer or date.

    To enable partition pruning, please try changing the data type of the partition column to a supported data type, such as date as explained in example provide in below blog or the public document:

    1. Alternatives in Serverless SQL Pools for Delta
      Example view query:
    CREATE VIEW LDW.vwWebTelemetryDelta
    AS
    SELECT 
        UserID,
        EventType,
        ProductID,
        [URL] AS ProductURL,
        Device,
        SessionViewSeconds,
        EventYear,
        EventMonth,
        EventDate
    FROM
     OPENROWSET
    (
        BULK 'cleansed/webtelemetrydeltatwo',
        DATA_SOURCE = 'ExternalDataSourceDataLakeUKMI',
        FORMAT = 'DELTA'
    ) 
    WITH
    (
        UserID INT,
        EventType VARCHAR(20),
        ProductID SMALLINT,
        URL VARCHAR(25),
        Device VARCHAR(10),
        SessionViewSeconds INT,
        EventYear SMALLINT,
        EventMonth TINYINT,
        EventDate DATE
    ) AS fct
    
    1. Query partitioned data
      User's image

    Hope this info helps. Let us know how it goes.


    Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.


1 additional answer

Sort by: Most helpful
  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2023-11-09T16:53:14.8333333+00:00

    Can you verify that the columns you use in your WHERE clause match the columns used for partitioning in Delta Lake. Even a small discrepancy can lead to a full table scan.

    Check the _delta_log folder to confirm that the metadata accurately reflects the current partitioning scheme.

    Please check and revert to us !


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.