Spark Optimizations - Delta Lake

Sivagnana Sundaram, Krithiga 31 Reputation points
2022-06-15T20:47:39.953+00:00

I am trying to do a set of transformations (multiple lookups) on a 13 million record Delta table (about 260 columns wide) in Synapse notebooks. The update on it takes about 30-40 minutes. The Spark session uses executors with 8 vCores and 56 GB of memory each (dynamically allocated, 1 to 5 executors).

I was just wondering whether this resource size is sufficient to process the 13 million records. From the Spark logs, it looks like the default partitioning is resulting in skew. I have tried salting techniques to reduce the skew, but that hasn't helped much (the approach I tried is sketched below for reference).
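For reference, a minimal PySpark sketch of a typical salting approach for a skewed lookup join. The names `df_large`, `df_small`, `join_key`, and `SALT_BUCKETS` are placeholders for illustration, not the actual table or column names:

```python
from pyspark.sql import functions as F

# Assumed: df_large is the skewed 13M-row Delta table, df_small is a lookup
# table, and both share the join column "join_key". SALT_BUCKETS is a tuning
# parameter; larger values spread hot keys across more partitions.
SALT_BUCKETS = 16

# Add a random salt to the skewed side so rows with the same hot key
# land in different partitions.
salted_large = df_large.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int")
)

# Replicate the small side once per salt value so every (key, salt)
# combination finds a match.
salted_small = df_small.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

# Join on both the original key and the salt, then drop the helper column.
result = salted_large.join(salted_small, on=["join_key", "salt"]).drop("salt")
```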

Could you please point me to some suggestions or resources that could help with the optimization?

Azure Synapse Analytics

1 answer

  1. AnnuKumari-MSFT 30,101 Reputation points Microsoft Employee
    2022-06-22T05:41:20.69+00:00

Hi @Sivagnana Sundaram, Krithiga,

We received a response from the product team on the above issue. Kindly have a look:

Regarding the resource size question, it is hard to say, since the number of executors in a dynamically allocated cluster depends on the transformation logic.

Regarding the data skew: in certain cases, Spark 3 (available on Synapse) can handle data skew out of the box. Please refer to this documentation and check whether one of the proposed solutions can be used to fix the data skew in your case.
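For illustration, a minimal sketch of enabling Spark 3's Adaptive Query Execution (AQE) skew-join handling in a notebook session. These are standard Spark 3 configuration keys; the threshold values shown are the Spark defaults and may need tuning for your workload:

```python
# Enable AQE, which can split skewed partitions at join time automatically.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Optional knobs that control when a partition is considered skewed:
# a partition is skewed if it is larger than skewedPartitionFactor times
# the median partition size AND larger than the byte threshold below.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
```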

    ----------------------------------

Please consider clicking Accept Answer and Up-Vote, as accepted answers help the community as well.