Synapse Spark pipeline - choosing cluster

Ryan Abbey 1,181 Reputation points
2021-08-31T23:18:30.223+00:00

Say I have some small files and some very large files, all being processed via Synapse pipeline calls to Spark. How do I specify that the small files should run on a small cluster and the very large files on a bigger one? There does not seem to be much documentation on where Spark sessions should run.

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

Accepted answer
  1. PRADEEPCHEEKATLA 90,266 Reputation points
    2021-09-01T08:56:13.197+00:00

    Hello @Ryan Abbey ,

    Thanks for the question and using MS Q&A platform.

    Unfortunately, there is no built-in mechanism to route jobs to different clusters based on file size.

    However, Azure Synapse does provide elastic compute out of the box in Apache Spark pools: a pool can automatically scale its compute resources up and down based on the amount of activity, so a single pool can accommodate both small and large workloads.

    • When the autoscale feature is enabled, you can set the minimum and maximum number of nodes to scale.
    • When the autoscale feature is disabled, the number of nodes set will remain fixed.
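
    As an illustration, an autoscale-enabled pool can be created with the Azure CLI. The resource group, workspace, pool name, node size, and node counts below are placeholders, not recommendations:

    ```shell
    # Create a Spark pool that autoscales between 3 and 10 Medium nodes.
    # Replace the resource group, workspace, and pool names with your own.
    az synapse spark pool create \
      --name demopool \
      --workspace-name my-workspace \
      --resource-group my-rg \
      --spark-version 3.1 \
      --node-size Medium \
      --node-count 3 \
      --enable-auto-scale true \
      --min-node-count 3 \
      --max-node-count 10
    ```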

    For more details, refer to Apache Spark pool configurations in Azure Synapse Analytics and Automatically scale Azure Synapse Analytics Apache Spark pools.
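
    If you do want separate pools for small and large files, one pattern (a sketch, not a built-in Synapse feature) is to create two Spark pools and have the pipeline pick one per file: a Get Metadata activity supplies the file size, and an If Condition or Switch activity invokes the notebook activity bound to the matching pool. The pool names and the 1 GiB threshold below are assumptions for illustration:

    ```python
    # Hypothetical routing helper: pick a Spark pool name by file size.
    # Pool names and the 1 GiB threshold are placeholders, not Synapse defaults.

    SMALL_POOL = "smallpool"   # e.g. a fixed 3-node pool of Small nodes
    LARGE_POOL = "largepool"   # e.g. an autoscaling pool of Large nodes

    def choose_pool(file_size_bytes: int, threshold_bytes: int = 1 << 30) -> str:
        """Return the name of the Spark pool a pipeline should target."""
        return LARGE_POOL if file_size_bytes >= threshold_bytes else SMALL_POOL

    # Example: a 10 KB file routes to the small pool, a 5 GiB file to the large one.
    print(choose_pool(10_000))          # smallpool
    print(choose_pool(5 * (1 << 30)))   # largepool
    ```

    The same comparison can be expressed directly in pipeline expression language inside an If Condition activity; the Python version above just makes the logic explicit.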

    Hope this helps. Do let us know if you have any further queries.

    ---------------------------------------------------------------------------

    Please "Accept the answer" if the information helped you. This will help us and others in the community as well.

