Synapse Spark pipeline - choosing cluster

Ryan Abbey 1,186 Reputation points
2021-08-31T23:18:30.223+00:00

Say I have some small files and some very large files all being processed via Synapse pipeline calls to Spark... how do I specify that the small files should run on a small cluster and the very large files on a bigger one? There does not seem to be much documentation available on where Spark sessions should run.

Azure Synapse Analytics

An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

Answer accepted by question author
  1. PRADEEPCHEEKATLA 91,861 Reputation points
    2021-09-01T08:56:13.197+00:00

    Hello @Ryan Abbey ,

    Thanks for the question and using MS Q&A platform.

    Unfortunately, there is no built-in mechanism to route jobs to differently sized clusters based on file size.

    However, Azure Synapse does provide autoscaling out of the box in Apache Spark pools: pools can automatically scale compute resources up and down based on the amount of activity.

    • When the autoscale feature is enabled, you can set the minimum and maximum number of nodes to scale.
    • When the autoscale feature is disabled, the number of nodes set will remain fixed.

    For more details, refer to Apache Spark pool configurations in Azure Synapse Analytics and Automatically scale Azure Synapse Analytics Apache Spark pools.
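    Since there is no built-in size-based routing, one common workaround is to branch in the pipeline itself (for example, a Get Metadata activity feeding an If Condition) so that each branch submits its Spark work to a differently sized pool. The pool names and the size threshold below are illustrative assumptions, not anything defined by Synapse; this is just a minimal sketch of the routing decision a pipeline branch would encode:

    ```python
    # Sketch: pick a Spark pool name based on input file size.
    # "smallpool", "largepool", and the 1 GiB cutoff are hypothetical
    # values you would define in your own workspace, not Synapse defaults.

    SIZE_THRESHOLD_BYTES = 1 * 1024**3  # 1 GiB cutoff (assumption)

    def choose_spark_pool(file_size_bytes: int) -> str:
        """Return the name of the Spark pool a pipeline branch should target."""
        if file_size_bytes <= SIZE_THRESHOLD_BYTES:
            return "smallpool"   # fixed, small pool for small files
        return "largepool"       # autoscaling pool with a higher max node count

    print(choose_spark_pool(10 * 1024**2))  # a 10 MiB file -> smallpool
    print(choose_spark_pool(5 * 1024**3))   # a 5 GiB file  -> largepool
    ```

    In a Synapse pipeline the same comparison would typically be written as an expression on the Get Metadata activity's `size` output, with each If Condition branch pointing at a notebook or Spark job definition attached to the appropriate pool.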

    Hope this helps. Do let us know if you have any further queries.

    ---------------------------------------------------------------------------

    Please "Accept the answer" if the information helped you. This will help us and others in the community as well.

