Why does the Spark pool take so long to start when a Spark notebook is called from a Synapse pipeline?

Heta Desai 357 Reputation points
2022-05-23T17:49:26.753+00:00

I have created a Synapse pipeline in which a Spark notebook is executed inside a ForEach loop for different objects. The operations performed inside the notebook are merging two tables, or a query that joins two tables.

There is one parent pipeline that executes child pipelines containing the Spark notebook activity. While running the pipeline I noticed that the notebook activity takes a long time to execute, even though there is no complex data processing and the data volume is very small.

My question is: does the Spark pool get started each time the notebook activity is executed inside the ForEach loop?

Azure Synapse Analytics

Accepted answer
  1. ShaikMaheer-MSFT 38,546 Reputation points Microsoft Employee Moderator
    2022-05-25T17:18:06.93+00:00

    Hi @Anonymous ,

    Thank you for posting query in Microsoft Q&A Platform.

    The Synapse notebook activity runs on the Spark pool selected in the Synapse notebook. When the notebook activity runs, the Spark pool takes time to start a Spark session; only once the session has started does the data processing actually begin.

    You can also select an Apache Spark pool in the activity's settings. Note that the Apache Spark pool set there overrides the one configured in the notebook itself; if no pool is selected in the activity's settings, the pool selected in the notebook is used to run it.

    [Screenshot: Apache Spark pool selection on the notebook activity's Settings tab]

    If you have multiple notebooks you would like to run, try chaining them: call the other notebooks from within a single notebook. That way you avoid using multiple Synapse notebook activities, and you do not pay the Spark session start-up cost every time.

    We can use %run <notebook> or mssparkutils.notebook.run("<notebook>") to run one notebook from another.
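
    As a minimal sketch of the second approach (PySpark in a Synapse notebook), a single "driver" notebook could loop over the objects and call a child notebook for each one in the same Spark session. The notebook name, object names, and parameter name below are hypothetical placeholders:

    ```python
    # Typically pre-loaded in Synapse notebooks; the explicit import also works.
    from notebookutils import mssparkutils

    # Hypothetical list of objects that the ForEach loop would otherwise iterate over.
    objects_to_merge = ["Customer", "Orders", "Invoices"]

    for obj in objects_to_merge:
        # Runs the child notebook in the current Spark session (no new pool start-up).
        # Arguments: notebook path, timeout in seconds, parameters passed to the
        # child notebook's parameter cell.
        result = mssparkutils.notebook.run(
            "MergeTablesNotebook",      # hypothetical child notebook name
            600,
            {"objectName": obj}
        )
        print(f"{obj}: {result}")
    ```

    The %run magic achieves something similar for a fixed notebook reference, but mssparkutils.notebook.run is easier to use inside a loop and returns the child notebook's exit value.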

    Below are a few useful videos explaining the above commands:

    Hope this helps. Please let us know if you have any further queries.

    -------------

    Please consider hitting Accept Answer. Accepted answers help the community as well.

    1 person found this answer helpful.
