Azure Synapse - performance issues when working with 2 notebooks

Michael Urinovsky 11 Reputation points
2022-03-31T13:34:02.693+00:00

Hello,
I am facing a performance issue in the following scenario:
I have one Spark pool that is used by two PySpark notebooks. Actions in Notebook #1 run properly, but in Notebook #2 executions get stuck on a very long session initialization. A cold start usually takes ~3 min, but in my case it takes 7 minutes or more. When I see that it is stuck, I restart the Spark pool.
Please advise what can be done to prevent or properly handle such performance issues.
Thank you

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

1 answer

  1. PRADEEPCHEEKATLA-MSFT 79,551 Reputation points Microsoft Employee
    2022-04-04T05:55:32.047+00:00

    Hello @Michael Urinovsky ,

    Thanks for the question and for using the MS Q&A platform.

    Unfortunately, Azure Synapse Analytics Spark pools do not support sharing a cluster among multiple users for interactive use cases.

    To demonstrate, I created two notebooks named Logging and Logging_Copy1 and ran both of them. As you can see, a single Apache Spark application is created to run a single notebook. If you run two notebooks on the same cluster, two Spark applications are created, and each one runs its notebook separately.

    189642-synapse-cluster-notebook.gif
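    One way to check this from inside each notebook is to print the identity of the Spark application behind the current session. The following is a minimal sketch, not an official Synapse utility: the helper name `describe_session` is hypothetical, and it assumes the notebook exposes the usual predefined `spark` session object. It uses only standard PySpark attributes (`sparkContext.applicationId`, `sparkContext.appName`, `conf.get`).

```python
def describe_session(spark):
    """Return identifying details of the Spark application behind a session.

    `spark` is expected to be a SparkSession (hypothetical helper, for
    illustration only).
    """
    sc = spark.sparkContext
    return {
        # Unique per Spark application; differs between two notebooks
        # that each run in their own application.
        "application_id": sc.applicationId,
        "application_name": sc.appName,
        # Executor settings, if they were set explicitly for the session.
        "executor_instances": spark.conf.get("spark.executor.instances", "not set"),
        "executor_cores": spark.conf.get("spark.executor.cores", "not set"),
    }

# In a notebook where `spark` is predefined, run in each notebook:
#   print(describe_session(spark))
```

    If the two notebooks report different application IDs, they are running as separate Spark applications, as described above.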

    Reason: Azure Synapse provides purpose-built engines for specific use cases. Apache Spark for Synapse is designed as a job service and not a cluster model. There are two scenarios where people ask for a multi-user cluster model.

    Scenario #1: Many users accessing a cluster for serving data for BI purposes.

    The easiest way to accomplish this is to cook the data with Spark and then take advantage of the serving capabilities of Synapse SQL, so that users can connect Power BI to those datasets.

    Scenario #2: Having multiple developers on a single cluster to save money.

    To satisfy this scenario, give each developer a serverless Spark pool that is set to use a small number of Spark resources. Since serverless Spark pools don't cost anything until they are actively used, this minimizes the cost when there are multiple developers. The pools share metadata (Spark tables), so developers can easily work with each other.

    Spark instances are created when you connect to a Spark pool, create a session, and run a job. As multiple users may have access to a single Spark pool, a new Spark instance is created for each user that connects.
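    To make the consequence concrete, here is a toy model in plain Python. It is not a Synapse API: the class `SparkPoolModel`, the node counts, and the all-or-nothing admission rule are assumptions made purely for illustration. It shows how a second session, which needs its own Spark instance, may be unable to start while an earlier session holds most of a fixed-size pool. Whether this is what happens in the scenario from the question is not confirmed in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class SparkPoolModel:
    """Toy model of a fixed-size pool (illustrative only, not a Synapse API)."""
    total_nodes: int
    sessions: dict = field(default_factory=dict)  # session name -> nodes held

    @property
    def free_nodes(self) -> int:
        return self.total_nodes - sum(self.sessions.values())

    def start_session(self, name: str, nodes: int) -> bool:
        """Return True if a new Spark instance of `nodes` nodes fits in the pool."""
        if nodes > self.free_nodes:
            return False  # not enough free capacity for a separate instance
        self.sessions[name] = nodes
        return True

    def stop_session(self, name: str) -> None:
        """Release the nodes held by a session, if it exists."""
        self.sessions.pop(name, None)


# Hypothetical numbers: a 10-node pool and two notebook sessions.
pool = SparkPoolModel(total_nodes=10)
first = pool.start_session("notebook_1", nodes=8)
second = pool.start_session("notebook_2", nodes=4)  # only 2 nodes are free
print(first, second, pool.free_nodes)

# Once the first session ends, the second one fits.
pool.stop_session("notebook_1")
retry = pool.start_session("notebook_2", nodes=4)
print(retry, pool.free_nodes)
```

    Under these assumptions, sizing each session small enough that both fit in the pool, or giving each developer a separate small pool as suggested above, avoids the contention.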

    For more details, refer to Azure Synapse Analytics frequently asked questions and Apache Spark in Azure Synapse Analytics Core Concepts - Examples.

    Hope this helps. Please let us know if you have any further queries.
