How to create and use separate Spark sessions/apps within the same notebook session?

Martin B 121 Reputation points
2024-08-27T21:31:16.23+00:00

We process data in various logical steps from raw to curated, persisting intermediate results for each step; each following step starts by reading its predecessor's results (similar to the medallion architecture pattern). This allows us to pause processing, have "checkpoints" if anything goes wrong, and resume from any of the steps if needed. In total we have four high-level steps (each containing lots of transformations).

Our first idea was to run every step in its own notebook. But since spinning up a notebook takes a couple of minutes in Synapse, we ended up running all of the steps within the same notebook session (and using a single Spark session). We think this gives us the best runtime/cost efficiency, since we do not need to wait between steps for a new notebook session to be created (and pay for the ramp-up time).

This works fine from an execution perspective, but it comes with one huge downside: we can no longer use the Spark UI properly, because so many Spark jobs (> 4,000) are executed that it is hard to tell which Spark job corresponds to which step.

I was wondering whether it would be possible to still run everything within the same notebook session, but to create a new Spark session/Spark app every time a step is completed. Basically something like this:

from pyspark.sql import SparkSession  # needed for the builder used in steps 2 and 3

# Step 1 (uses the Spark session the notebook creates automatically)
df1 = spark.read.[...]
df1 = transform1(df1)
df1.write.[...]
spark.stop()

# Step 2
spark = SparkSession.builder.appName("step2").[...].getOrCreate()
df2 = spark.read.[...]
df2 = transform2(df2)
df2.write.[...]
spark.stop()

# Step 3
spark = SparkSession.builder.appName("step3").[...].getOrCreate()
df3 = spark.read.[...]
df3 = transform3(df3)
df3.write.[...]
spark.stop()

# final termination of the notebook session
mssparkutils.session.stop()

When visiting the Spark History Server, this lists three Spark applications - just as I hoped for. But when navigating to one of the custom-created Spark sessions and selecting one of the detail tabs (e.g. "SQL" or "Storage"), it only shows a generic error: No platform found for URL: ...

So, I have various questions:

  1. Does the overall idea of having separate Spark sessions/apps within the same notebook session make sense for improving traceability / monitoring?
  2. Does Synapse support this? If so, ...
    1. What do I need to do so that my custom Spark sessions also show up properly in the Spark UI later on?
    2. How do I create a new Spark session that is configured just like the Spark session the spark variable references automatically after a notebook session starts?

Accepted answer
  Bhargava-MSFT 31,116 Reputation points Microsoft Employee
    2024-08-28T00:38:45.4266667+00:00

    Hello Martin B,

    Synapse creates a Spark session automatically when you start a notebook. This session is managed by the Synapse service and is typically intended to last for the entire duration of the notebook execution. Stopping and restarting the Spark session using spark.stop() in your code may disrupt the integration between the notebook and the underlying Synapse service, leading to incomplete or missing UI elements in the Spark History Server.

    I have personally seen errors such as the driver running into out-of-memory situations when calling spark.stop().

    Alternatives:

    The cleanest way to separate your Spark jobs would be to use different notebooks or pipelines for each step. This ensures each step runs as a distinct Spark application, with its own Spark session and context. Yes, there's a startup cost, but this approach simplifies monitoring, logging, and troubleshooting. You can orchestrate these notebooks in an Azure Synapse pipeline if you want to maintain the overall flow and use the built-in monitoring features.

    If using separate notebooks isn't an option due to startup-time concerns, another approach is to keep a single session and group and label the related jobs of each step so they are easier to identify in the Spark UI, as sketched below.
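    A minimal sketch of that idea, assuming the standard SparkContext job-group APIs (setJobGroup/clearJobGroup are core PySpark; the step names, path variables and transform functions below are illustrative placeholders, not real values from your workspace):

    # Tag every job a step triggers so it appears grouped and labelled in the Spark UI.
    spark.sparkContext.setJobGroup("step-1", "Step 1: raw -> intermediate")
    df1 = spark.read.parquet(step1_input_path)      # placeholder path variable
    df1 = transform1(df1)                           # transform function as in your example
    df1.write.parquet(step1_output_path)            # placeholder path variable
    spark.sparkContext.clearJobGroup()

    spark.sparkContext.setJobGroup("step-2", "Step 2: intermediate -> curated")
    # ... step 2 reads, transforms and writes ...
    spark.sparkContext.clearJobGroup()

    Every job submitted while a group is active carries that group id and description, so the Jobs page of the Spark UI can be read per step without stopping the session.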

    The article below explains how multiple SparkSessions can be created under one SparkContext:

    https://www.waitingforcode.com/apache-spark-sql/multiple-sparksession-one-sparkcontext/read
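
    A minimal sketch of that pattern (newSession() is part of the public SparkSession API; note that the new session shares the existing SparkContext, so the History Server still shows a single Spark application):

    from pyspark.sql import SparkSession

    base = SparkSession.builder.getOrCreate()       # the session Synapse already created
    step2_session = base.newSession()               # isolated SQL conf, temp views and UDF registry
    step2_session.conf.set("spark.sql.shuffle.partitions", "64")    # illustrative per-session setting
    print(base.sparkContext is step2_session.sparkContext)          # True: same context, same application

    This isolates per-step configuration and temporary objects, but it does not split the run into multiple applications the way stopping and recreating the session would.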

    Also, the Microsoft document below provides guidance on how Spark is managed within Synapse:

    https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-development-using-notebooks

    I hope this helps.

