SparkContext should only be created and accessed on the driver

Aravind Kumar Peddola 1 Reputation point
2022-07-25T21:49:37.227+00:00

Hi Team,
I am using Azure Databricks (10.4 LTS, which includes Apache Spark 3.2.1 and Scala 2.12) on Standard_L8s nodes.

When I execute the code below, it fails with the error "SparkContext should only be created and accessed on the driver."
If I use plain pandas instead, it runs fine but takes more than 3 hours, and I have billions of records to process.
I need to tune this UDF; please help.

import pyspark.pandas as pd
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def getnearest_five_min_slot(value):
    # Lookup table of five-minute slot boundaries, in seconds.
    # Note: pyspark.pandas needs a SparkContext, which is not available
    # inside a UDF running on an executor; that is what raises the error.
    dataframe = pd.DataFrame([300, 600, 900, 1200, 1500, 1800, 2100, 2400,
                              2700, 3000, 3300, 3600], columns=['value'])
    # Keep the smallest slot boundary that is >= the input value.
    rslt_df = dataframe.loc[dataframe['value'] >= value]
    rslt_df = rslt_df.sort_values(by=['value'], ascending=[True]).head(1)
    output = int(rslt_df.iat[0, 0])
    print('\nResult dataframe :\n', output)
    return output

getnearestFiveMinSlot = udf(lambda m: getnearest_five_min_slot(m), IntegerType())

slotValue = [100, 500, 1100, 400, 601]
df = spark.createDataFrame(slotValue, IntegerType())
df = df.withColumn("NewValue", getnearestFiveMinSlot("value"))
display(df)


2 answers

  1. PRADEEPCHEEKATLA 90,246 Reputation points
    2022-07-26T09:32:36.057+00:00

    Hello @Anonymous ,

    Thanks for the question and using MS Q&A platform.

    To clarify a bit more - in Spark, you can never use a SparkContext or SparkSession within a task / UDF. This has always been true.

    In Spark 3.0 and below, a SparkContext could be created in executors. Since Spark 3.1, an exception is thrown when a SparkContext is created in an executor. You can allow it by setting the configuration spark.executor.allowSparkContext to true when creating the SparkContext, but keeping all SparkContext/SparkSession usage on the driver is the supported approach.
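
    As a side note, the lookup in your UDF does not need pyspark.pandas at all. Here is a minimal sketch (untested against your data) that computes the same next-300-second slot with a built-in column expression, avoiding both the error and the per-row Python UDF cost:

    from pyspark.sql.functions import ceil, col
    from pyspark.sql.types import IntegerType

    # Same sample data as in the question.
    df = spark.createDataFrame([100, 500, 1100, 400, 601], IntegerType())

    # ceil(value / 300) * 300 is the smallest multiple of 300 that is >= value
    # (e.g. 601 -> 900), matching the original lookup table for values in 1..3600.
    df = df.withColumn("NewValue", (ceil(col("value") / 300) * 300).cast("int"))
    display(df)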

    Hope this will help. Please let us know if any further queries.


  2. Manish 1 Reputation point
    2023-06-12T09:40:20.38+00:00

    I am working in a Synapse notebook and getting the same error when I try to use the code below:

    from notebookutils import mssparkutils
    from pyspark.sql.functions import udf, col

    def get_file_details(file_path):
        # Get file details using mssparkutils.fs.ls
        file_details = mssparkutils.fs.ls(file_path)
        print(file_details)
        return file_details

    # Register the UDF (the lambda passes its own argument through)
    get_file_details_udf = udf(lambda path: get_file_details(path))

    # Apply the UDF to the 'file_path' column to get file details
    df = df.withColumn("file_details", get_file_details_udf(col("file_path")))

    However, when I try to display the dataframe, I get the following error:

    PythonException: 
      An exception was thrown from the Python worker. Please see the stack trace below.
    Traceback (most recent call last):
      File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/notebookutils/__init__.py", line 4, in <module>
        from notebookutils.visualization import display, displayHTML, enableMatplotlib
      File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/notebookutils/visualization/__init__.py", line 1, in <module>
        from .display import display, display_mount_points
      File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/notebookutils/visualization/display.py", line 11, in <module>
        from notebookutils.common.logger import log4jLogger
      File "/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/notebookutils/common/logger.py", line 6, in <module>
        sc = SparkSession.Builder().getOrCreate().sparkContext
      File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 269, in getOrCreate
        sc = SparkContext.getOrCreate(sparkConf)
      File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 484, in getOrCreate
        SparkContext(conf=conf or SparkConf())
      File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 188, in __init__
        SparkContext._assert_on_driver()
      File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 1545, in _assert_on_driver
        raise RuntimeError("SparkContext should only be created and accessed on the driver.")
    RuntimeError: SparkContext should only be created and accessed on the driver.
    

    I also tried creating the Spark config as below, but I still get the same error.

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Create a SparkConf object and set the necessary configurations
    conf = SparkConf().set("spark.executor.allowSparkContext", "true")

    # Create a SparkSession with the configured SparkConf
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
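
    A driver-only sketch (not part of the original post; it assumes the distinct paths are few enough to collect to the driver, and that mssparkutils.fs.ls entries expose a name attribute): since mssparkutils needs the driver's SparkContext, list the files on the driver and join the results back instead of calling it from a UDF.

    from notebookutils import mssparkutils

    # Collect the distinct paths to the driver (assumed to be a small set).
    paths = [r["file_path"] for r in df.select("file_path").distinct().collect()]

    # Call mssparkutils.fs.ls on the driver for each path.
    rows = [(p, [f.name for f in mssparkutils.fs.ls(p)]) for p in paths]

    # Join the listings back to the original DataFrame.
    details_df = spark.createDataFrame(rows, ["file_path", "file_details"])
    df = df.join(details_df, on="file_path", how="left")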
    
