Error [CONTEXT_ONLY_VALID_ON_DRIVER] in Databricks

Shambhu Rai 1,411 Reputation points
2024-01-15T17:30:23.7233333+00:00

Hi Expert, I am using the below UDF function in a Databricks merge condition but getting the error 'pyspark.errors.exceptions.base.PySparkRuntimeError: [CONTEXT_ONLY_VALID_ON_DRIVER] It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.', from <command-2591425064644209>, line 10. Full traceback below:

function attached in notepad




Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. PRADEEPCHEEKATLA 90,641 Reputation points Moderator
    2024-01-16T01:56:30.1333333+00:00

    Shambhu Rai - Thanks for the question and using MS Q&A platform.

    The error message indicates that you are trying to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that runs on workers.

    In your code, it seems that you are passing the spark object to the ufnGetGlobalSiteIdForTxns function, which is causing the error. The body of a Python UDF is executed on the workers, so it can never use the spark object or call spark.sql — creating a SparkSession inside the UDF would fail for the same reason. Since ufnGetGlobalSiteIdForTxns is already a SQL function, the simplest fix is to call it directly in the source query of the MERGE statement, so everything runs as a single Spark SQL plan driven from the driver.

    Here's an updated version of your code:

    spark.sql("""
    MERGE INTO ErrorDeleted AS target
        USING (
            SELECT *,
                   ufnGetGlobalSiteIdForTxns(CHF_PICKUP_COMPANY, CHF_SUPPLY_COUNTRY, 'CDX',
                                             CHF_GLOPS_LOS_NO_X, CHF_EXTERNAL_LOS_NO) AS GlobalSiteid
            FROM source_query
        ) AS source
        ON (target.txnguid = source.txnguid)
        WHEN MATCHED THEN
        UPDATE SET
            GlobalSiteid = source.GlobalSiteid
    """)


    In this updated code, ufnGetGlobalSiteIdForTxns is invoked directly inside the MERGE statement's source query instead of being wrapped in a Python UDF. No Python function runs on the workers, so no SparkContext is referenced from worker code and the [CONTEXT_ONLY_VALID_ON_DRIVER] error does not occur.
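    If you do still need to assemble the function call as a string on the driver (as your original .format-based code did), escape the string literals so a value containing a quote cannot produce malformed SQL. A minimal pure-Python sketch — the helper names here are illustrative, not part of any Spark or Databricks API:

    ```python
    def sql_literal(value):
        """Render a Python value as a single-quoted SQL string literal,
        doubling any embedded single quotes."""
        return "'" + str(value).replace("'", "''") + "'"

    def build_site_id_call(pickup_company, supply_country, glops_los_no_x, external_los_no):
        """Build the ufnGetGlobalSiteIdForTxns(...) call expression as text,
        with the constant 'CDX' third argument used in the answer above."""
        args = [
            sql_literal(pickup_company),
            sql_literal(supply_country),
            sql_literal("CDX"),
            sql_literal(glops_los_no_x),
            sql_literal(external_los_no),
        ]
        return "ufnGetGlobalSiteIdForTxns({})".format(", ".join(args))
    ```

    For example, build_site_id_call("ACME", "US", "L1", "X9") returns ufnGetGlobalSiteIdForTxns('ACME', 'US', 'CDX', 'L1', 'X9'), and a value like O'Brien is rendered as 'O''Brien' rather than breaking the query.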

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

    1 person found this answer helpful.
