Hello Tian,
Thank you for posting on Microsoft Learn.
Your issue is most likely related to internal changes in how Python UDFs are executed in the new runtime, or to how the runtime interacts with Python dependencies or Spark execution plans.
Go to the Databricks Runtime 16.4 LTS Release Notes and look for:
- Python UDF execution changes
- Python version upgrades (e.g., 3.10.x to 3.11.x)
- Internal API or serialization differences
- Behavior changes in Arrow or Pandas UDFs
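Before combing through the notes, it helps to pin down exactly which versions changed between the two runtimes. A minimal sketch you could run in a notebook cell on both the old and the new cluster and diff the output (it assumes the standard `spark` session object available in Databricks notebooks):

```python
import sys
import pandas as pd
import pyarrow

# Compare these values between the old and new runtime
print("Python :", sys.version)
print("Spark  :", spark.version)
print("pandas :", pd.__version__)
print("pyarrow:", pyarrow.__version__)
```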
You can modify your UDF to include logging or print statements to identify whether the slowness is inside the UDF itself or in how Spark schedules tasks:
```python
def my_udf(x):
    import time
    start = time.time()
    result = ...  # your actual transformation goes here
    print(f"Processed {x} in {time.time() - start:.2f}s")
    return result
```
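Keep in mind that `print` output from inside a UDF ends up in the executor logs (viewable through the Spark UI), not in the notebook cell. A hedged sketch of wiring the timed UDF into a query, where `df` and its string column `value` are placeholder names:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

timed_udf = F.udf(my_udf, StringType())

# Force full execution without materializing output, so the timing reflects the UDF itself
df.select(timed_udf(F.col("value"))).write.format("noop").mode("overwrite").save()
```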
Some UDFs lazily import modules inside the function body, which adds overhead on every call. This can become more problematic with updated runtimes, so you may need to move imports outside the UDF:
```python
import json  # imported once per worker process, not on every call

def my_udf(x):
    return json.loads(x)
```
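If you also call the UDF from SQL, the same pattern applies when registering it. A small sketch, where `parse_json`, `raw_col`, and `some_table` are placeholder names:

```python
from pyspark.sql.types import StringType

# Register the module-level-import version for use in SQL
spark.udf.register("parse_json", my_udf, StringType())
spark.sql("SELECT parse_json(raw_col) AS parsed FROM some_table").show()
```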
If you're using `@udf` or `F.udf(...)`, try re-registering it with explicit return types:
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def my_udf(x):
    ...
```
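When no return type is given, PySpark assumes `StringType`, so declaring it explicitly mainly removes ambiguity from the plan. A quick usage sketch, with `df` and `value` again as placeholder names:

```python
# UDFs accept column names directly, so this applies my_udf to the `value` column
df.select(my_udf("value").alias("result")).show(5)
```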
Another thing: if you are using Pandas UDFs, try disabling Arrow temporarily:

```python
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
```

If this fixes the issue, it's likely due to an Arrow/PyArrow version mismatch or serialization bugs in the new runtime.
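As a follow-up experiment, you can compare a Pandas UDF against an equivalent row-at-a-time Python UDF on the same data; if the plain UDF is dramatically faster on 16.4, that also points at the Arrow serialization path. A hedged sketch where `df`, the column `value`, and the upper-casing body are all illustrative assumptions:

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.pandas_udf(StringType())
def upper_pandas(s: pd.Series) -> pd.Series:
    return s.str.upper()  # illustrative vectorized body

@F.udf(returnType=StringType())
def upper_plain(x):
    return x.upper() if x is not None else None  # same logic, row at a time

# Run both over the same column and compare wall-clock times in the Spark UI
df.select(upper_pandas("value")).write.format("noop").mode("overwrite").save()
df.select(upper_plain("value")).write.format("noop").mode("overwrite").save()
```

Remember to set `spark.sql.execution.arrow.pyspark.enabled` back to `"true"` once you're done testing.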