My Python UDF is really buggy after the runtime change from 16.3 to 16.4 LTS on Azure Databricks

Tian Chen 0 Reputation points
2025-06-02T19:30:17.26+00:00

I have some job compute that I set up earlier this year. About a week ago, all of the Python UDFs in my notebook became really buggy. Sometimes execution gets stuck, and execution times are all over the place: the same code can take anywhere from 5 seconds to 15 minutes. When I run the same notebook on runtime 16.3, it's all good, but on 16.4 LTS this issue started appearing.

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2025-06-02T21:20:28.6133333+00:00

    Hello Tian,

    Thank you for posting on Microsoft Learn.

    Your issue is likely related either to internal changes in how UDFs are executed, or to how the new runtime interacts with Python dependencies or Spark execution plans.

    Go to the Databricks Runtime 16.4 LTS Release Notes and look for:

    • Python UDF execution changes

    • Python version upgrades (e.g., 3.10.x to 3.11.x)

    • Internal API or serialization differences

    • Behavior changes in Arrow or Pandas UDFs

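    Independent of the release notes, a quick sanity check is to print the interpreter and key library versions on both a 16.3 and a 16.4 LTS cluster and diff the output (a minimal sketch; adjust the library list to whatever your UDFs actually use):

    import sys
    import pyspark
    import pandas
    import pyarrow

    # Run this cell on both runtimes and compare the results
    print("Python :", sys.version)
    print("PySpark:", pyspark.__version__)
    print("pandas :", pandas.__version__)
    print("PyArrow:", pyarrow.__version__)
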
    You can modify your UDF to include logging or print statements to identify whether the slowness is inside the UDF itself or in how Spark schedules tasks:

    import time

    def my_udf(x):
        start = time.time()
        result = ...  # your existing UDF logic goes here
        # Note: inside a Spark UDF, print output goes to the executor logs,
        # not to the notebook cell
        print(f"Processed {x} in {time.time() - start:.2f}s")
        return result
    
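    As a sketch of how you might exercise the instrumented function end to end (assuming a DataFrame df with a string column value and a string-returning UDF; both names are illustrative):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    timed_udf = udf(my_udf, StringType())

    # The noop sink forces a full pass over the data without writing anything;
    # the per-row timings end up in the executors' stdout logs
    df.withColumn("out", timed_udf("value")).write.format("noop").mode("overwrite").save()
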

    Some UDFs lazily import modules inside the function body. This can become problematic with updated runtimes, so you may need to move the imports outside the UDF:

    import json  # module-level import: resolved once, not on every call

    def my_udf(x):
        return json.loads(x)
    

    If you're using @udf or F.udf(...), try re-registering it with explicit types:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    @udf(returnType=StringType())
    def my_udf(x):
        ...  # your existing logic, unchanged
    
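    Equivalently, with the function form instead of the decorator (my_func here stands in for your existing plain Python function):

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    my_udf = F.udf(my_func, StringType())
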

    Another thing: if you are using Pandas UDFs, try temporarily disabling Arrow:

    # Disable Arrow-based columnar transfer in PySpark
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")

    If this fixes the issue, it is likely due to an Arrow/PyArrow version mismatch or serialization bugs in the new runtime.
    
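    To narrow down whether only Arrow-backed execution is affected, you could time a minimal Pandas UDF against synthetic data on both runtimes (a sketch; the column and function names are illustrative):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import LongType

    @pandas_udf(LongType())
    def plus_one(v: pd.Series) -> pd.Series:
        return v + 1

    # Synthetic data keeps the test independent of your own tables and UDF logic
    df = spark.range(1_000_000)
    df.withColumn("out", plus_one("id")).write.format("noop").mode("overwrite").save()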
