Databricks-Connect also returns ModuleNotFoundError for jobs with multiple Python files

Dong Yuan 1 Reputation point
2020-07-09T23:26:37.607+00:00

I'm currently connecting to Databricks from local VS Code via databricks-connect, but every submission fails with a "module not found" error, meaning the code in my other Python files isn't found.
I tried:

Moving the code into the same folder as main.py
Importing the file inside the function that uses it
Adding the file via sparkContext.addPyFile (see the sketch below)

Does anyone have experience with this? Or is there a better way to work with Databricks for Python projects?

It seems that my plain Python code is executed in the local Python environment and only the code directly related to Spark runs on the cluster, but the cluster does not load all my Python files, which then raises the error.
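
For reference, a minimal sketch of my addPyFile attempt; the file and function names (helpers.py, transform_row) are placeholders for my actual project:

    # main.py -- minimal sketch; helpers.py / transform_row are placeholders
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Ship the second source file to the cluster so executors can unpickle
    # functions defined in it (addPyFile accepts .py, .zip and .egg paths)
    sc.addPyFile("helpers.py")

    from helpers import transform_row  # defined in the second local .py file

    rdd = sc.parallelize([1, 2, 3])
    print(rdd.map(transform_row).collect())  # ModuleNotFoundError is raised here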

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

6 answers

  1. Kamel.ST 6 Reputation points
    2021-03-17T14:03:52.37+00:00

    I have the exact same issue, despite using databricks-connect 7.3 on a cluster running 7.3 LTS ML.

    Can anyone explain the root cause of this issue? In the documentation it looks pretty straightforward, since we just need to call the addPyFile function...

    Many thanks


  2. Jonas 26 Reputation points
    2020-07-10T17:46:55.703+00:00

    Hello,

    I experience the same problem; my setup is as follows:

    Databricks 6.6 cluster
    Databricks-Connect 6.6
    All other dependencies and configuration unchanged since DB 6.2, when everything still worked fine (but that runtime got deprecated, so I had to update)

    I have a custom package with the following structure:

    Package_Folder.zip
        Package_Folder/
            __init__.py
            Modules_Folder/
                __init__.py
                Custom_Module.py

    I added the ZIP file to the local Databricks-Connect SparkContext via sc.addPyFile(Path_to_Package_Folder.zip).
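
    Roughly like this (the local path below is a placeholder):

        # Sketch of the failing setup; the path is a placeholder
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()
        sc = spark.sparkContext
        sc.addPyFile("/local/path/to/Package_Folder.zip")

        # The import itself succeeds on the driver (addPyFile also puts the
        # archive on the driver's sys.path), but executors later fail while
        # unpickling anything that references the package:
        from Package_Folder.Modules_Folder import Custom_Module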

    If I try to use a function from Custom_Module.py, I get the error

    ModuleNotFoundError: No module named 'Package_Folder'

    The error is buried in a longer stack trace rooted in

    Exception has occurred: Py4JJavaError
    An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServeWithJobGroup.

    which is raised during a joblibspark call

    Do you have any idea what could cause this error? Thanks!


  3. Dong Yuan 1 Reputation point
    2020-07-10T22:40:09.84+00:00

    Detailed error:

    Exception has occurred: Py4JJavaError
    An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 10.139.64.8, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
        return self.loads(obj)
      File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
        return pickle.loads(obj, encoding=encoding)
    ModuleNotFoundError: No module named 'lib222'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/databricks/spark/python/pyspark/worker.py", line 462, in main
        func, profiler, deserializer, serializer = read_command(pickleSer, infile)
      File "/databricks/spark/python/pyspark/worker.py", line 71, in read_command
        command = serializer._read_with_length(file)
      File "/databricks/spark/python/pyspark/serializers.py", line 185, in _read_with_length
        raise SerializationError("Caused by " + traceback.format_exc())
    pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
      File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
        return self.loads(obj)
      File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
        return pickle.loads(obj, encoding=encoding)
    ModuleNotFoundError: No module named 'lib222'


  4. Jonas 26 Reputation points
    2020-08-19T16:41:08.103+00:00

    A few days ago, Databricks-Connect 7.1.0 was released, which seems to have solved this issue. I don't know what caused or fixed the problem, but if anybody runs into it, try updating to a Databricks cluster with runtime version >= 7.1.0 and use the corresponding Databricks-Connect version. @Dong Yuan , can you confirm this solution?
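
    For anyone trying this, the client update itself is the usual pip routine (match the minor version to whatever your cluster runs; 7.1.* here is just the example from above):

        pip uninstall pyspark
        pip install -U "databricks-connect==7.1.*"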


  5. Gorka Esnal 1 Reputation point
    2021-06-25T09:17:52.573+00:00

    I have the same issue as well, and I'm already using databricks-connect 7.3. Is there any solution or workaround?

    Thanks,
    Gorka

