Azure databricks PySpark custom UDF ModuleNotFoundError: No module named

Question

I have the following repo on Azure Databricks:

|-run_pipeline.py  
|-__init__.py  
|-data_science  
|--__init.py__  
|--text_cleaning  
|---text_cleaning.py  
|---__init.py__

In the run_pipeline notebook I have this:

import os  
import sys  
from pyspark.sql import SparkSession  
from data_science.text_cleaning import text_cleaning  
path = os.path.join(os.path.dirname(__file__), os.pardir)  
sys.path.append(path)  
spark = SparkSession.builder.master(  
    "local[*]").appName('workflow').getOrCreate()  
  
df = text_cleaning.basic_clean(spark_df)

In text_cleaning.py I have a function called basic_clean that runs something like this:

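from pyspark.sql.functions import udf  
from pyspark.sql.types import StringType  
  
def basic_clean(df):  
    print('Removing links')  
    # _remove_links is a helper in this module (not shown here)  
    udf_remove_links = udf(_remove_links, StringType())  
    df = df.withColumn("cleaned_message", udf_remove_links("cleaned_message"))  
    return df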

When I call df.show() in the run_pipeline notebook, I get this error message:

Exception has occurred: PythonException       (note: full exception trace is shown but execution is paused at: )  
An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):  
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length  
    return self.loads(obj)  
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads  
    return pickle.loads(obj, encoding=encoding)  
ModuleNotFoundError: No module named 'data_science''. Full traceback below:  
Traceback (most recent call last):  
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length  
    return self.loads(obj)  
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads  
    return pickle.loads(obj, encoding=encoding)  
ModuleNotFoundError: No module named 'data_science'

Shouldn't the imports work? Why is this an issue?

Answer

Hi @roo ,

Thank you for posting your query on the Microsoft Q&A Platform.

From the error message, it seems the data_science module referenced in your code cannot be found. You can confirm this by running the pip list command to see which libraries are installed on the cluster.
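As a quick check alongside pip list, a minimal sketch like the following reports whether a module name resolves in the current Python environment. The helper name is_importable is hypothetical, and it only inspects the process it runs in (the notebook's driver), not the Python workers where the UDF is unpickled.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located on the current sys.path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ModuleNotFoundError, ValueError):
        # Raised when a parent package is missing or the module has no spec.
        return False


if __name__ == "__main__":
    # 'data_science' is the package name from the question.
    print("data_science importable here:", is_importable("data_science"))
```

If this prints True in the notebook while the UDF still fails, the module is probably visible to the driver only.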

Kindly consider installing the modules that your code needs in order to run. The Databricks documentation explains how to install libraries or modules on a cluster.

You can also consider running the pip install data_science command in a notebook cell.
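Since data_science in the question is a folder in the repo rather than a published package, one other way to make it available is to zip the folder and hand it to Spark with SparkContext.addPyFile. The sketch below is only an illustration under that assumption: zip_package is a hypothetical helper, and the paths in the usage comment are placeholders.

```python
import os
import zipfile


def zip_package(package_dir: str, zip_path: str) -> str:
    """Zip the .py files of a local package, keeping the package folder as the top-level entry."""
    package_dir = os.path.abspath(package_dir)
    root = os.path.dirname(package_dir)  # archive names are relative to the parent folder
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for folder, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(folder, name)
                    zf.write(full, os.path.relpath(full, root))
    return zip_path


# Usage on the driver (assumes an existing SparkSession named `spark`;
# both paths are placeholders):
# archive = zip_package("data_science", "/tmp/data_science.zip")
# spark.sparkContext.addPyFile(archive)
```

addPyFile needs to be called before the DataFrame action (for example df.show()) that runs the UDF, so the archive is shipped with the tasks that need it.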

Hope this helps. Please let me know if you have any further queries.
