Azure Databricks PySpark custom UDF ModuleNotFoundError: No module named

roo 1 Reputation point
2022-12-01T19:09:22.907+00:00

I have the following repo on Azure Databricks:

|-run_pipeline.py  
|-__init__.py  
|-data_science  
|--__init.py__  
|--text_cleaning  
|---text_cleaning.py  
|---__init.py__  

In the run_pipeline notebook I have this:

import os  
import sys  
from pyspark.sql import SparkSession  
from data_science.text_cleaning import text_cleaning  
  
path = os.path.join(os.path.dirname(__file__), os.pardir)  
sys.path.append(path)  
spark = SparkSession.builder.master(  
    "local[*]").appName('workflow').getOrCreate()  
  
# spark_df is an existing Spark DataFrame (not shown in the post)  
df = text_cleaning.basic_clean(spark_df)  

In text_cleaning.py I have a function called basic_clean that does something like this:

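from pyspark.sql.functions import udf  
from pyspark.sql.types import StringType  
  
def basic_clean(df):  
    print('Removing links')  
    # _remove_links is a plain Python helper (not shown in the post)  
    udf_remove_links = udf(_remove_links, StringType())  
    df = df.withColumn("cleaned_message", udf_remove_links("cleaned_message"))  
    return df  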

When I call df.show() in the run_pipeline notebook, I get this error message:

Exception has occurred: PythonException       (note: full exception trace is shown but execution is paused at: <module>)  
An exception was thrown from a UDF: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):  
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length  
    return self.loads(obj)  
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads  
    return pickle.loads(obj, encoding=encoding)  
ModuleNotFoundError: No module named 'data_science''. Full traceback below:  
Traceback (most recent call last):  
  File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length  
    return self.loads(obj)  
  File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads  
    return pickle.loads(obj, encoding=encoding)  
ModuleNotFoundError: No module named 'data_science'  

Shouldn't the imports work? Why is this an issue?
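
For context on the traceback above: the failing call is pickle.loads on the worker side, not the import statement in the notebook. A function that lives in an importable module is generally pickled by reference (module name plus function name), so whichever process unpickles it must be able to import that module as well. The sketch below reproduces the same class of failure using only the standard-library pickle module. The package name demo_pkg and the function clean are made up for illustration, and plain pickle stands in for PySpark's own serializer, so treat it as an analogy rather than a trace of what Databricks does.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Build a throwaway package on disk (demo_pkg is a hypothetical name).
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("def clean(text):\n    return text.strip()\n")

# "Driver" side: the package is importable, so pickling succeeds.
sys.path.insert(0, root)
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")
payload = pickle.dumps(demo_pkg.clean)

# The payload names the module and function but does not contain the body.
stores_reference_only = b"demo_pkg" in payload and b"strip" not in payload

# "Worker" side: simulate a process in which the package is not importable.
sys.path.remove(root)
del sys.modules["demo_pkg"]
importlib.invalidate_caches()

try:
    pickle.loads(payload)
    worker_error = None
except ModuleNotFoundError as exc:
    worker_error = str(exc)

print(stores_reference_only)
print(worker_error)
```

If the same reasoning applies here, the import in run_pipeline succeeds on the driver while the Python workers that execute the UDF cannot import data_science.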

Azure Databricks

1 answer

  1. ShaikMaheer-MSFT 37,896 Reputation points Microsoft Employee
    2022-12-05T05:50:07.21+00:00

    Hi @roo ,

    Thank you for posting your query on the Microsoft Q&A platform.

    From the error message, it seems that the data_science module, which your code refers to, cannot be found. You can confirm this by running pip list to see the libraries installed on the cluster.

    Kindly consider installing the modules that your code needs in order to work. Click here to learn how to install libraries or modules on a cluster.

    You can also consider running the pip install data_science command in a notebook cell.

    Hope this helps. Please let me know if you have any further queries.

    ----------------------

    Please consider hitting the Accept Answer button. Accepted answers help the community as well.
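
The answer suggests installing the missing module on the cluster. In the question, data_science is a folder in the repo rather than a published library, and pip install data_science would by default look for a package of that name on the package index, which may not be the repo's own code. One possible way to make a local package importable on the workers is to zip it and register the archive with SparkContext.addPyFile, which is an existing PySpark call. The sketch below is a minimal version of that idea; the helper names zip_package and ship_package are hypothetical, and the thread does not confirm that this resolves the error on a given Databricks runtime.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip a package directory so its folder name is the top-level entry."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for folder, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full_path = os.path.join(folder, name)
                    # Store paths relative to the parent directory,
                    # e.g. data_science/text_cleaning/text_cleaning.py
                    zf.write(full_path, os.path.relpath(full_path, parent))
    return zip_path


def ship_package(spark, package_dir, zip_path):
    """Build the archive and register it with Spark (hypothetical helper)."""
    zip_package(package_dir, zip_path)
    # addPyFile distributes a .py or .zip dependency to the executors.
    spark.sparkContext.addPyFile(zip_path)
```

A call such as ship_package(spark, "data_science", "/tmp/data_science.zip") would need to run before df.show(); both paths are placeholders, not values taken from the question.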