The issue was two-fold:
- When you package a Python module in a wheel (.whl) file and deploy it to a Databricks job, any data files bundled inside the wheel must be read with an explicit `file:` scheme. If the scheme is left unspecified, Spark prepends `dbfs:` by default and looks for the files in DBFS, where they do not exist. Adding `file:` makes Spark resolve the path on the local filesystem where the wheel is installed (see the first sketch after this list).
- When defining UDFs inside a Python wheel, avoid the decorator syntax. Decorators work fine when testing in a notebook because a Spark session is already available, but inside a wheel the decorator is evaluated at import time, before the Spark session has been initialized, so the job fails at runtime. Define a plain Python function and wrap it with `udf(...)` only after the session exists (see the second sketch below).
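
A minimal sketch of the first point. The package name `mypkg` and the bundled file `data/lookup.csv` are hypothetical stand-ins for whatever you ship in your wheel:

```python
import os
from pyspark.sql import SparkSession

import mypkg  # hypothetical package installed from the wheel

spark = SparkSession.builder.getOrCreate()

# Locate the data file on the local filesystem, next to the installed package.
local_path = os.path.join(os.path.dirname(mypkg.__file__), "data", "lookup.csv")

# Without the 'file:' prefix, Spark rewrites the path as 'dbfs:/...' and
# fails, because the file only exists inside the installed wheel.
df = spark.read.csv(f"file:{local_path}", header=True)
```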
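And a sketch of the second point, contrasting the decorator style with wrapping the function after the session is up. The function names here are illustrative, not from the original code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Problematic inside a wheel: the decorator runs at import time,
# before the job has created a Spark session.
# @udf(returnType=StringType())
# def normalize(s):
#     return s.strip().lower()

# Safer: keep the function plain at module level...
def normalize(s):
    return s.strip().lower() if s is not None else None

def main():
    spark = SparkSession.builder.getOrCreate()
    # ...and only wrap it as a UDF once the session exists.
    normalize_udf = udf(normalize, StringType())
    df = spark.createDataFrame([("  Hello ",)], ["raw"])
    df.select(normalize_udf("raw").alias("clean")).show()

if __name__ == "__main__":
    main()
```

A side benefit of keeping the function plain is that it stays unit-testable without any Spark dependency.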