AnalysisException: Path does not exist: dbfs:/databricks/python/lib/python3.7/site-packages/sampleFolder/data;

Mayuri Kadam 121 Reputation points
2021-06-30T23:28:53.013+00:00

I am packing the following code in a whl file:

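from pkg_resources import resource_filename
from pyspark.sql import DataFrame

def path_to_model(anomaly_dir_name: str, data_path: str) -> str:
    # Resolve a resource bundled in the installed package to a local filesystem path.
    filepath = resource_filename(anomaly_dir_name, data_path)
    return filepath

def read_data(spark) -> DataFrame:
    # No URI scheme is given here, so Spark resolves the path against its default filesystem.
    return spark.read.parquet(str(path_to_model("sampleFolder", "data")))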

I confirmed that the whl file contains the parquet files under the sampleFolder/data/ directory. When I run this locally it works, but when I upload the whl file to DBFS and run it, I get this error:

AnalysisException: Path does not exist: dbfs:/databricks/python/lib/python3.7/site-packages/sampleFolder/data;

I confirmed that this directory does not actually exist: dbfs:/databricks/python. Any idea what could be causing this error?

Thanks.

Azure Databricks

Accepted answer
  1. Mayuri Kadam 121 Reputation points
    2021-07-09T20:45:57.473+00:00

    The issue was twofold:

    1. When you package a Python module in a whl file and deploy it to a Databricks job, you need to specify the ‘file:’ scheme to read any data files inside the package with Spark. If the scheme is left unspecified, Spark defaults to ‘dbfs:’ and looks for the data files in DBFS, where they do not exist. It needs to look on the local filesystem, where the whl is installed.
    2. When defining UDFs in a Python whl, do not use decorators. Decorators work well when testing in a notebook, because the Spark session is already available there. In a whl they fail at runtime, because the decorator is evaluated when the module is imported, before the Spark session has been initialized.
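Both points can be sketched in one small module. This is a minimal illustration under stated assumptions, not the author's actual code: `as_local_file_uri`, `to_upper`, and `build_to_upper_udf` are hypothetical names introduced here, and the upper-casing logic is a placeholder. The first helper adds the ‘file:’ scheme to the path returned by `resource_filename`; the last one creates the UDF with a plain function call after the session exists, instead of a decorator that runs at import time.

```python
from urllib.parse import urlparse


def as_local_file_uri(path: str) -> str:
    """Return `path` with an explicit 'file:' scheme unless it already has a scheme."""
    if urlparse(path).scheme:
        return path
    return "file:" + path


def read_data(spark, package: str = "sampleFolder", resource: str = "data"):
    """Read the parquet files bundled in the installed package from the local disk."""
    # Imported lazily only so this sketch imports without setuptools; a top-level import works too.
    from pkg_resources import resource_filename

    local_path = resource_filename(package, resource)
    return spark.read.parquet(as_local_file_uri(local_path))


def to_upper(value):
    """Plain Python function; hypothetical example logic, not from the original post."""
    return value.upper() if value is not None else None


def build_to_upper_udf():
    """Wrap `to_upper` as a UDF on demand, after the SparkSession has been created."""
    # Imported lazily only so this sketch imports without pyspark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(to_upper, StringType())
```

In job code, `build_to_upper_udf()` would be called once after the Spark session has been obtained, and the result used in `select` or `withColumn`. One caveat worth checking on a multi-node cluster: a ‘file:’ path is resolved on each node, so the whl presumably has to be installed on the workers as well as on the driver.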
    1 person found this answer helpful.

1 additional answer

  1. PRADEEPCHEEKATLA 90,641 Reputation points Moderator
    2021-07-01T09:05:21.66+00:00

    Hello @Mayuri Kadam ,

    Thanks for the question and for using the MS Q&A platform.

    You are seeing this error because the path doesn't exist.

    Make sure you have uploaded the file to DBFS and are passing the exact path of the whl file.

    Spark API Format - dbfs:/sampleFolder/data  
    File API Format - /dbfs/sampleFolder/data  
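The two formats above name the same DBFS location, and converting between them is a prefix swap. The helpers below are a minimal sketch of that mapping; the function names are hypothetical and are not part of any Databricks API.

```python
def spark_to_file_api(path: str) -> str:
    """Convert a Spark API path (dbfs:/...) to the File API form (/dbfs/...)."""
    prefix = "dbfs:/"
    if not path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}: {path!r}")
    return "/dbfs/" + path[len(prefix):].lstrip("/")


def file_api_to_spark(path: str) -> str:
    """Convert a File API path (/dbfs/...) to the Spark API form (dbfs:/...)."""
    prefix = "/dbfs/"
    if not path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}: {path!r}")
    return "dbfs:/" + path[len(prefix):]
```

For example, `spark.read.parquet` would take the `dbfs:/` form, while Python's built-in `open` or `os.listdir` on the driver would take the `/dbfs/` form.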
    

    You may check out the answer provided by @Alex Aguilar on your SO thread.

    Hope this helps. Do let us know if you have any further queries.

    ---------------------------------------------------------------------------

    Please "Accept the answer" if the information helped you. This will help us and others in the community as well.

    1 person found this answer helpful.
