FileNotFoundException when accessing abfss on databricks

Millet, Aymeric (KACES) 0 Reputation points
2023-02-09T15:18:09.4533333+00:00

Hello,

I am seeing strange behavior when checking for the existence of a directory on Azure Storage using the abfss connector.

I am using the sample code below:

import org.apache.hadoop.fs._
import org.apache.hadoop.conf.Configuration

val dir = "abfss://<my_container>@<my_storage_account>.dfs.core.windows.net/data/fmdp/stream-service/output/"

val path = new Path(dir)
val fs2 = path.getFileSystem(spark.sparkContext.hadoopConfiguration) 
fs2.listStatus(path)    // This is OK

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(path)     // This is KO (throws FileNotFoundException)

The Spark configuration for accessing the Azure storage account is set correctly, and "fs2.listStatus(path)" returns the expected result.

But "fs.listStatus(path)" throws a FileNotFoundException: "FileNotFoundException: File /1140974070106206/data/fmdp/stream-service/output does not exist." Note that the path in the error message is wrong: abfss://<my_container>@<my_storage_account>.dfs.core.windows.net has been replaced by /1140974070106206.

Why? What is the difference between "FileSystem.get" and "path.getFileSystem"?

One works and the other does not!

Unfortunately, we are porting existing Spark Java code to Databricks. That code uses FileSystem.get, and it fails to check the existence of files and directories. I am trying to understand why.

Regards

Azure Databricks
1 answer

  1. KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator
    2023-02-14T00:26:02.3466667+00:00

    Hi Millet, Aymeric (KACES),

    Welcome to Microsoft Q&A forum and thanks for posting your query.

    The difference between the two methods is that "FileSystem.get(conf)" returns the filesystem for the cluster's default URI (the "fs.defaultFS" setting, which on Databricks points to DBFS), while "path.getFileSystem(conf)" resolves the filesystem from the scheme and authority of the path itself (here abfss://<my_container>@<my_storage_account>.dfs.core.windows.net).
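    You can see this difference directly on the cluster by comparing the URIs of the two filesystem objects. This is a sketch assuming the `spark` session and the `path` value from your question above:

```scala
import org.apache.hadoop.fs.FileSystem

val conf = spark.sparkContext.hadoopConfiguration

val defaultFs = FileSystem.get(conf)   // resolved from fs.defaultFS
val pathFs = path.getFileSystem(conf)  // resolved from the path's own scheme

println(defaultFs.getUri)  // the cluster's default filesystem URI (DBFS)
println(pathFs.getUri)     // abfss://<my_container>@<my_storage_account>.dfs.core.windows.net
```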

    In your case, "FileSystem.get" is not returning an ABFS filesystem at all: because no URI is passed, it returns the default filesystem, which interprets your path relative to its own root. That is why the abfss scheme and authority disappear from the error message and a FileNotFoundException is thrown even though the directory exists.

    When working with the abfss connector, it is recommended to use "path.getFileSystem" (or the "FileSystem.get(URI, Configuration)" overload), so that the filesystem is resolved from the path's own scheme rather than from "fs.defaultFS".
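    As a minimal sketch for porting your existing code (the container, account, and path below are the placeholders from your question), passing the path's URI to FileSystem.get is equivalent to calling path.getFileSystem:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val dir = "abfss://<my_container>@<my_storage_account>.dfs.core.windows.net/data/fmdp/stream-service/output/"
val path = new Path(dir)
val conf = spark.sparkContext.hadoopConfiguration

// Resolve the filesystem from the path's scheme (abfss), not from fs.defaultFS.
// These two calls return an equivalent ABFS filesystem instance:
val fsFromPath = path.getFileSystem(conf)
val fsFromUri  = FileSystem.get(path.toUri, conf)

// The existence check now targets the ABFS filesystem, not DBFS:
fsFromUri.exists(path)
```

    This keeps the FileSystem.get call shape of your existing Java code while still honoring the abfss URI.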

    For more helpful info, please refer to this SO thread: Hadoop: Path.getFileSystem vs FileSystem.get

    Hope this info helps.


    Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, this can be beneficial to other community members.

