ADLS access failed using Spark HDFS

P, John 240 Reputation points
2024-02-20T05:11:36.1933333+00:00

Right now, I am using a SAS token to access our data stored on ADLS Gen2 storage:

```python
spark.conf.set("fs.azure.account.auth.type.gsdevtest.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.gsdevtest.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.gsdevtest.dfs.core.windows.net", "<sas-token>")

df = spark.read.option("ignoreCorruptFiles", "true").parquet("abfss://<container>@gsdevtest.dfs.core.windows.net/data/ehits/2024/2/10.parquet/")
df.printSchema()
df.show(10)
```

The above code runs fine, and I can see the rows of the data frame. Then I want to use the Hadoop FileSystem API with ADLS:

```python
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.defaultFS", "abfss://<container>@gsdevtest.dfs.core.windows.net/")
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

adls_path = "abfss://<container>@gsdevtest.dfs.core.windows.net/"
directory_name = "data/"
directory_path = adls_path + directory_name
print(directory_path)
dir_status = fs.exists(sc._jvm.org.apache.hadoop.fs.Path(directory_path))
```

The above code gave me the following 403 error for the fs.exists() call:

```
Py4JJavaError: An error occurred while calling o531.exists.
: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, https://gsdevtest.dfs.core.windows.net/gsdb/data?upn=false&action=getStatus&timeout=90
```

Here I have the same SAS configuration, and the first part of the code clearly shows I can read the data without a problem. Why did Azure reject my access with the same SAS configuration later on?

Azure Data Lake Storage

Accepted answer
  1. RevelinoB 3,675 Reputation points
    2024-02-20T05:30:08.59+00:00

    Hi John,

    Looking at your issue, the 403 error you're encountering when calling the fs.exists() method on Azure Data Lake Storage Gen2 (ADLS Gen2) with a Shared Access Signature (SAS) token in your Spark session seems to stem from how the SAS token is being applied (or not applied) for the Hadoop FileSystem operations.

    In the first part of your code, where you read data from ADLS Gen2, you set the SAS token configuration directly on the Spark session configuration. This works well for Spark's built-in data source APIs (like spark.read.parquet) because Spark uses these settings to authenticate requests to Azure Storage. However, when you use the Hadoop FileSystem API directly (like fs.exists()), the SAS token configuration needs to be set explicitly in the Hadoop Configuration used by the Spark session, because direct Hadoop FileSystem operations may not automatically pick up the SAS token settings from the Spark configuration.

    Possible Solution

    To ensure that the Hadoop FileSystem API can authenticate using the SAS token, explicitly set the necessary properties on the Hadoop Configuration used by your Spark session, similar to how you set them in the Spark configuration:

    ```python
    hadoop_conf = spark._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.azure.account.auth.type.gsdevtest.dfs.core.windows.net", "SAS")
    hadoop_conf.set("fs.azure.sas.token.provider.type.gsdevtest.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
    hadoop_conf.set("fs.azure.sas.fixed.token.gsdevtest.dfs.core.windows.net", "<your_sas_token>")
    ```

    Make sure you replace <your_sas_token> with your actual SAS token, and avoid hard-coding the token in your code for security reasons.

    Additional Steps

    • Ensure Correct Scope: Verify that the SAS token has the appropriate permissions for the operations you are attempting to perform. For fs.exists(), at minimum, read permissions on the directory or file are necessary (see the quick-check sketch after this list).
    • Check SAS Token Expiry: Ensure that the SAS token hasn't expired. An expired token will result in a 403 error.
    • URL Encoding: Confirm that the SAS token is correctly URL-encoded. Sometimes, issues can arise if special characters in the SAS token are not properly encoded.
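
    For a quick sanity check of the first two points, here is a minimal sketch that inspects the sp (permissions) and se (expiry) fields of a SAS token string. The token value shown is a made-up example shape only; use whatever value you pass to fs.azure.sas.fixed.token.*.

    ```python
    # Illustrative sketch: inspect a SAS token's permissions and expiry fields.
    from datetime import datetime, timezone
    from urllib.parse import parse_qs

    sas_token = "sv=2022-11-02&ss=b&srt=sco&sp=rl&se=2024-03-01T00:00:00Z&sig=..."  # example shape only

    fields = parse_qs(sas_token.lstrip("?"))
    permissions = fields.get("sp", [""])[0]   # e.g. "rl" = read + list
    expiry = fields.get("se", [""])[0]        # e.g. "2024-03-01T00:00:00Z"

    print("permissions:", permissions)
    print("expiry:", expiry)

    # fs.exists() issues a getStatus (HEAD) request, so read/list-style permissions are what to look for.
    if "r" not in permissions or "l" not in permissions:
        print("Warning: token may lack read/list permissions for directory status calls.")
    if expiry and datetime.fromisoformat(expiry.replace("Z", "+00:00")) < datetime.now(timezone.utc):
        print("Warning: token appears to be expired.")
    ```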

    By explicitly setting the SAS token within the Hadoop configuration context used by your Spark session, you should be able to authenticate the fs.exists() call and other Hadoop FileSystem operations against ADLS Gen2, as sketched below. I hope this helps; if you have any questions, please let me know.
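
    A minimal end-to-end sketch of the fs.exists() call once the Hadoop-side SAS settings are in place, assuming the gsdevtest account and data/ directory from the question; <container> and <your_sas_token> are placeholders:

    ```python
    # Illustrative sketch: Hadoop-side SAS settings followed by the fs.exists() call.
    sc = spark.sparkContext
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.azure.account.auth.type.gsdevtest.dfs.core.windows.net", "SAS")
    hadoop_conf.set("fs.azure.sas.token.provider.type.gsdevtest.dfs.core.windows.net",
                    "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
    hadoop_conf.set("fs.azure.sas.fixed.token.gsdevtest.dfs.core.windows.net", "<your_sas_token>")

    # Resolve the FileSystem for the abfss path explicitly rather than relying on fs.defaultFS.
    Path = sc._jvm.org.apache.hadoop.fs.Path
    directory_path = Path("abfss://<container>@gsdevtest.dfs.core.windows.net/data/")
    fs = directory_path.getFileSystem(hadoop_conf)

    print(fs.exists(directory_path))  # should print True once the SAS token is accepted
    ```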

    1 person found this answer helpful.

1 additional answer

  1. P, John 240 Reputation points
    2024-02-20T19:49:53.9633333+00:00

    These documents are really helpful! Thanks! One last question: if I mount the ADLS Gen2 storage to Databricks (say dbfs:/mnt/mystorage/), can I treat the mounted storage as a local drive and access files with plain vanilla Python code? If so, why do people want to use the Hadoop FileSystem API? Is it for performance?

