Hi John,
Looking at your issue, the 403 error you're encountering when calling the fs.exists() method on Azure Data Lake Storage Gen2 (ADLS Gen2) with a Shared Access Signature (SAS) token in your Spark session seems to stem from how the SAS token is being applied (or not applied) for the Hadoop FileSystem operations.
In the first part of your code, where you're reading data from ADLS Gen2, you're setting the SAS token configuration directly on the Spark session. This works well for Spark's built-in data source APIs (like spark.read.parquet) because Spark correctly applies these settings when authenticating requests to Azure Storage.
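For context, the Spark-side setup that works for your reads typically looks something like the following sketch; the container, path, and sas_token variable are placeholders, and the account name is taken from your configuration:

spark.conf.set("fs.azure.account.auth.type.gsdevtest.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.gsdevtest.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.gsdevtest.dfs.core.windows.net", sas_token)
df = spark.read.parquet("abfss://<container>@gsdevtest.dfs.core.windows.net/<path>")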
However, when you use the Hadoop FileSystem API directly (like fs.exists()), the SAS token configuration needs to be explicitly set in the Hadoop Configuration used by the Spark session, because direct Hadoop FileSystem operations might not automatically pick up the SAS token settings from the Spark configuration.
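To illustrate, a direct FileSystem call in PySpark goes through the JVM gateway and authenticates with whatever Hadoop Configuration it was constructed from, not with the Spark session settings. A minimal sketch (container and path are placeholders):

path = spark._jvm.org.apache.hadoop.fs.Path("abfss://<container>@gsdevtest.dfs.core.windows.net/<path>")
fs = path.getFileSystem(spark._jsc.hadoopConfiguration())
fs.exists(path)  # raises a 403 if this Hadoop configuration has no SAS token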
Possible Solution
To ensure that the Hadoop FileSystem API can authenticate using the SAS token, explicitly set the necessary properties on the Hadoop Configuration used by the Spark session, mirroring what you have on the Spark side:

hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.azure.account.auth.type.gsdevtest.dfs.core.windows.net", "SAS")
hadoop_conf.set("fs.azure.sas.token.provider.type.gsdevtest.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
hadoop_conf.set("fs.azure.sas.fixed.token.gsdevtest.dfs.core.windows.net", "<your_sas_token>")
Make sure you replace <your_sas_token> with your actual SAS token, and keep in mind not to expose the token in your code for security reasons; read it from a secret store or environment variable instead. Also note that FixedSASTokenProvider only ships with newer versions of the hadoop-azure (ABFS) driver, so if the class cannot be found at runtime, your Hadoop version may be too old.
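Once those properties are set, you can verify the fix by re-running the existence check. One caveat: Hadoop caches FileSystem instances per scheme and authority, so a FileSystem created before the configuration change may be handed back with the old settings; disabling the cache for abfss, as sketched below with placeholder paths, sidesteps this:

hadoop_conf.set("fs.abfss.impl.disable.cache", "true")  # avoid reusing a cached FileSystem
path = spark._jvm.org.apache.hadoop.fs.Path("abfss://<container>@gsdevtest.dfs.core.windows.net/<path>")
fs = path.getFileSystem(hadoop_conf)
print(fs.exists(path))  # should now return True/False instead of raising a 403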
Additional Steps
- Ensure Correct Scope: Verify that the SAS token has the appropriate permissions for the operations you are attempting to perform. For fs.exists(), at minimum, read permission on the directory or file is necessary.
- Check SAS Token Expiry: Ensure that the SAS token hasn't expired; an expired token will also result in a 403 error (a quick way to check this and the permissions is sketched after this list).
- URL Encoding: Confirm that the SAS token is correctly URL-encoded. Issues can arise if special characters in the SAS token are not properly encoded.
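As a sanity check for the first two points, you can parse the token's query parameters; sp (permissions) and se (expiry) are standard SAS parameters. A minimal sketch:

from urllib.parse import parse_qs
from datetime import datetime, timezone

def inspect_sas(token):
    params = parse_qs(token.lstrip("?"))
    print("permissions (sp):", params.get("sp", ["<missing>"])[0])
    expiry = params.get("se", [None])[0]
    if expiry:
        expires_at = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        print("expired:", expires_at < datetime.now(timezone.utc))

inspect_sas("<your_sas_token>")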
By explicitly setting the SAS token within the Hadoop configuration context used by your Spark session, you should be able to authenticate the fs.exists()
call and other Hadoop FileSystem operations against ADLS Gen2.
I hope this helps. If you have any questions, please let me know.