ADLS Blob Storage connection from PySpark fails because it cannot find dependent JARs

SHYAMALA GOWRI 90 Reputation points
2024-04-04T16:16:29.7833333+00:00

I am trying to execute the code below in PySpark (my Spark version is 3.5.1), launched with this command:

pyspark --jars hadoop-azure-3.2.1.jar,azure-storage-8.6.4.jar, jetty-util-ajax-12.0.7.jar, jetty-util-12.0.7.jar

It fails with the following error, and I need your advice on organising the classpaths here:

  File "/usr/local/Cellar/apache-spark/3.5.1/libexec/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o43.load.
: java.lang.NoClassDefFoundError: org/eclipse/jetty/util/log/Log
	at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createPermissionJsonSerializer(AzureNativeFileSystemStore.java:406)
	at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.<clinit>(AzureNativeFileSystemStore.java:310)
	at org.apache.hadoop.fs.azure.NativeAzureFileSystem.createDefaultStore(NativeAzureFileSystem.java:1404)
	at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1331)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
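
A note on the launch command itself: --jars takes a single comma-separated list with no spaces, so with the spaces shown above the shell splits the list and the trailing Jetty jars never reach the classpath. Separately, the missing class org.eclipse.jetty.util.log.Log was removed in Jetty 10 and later, so the jetty-util 12.0.7 jars cannot supply it; the stack trace shows hadoop-azure 3.2.1 still references it, which means a Jetty 9.x jetty-util is needed. A hedged sketch of a corrected launch command (the 9.4.48.v20220622 version is an assumption; match the Jetty version your Hadoop 3.2.1 jars were built against):

# One comma-separated list, no spaces; Jetty 9.x still ships org.eclipse.jetty.util.log.Log
pyspark --jars hadoop-azure-3.2.1.jar,azure-storage-8.6.4.jar,jetty-util-ajax-9.4.48.v20220622.jar,jetty-util-9.4.48.v20220622.jar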

My PySpark code is:

from pyspark.sql import SparkSession
#Get spark session
spark = SparkSession.builder.appName("Test Spark App").getOrCreate()
#Set config params described above
spark.conf.set("fs.azure.sas.sparkadlsiae.blob.core.windows.net", "sv=2022-11-02&ss=bfqt&srt=o&sp=rwdlacupyx&se=2024-04-04T01:46:06Z&st=2024-04-03T17:46:06Z&spr=https&sig=abcd%3D")
#params
storage_account_name= "sparkadlsiae"
container_name = "pyspark"
file = "people.csv"
#Read from the adls location
path = "wasbs://"+container_name+"@"+storage_account_name+".blob.core.windows.net/"+file
spark.read.format("csv").load(path).show()

Accepted answer

    Anand Prakash Yadav · 7,855 Reputation points · Microsoft External Staff
    2024-04-08T11:20:24.4433333+00:00

    Hello SHYAMALA GOWRI,

    Thank you for posting your query here!

    As per the error message, it seems that Azure Blob Storage is unable to authenticate using the provided Shared Access Signature (SAS) token.

    Please make sure that the SAS token you’re using is correct and has not expired.

    When using a SAS token, you should set the configuration as follows:

    spark.conf.set(
        "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net",
        "<your-sas-token>"
    )
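
    Applied to the names in the question (account sparkadlsiae, container pyspark), the key would look like the snippet below; note the container segment, which the configuration in the question omits. The SAS value itself is elided here:

    spark.conf.set(
        "fs.azure.sas.pyspark.sparkadlsiae.blob.core.windows.net",
        "<your-sas-token>"
    )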
    

    Also, the blob file path should be in the following format:

    wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (
        blob_container_name, blob_account_name, blob_relative_path
    )
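
    Putting the two together with the values from the question (and assuming the SAS configuration above has already been set on the session), the read would look like:

    blob_account_name = "sparkadlsiae"
    blob_container_name = "pyspark"
    blob_relative_path = "people.csv"

    # wasbs://<container>@<account>.blob.core.windows.net/<path>
    wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (
        blob_container_name, blob_account_name, blob_relative_path
    )
    spark.read.format("csv").load(wasbs_path).show()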
    

    Here are a few posts on similar queries that might help:

    https://stackoverflow.com/questions/72040966/getting-authentication-error-while-accessing-azure-blob-tables-using-pyspark

    https://stackoverflow.com/questions/69860096/reading-data-from-blob-without-accountkey-with-pyspark

    I hope this helps! Please let me know if the issue persists or if you have any other questions.

    1 person found this answer helpful.

0 additional answers