ADLS Blob Storage connection from PySpark fails because it cannot find dependent JARs

SHYAMALA GOWRI 90 Reputation points
2024-04-04T16:16:29.7833333+00:00

I am trying to execute the code below in PySpark (my Spark version is 3.5.1), launched with this command:

pyspark --jars hadoop-azure-3.2.1.jar,azure-storage-8.6.4.jar, jetty-util-ajax-12.0.7.jar, jetty-util-12.0.7.jar

It fails with the following error, and I need your advice on organising the classpaths here:

  File "/usr/local/Cellar/apache-spark/3.5.1/libexec/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o43.load.
: java.lang.NoClassDefFoundError: org/eclipse/jetty/util/log/Log
	at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createPermissionJsonSerializer(AzureNativeFileSystemStore.java:406)
	at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.<clinit>(AzureNativeFileSystemStore.java:310)
	at org.apache.hadoop.fs.azure.NativeAzureFileSystem.createDefaultStore(NativeAzureFileSystem.java:1404)
	at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1331)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
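
A note on the launch command itself: --jars takes a single comma-separated list with no spaces, so with the spaces shown above the shell splits the list and the trailing Jetty jars never reach the classpath. Separately, the missing class org.eclipse.jetty.util.log.Log was removed in Jetty 10 and later, so the jetty-util 12.0.7 jars cannot supply it; the stack trace shows hadoop-azure 3.2.1 still references it, which means a Jetty 9.x jetty-util is needed. A hedged sketch of a corrected launch command (the 9.4.48.v20220622 version is an assumption; match the Jetty version your Hadoop 3.2.1 jars were built against):

# One comma-separated list, no spaces; Jetty 9.x still ships org.eclipse.jetty.util.log.Log
pyspark --jars hadoop-azure-3.2.1.jar,azure-storage-8.6.4.jar,jetty-util-ajax-9.4.48.v20220622.jar,jetty-util-9.4.48.v20220622.jar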

My PySpark code is:

from pyspark.sql import SparkSession
#Get spark session
spark = SparkSession.builder.appName("Test Spark App").getOrCreate()
#Set config params described above
spark.conf.set("fs.azure.sas.sparkadlsiae.blob.core.windows.net", "sv=2022-11-02&ss=bfqt&srt=o&sp=rwdlacupyx&se=2024-04-04T01:46:06Z&st=2024-04-03T17:46:06Z&spr=https&sig=abcd%3D")
#params
storage_account_name= "sparkadlsiae"
container_name = "pyspark"
file = "people.csv"
#Read from the adls location
path = "wasbs://"+container_name+"@"+storage_account_name+".blob.core.windows.net/"+file
spark.read.format("csv").load(path).show()

Accepted answer

    Anand Prakash Yadav · 7,855 Reputation points · Microsoft External Staff
    2024-04-08T11:20:24.4433333+00:00

    Hello SHYAMALA GOWRI,

    Thank you for posting your query here!

    As per the error message, it seems that Azure Blob Storage is unable to authenticate using the provided Shared Access Signature (SAS) token.

    Please make sure that the SAS token you’re using is correct and has not expired.

    When using a SAS token, you should set the configuration as follows:

    spark.conf.set(
        "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net",
        "<your-sas-token>"
    )
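
    Applied to the names in the question (account sparkadlsiae, container pyspark), the key would look like the snippet below; note the container segment, which the configuration in the question omits. The SAS value itself is elided here:

    spark.conf.set(
        "fs.azure.sas.pyspark.sparkadlsiae.blob.core.windows.net",
        "<your-sas-token>"
    )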
    

    Also, the blob file path should be in the following format:

    wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (
        blob_container_name, blob_account_name, blob_relative_path
    )
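
    Putting the two together with the values from the question (and assuming the SAS configuration above has already been set on the session), the read would look like:

    blob_account_name = "sparkadlsiae"
    blob_container_name = "pyspark"
    blob_relative_path = "people.csv"

    # wasbs://<container>@<account>.blob.core.windows.net/<path>
    wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (
        blob_container_name, blob_account_name, blob_relative_path
    )
    spark.read.format("csv").load(wasbs_path).show()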
    

    Here are a few posts on similar queries that might help:

    https://stackoverflow.com/questions/72040966/getting-authentication-error-while-accessing-azure-blob-tables-using-pyspark

    https://stackoverflow.com/questions/69860096/reading-data-from-blob-without-accountkey-with-pyspark

    I hope this helps! Please let me know if the issue persists or if you have any other questions.

    1 person found this answer helpful.

0 additional answers