Error when running notebook code with pyspark in a virtual machine

Michael C 0 Reputation points
2024-01-24T23:13:09.0766667+00:00

I am running code that uses PySpark to access files in Blob Storage. The code works in an Azure notebook with a serverless Spark instance, but when I run the same code as a Python script on a virtual machine I created, it fails. One error I receive is:

24/01/24 22:57:54 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: wasbs://*****@tdp.blob.core.windows.net/41/4827/.java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure not found

How can I run a spark session that accesses blob storage in an Azure virtual machine?

Azure Virtual Machines
An Azure service that is used to provision Windows and Linux virtual machines.

2 answers

  1. Silvia Wibowo 6,041 Reputation points Microsoft Employee Volunteer Moderator
    2024-01-28T22:02:08.1066667+00:00

    Hi @Michael C, I think you need to copy the azure-storage and hadoop-azure JARs locally, as described in the answers to this Stack Overflow question: https://stackoverflow.com/questions/38254771/spark-shell-error-no-filesystem-for-scheme-wasb
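    As an alternative to copying the JARs by hand, a minimal sketch of pulling them in from the Python side looks like the following. The hadoop-azure version (3.3.4), the storage account, container, and key are all placeholders and assumptions here; match the hadoop-azure version to the Hadoop build bundled with your Spark install.

    ```python
    from pyspark.sql import SparkSession

    # <account>, <container>, and <storage-account-key> are placeholders.
    spark = (
        SparkSession.builder
        .appName("blob-access")
        # Fetches hadoop-azure (which provides NativeAzureFileSystem, the
        # class missing in the error) plus its azure-storage dependency
        # from Maven Central instead of copying JARs manually.
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.3.4")
        .config(
            "spark.hadoop.fs.azure.account.key.<account>.blob.core.windows.net",
            "<storage-account-key>",
        )
        .getOrCreate()
    )

    df = spark.read.parquet(
        "wasbs://<container>@<account>.blob.core.windows.net/path/to/data"
    )
    ```

    `spark.jars.packages` also resolves transitive dependencies, which is why it is often less error-prone than placing individual JARs on the classpath.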

    Please accept an answer if correct. Original posters help the community find answers faster by identifying the correct answer.


  2. Prrudram-MSFT 28,201 Reputation points Moderator
    2024-02-09T00:49:02.0933333+00:00

    Hi MichaelCampbell-7305

    Error: WARN fs.FileSystem: Failed to initialize filesystem wasb:///: java.lang.IllegalArgumentException: Cannot initialize WASB file system, URI authority not recognized. -ls: Cannot initialize WASB file system, URI authority not recognized.

    This error occurs because the hadoop fs command cannot recognize the wasb protocol. To fix this, you need to add the hadoop-azure JAR file to the classpath of your Hadoop installation. Here are the high-level steps:

    1. Download the hadoop-azure JAR file from the Apache Hadoop website. Make sure to download the version that matches the Hadoop version you are using.
    2. Copy the JAR file to the lib directory of your Hadoop installation. For example, if your Hadoop installation is located at /usr/local/hadoop, you can copy the JAR file to /usr/local/hadoop/lib.
    3. Set the HADOOP_CLASSPATH environment variable to include the path to the hadoop-azure JAR file.
    4. Reload your .bashrc file (for example, with source ~/.bashrc). This will make the HADOOP_CLASSPATH environment variable available in your current terminal session.
    5. Try running the hadoop fs command again.
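    The steps above can be sketched as shell commands. The Hadoop version (3.3.4), the installation path, and the account/container names are examples and assumptions; adjust them to your installation.

    ```shell
    # 1. Download hadoop-azure matching your Hadoop version (3.3.4 is an example).
    wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.3.4/hadoop-azure-3.3.4.jar

    # 2. Copy it into your Hadoop installation's lib directory (example path).
    sudo cp hadoop-azure-3.3.4.jar /usr/local/hadoop/lib/

    # 3. Add it to HADOOP_CLASSPATH, appending to ~/.bashrc so it persists.
    echo 'export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/local/hadoop/lib/hadoop-azure-3.3.4.jar' >> ~/.bashrc

    # 4. Reload .bashrc in the current terminal session.
    source ~/.bashrc

    # 5. Retry the command that failed (placeholders for container/account).
    hadoop fs -ls wasbs://<container>@<account>.blob.core.windows.net/
    ```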

    If you still get an error, I recommend checking with support.

