Files not getting saved in Azure blob using Spark in HDInsights cluster

Saif Ahmad 21 Reputation points
2022-06-14T11:44:29.873+00:00

We've set up an HDInsight cluster on Azure with Blob storage as the storage for Hadoop. We uploaded files to Hadoop using the hadoop CLI, and the files were uploaded to Azure Blob storage as expected.

Command used to upload:

hadoop fs -put somefile /testlocation

However, when we used Spark to write files to Hadoop, they were not uploaded to Azure Blob storage. Instead, they were written to the disks of the VMs, in the directory specified in hdfs-site.xml for the datanode.
Code used:

df1mparquet = spark.read.parquet("hdfs://hostname:8020/dataSet/parquet/")
df1mparquet.write.parquet("hdfs://hostname:8020/dataSet/newlocation/")

Strange behavior:
When we run:

hadoop fs -ls / => It lists the files from Azure Blob storage    
hadoop fs -ls hdfs://hostname:8020/ => It lists the files from local storage  

Is this expected behavior?

Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.
Azure HDInsight
An Azure managed cluster service for open-source analytics.

Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 76,746 Reputation points Microsoft Employee
    2022-06-15T05:38:24.537+00:00

    Hello @Saif Ahmad ,

    Thanks for the question and for using the Microsoft Q&A platform.

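    This is expected behavior in Azure HDInsight. The cluster's default filesystem (fs.defaultFS) points to the Azure Storage account, so an unqualified path such as / resolves to Blob storage, whereas hdfs://hostname:8020/ explicitly addresses the HDFS instance backed by the local disks of the VMs.

    As a best practice, do not store data on the local disks of the VMs. Save files to the Azure Storage account instead.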
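
To illustrate why the two `hadoop fs -ls` commands in the question list different files, here is a minimal sketch of how a path is resolved against the default filesystem. It is a simplified model, not Hadoop's actual implementation. The function name `qualify` and the container and account names `mycontainer` and `myaccount` are hypothetical placeholders.

```python
from urllib.parse import urlparse

# Hypothetical value of fs.defaultFS on an HDInsight cluster backed by Blob storage.
DEFAULT_FS = "wasbs://mycontainer@myaccount.blob.core.windows.net"


def qualify(path: str, default_fs: str = DEFAULT_FS) -> str:
    """Simplified model of resolving a path against fs.defaultFS."""
    parsed = urlparse(path)
    if parsed.scheme and parsed.netloc:
        # Fully qualified URI: the default filesystem is not consulted.
        return path
    default = urlparse(default_fs)
    if parsed.scheme and parsed.scheme != default.scheme:
        # A scheme without a host cannot borrow the default's authority;
        # this sketch refuses to guess.
        raise ValueError(f"incomplete URI: {path}")
    # No scheme (or same scheme with no authority): use the default filesystem.
    return f"{default.scheme}://{default.netloc}{parsed.path}"


print(qualify("/dataSet/newlocation/"))
print(qualify("hdfs://hostname:8020/dataSet/newlocation/"))
```

Under this model, the first call resolves to the storage account, while the second keeps the explicit `hdfs://hostname:8020` authority and therefore targets the local HDFS.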


    There are several ways to access files in Azure Blob storage from an HDInsight cluster. The URI scheme provides unencrypted access (with the wasb: prefix) and TLS-encrypted access (with the wasbs: prefix). We recommend using wasbs wherever possible, even when accessing data that lives in the same Azure region.

    • Using the fully qualified name. With this approach, you provide the full path to the file that you want to access.
      wasb://<containername>@<accountname>.blob.core.windows.net/<file.path>/
      wasbs://<containername>@<accountname>.blob.core.windows.net/<file.path>/
    • Using the shortened path format. With this approach, you replace the path up to the cluster root with:
      wasb:///<file.path>/
      wasbs:///<file.path>/
    • Using the relative path. With this approach, you only provide the relative path to the file that you want to access.
      /<file.path>/
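
Applied to the Spark code in the question, the path formats above could be used roughly as follows. This is a sketch under assumptions: the helper names `wasbs_uri` and `copy_parquet` are hypothetical, `mycontainer` and `myaccount` stand in for the real container and storage account, and `spark` is assumed to be the session already available in the Spark shell or notebook.

```python
def wasbs_uri(container: str, account: str, path: str) -> str:
    """Build a fully qualified, TLS-encrypted Azure Blob storage URI."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def copy_parquet(spark, src: str, dst: str) -> None:
    """Read a Parquet dataset from src and write it to dst."""
    spark.read.parquet(src).write.parquet(dst)


# Hypothetical destination in the cluster's storage account.
destination = wasbs_uri("mycontainer", "myaccount", "/dataSet/newlocation/")
print(destination)

# On the cluster, with an existing SparkSession named `spark`:
# copy_parquet(spark, "hdfs://hostname:8020/dataSet/parquet/", destination)
# or, relying on the default filesystem with a relative path:
# copy_parquet(spark, "/dataSet/parquet/", "/dataSet/newlocation/")
```

The commented first call would read the dataset that currently sits on the local HDFS and write it to Blob storage. The second works only if the source data already lives in the default storage account.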

    For more details, refer to Azure HDInsight - Access files from within cluster and Run Apache Spark from the Spark Shell.

    Hope this helps. Please let us know if you have any further queries.

