HDFS FileSystem utility does not support multiple containers/storage accounts

Senapathy, Kumaraswamy 21 Reputation points
2021-02-10T21:28:00.16+00:00

Hi Team,

I am facing a problem renaming a file location (path) from one container to another container using the rename function of the Hadoop (HDI) FileSystem utility (https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html). I have learned that it accesses only one container at a time, the one set in the Hadoop configuration (fs.defaultFS property).
Is there a simple modification to make this work?
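For context, the container and storage account together form the authority part of an abfs/wasb URI, and that authority is what identifies a FileSystem instance, so two containers are two distinct filesystems and a single-filesystem rename() cannot cross them. A minimal sketch of this (the account and container names here are made up):

```java
import java.net.URI;

public class AuthorityCheck {
  public static void main(String[] args) {
    // Hypothetical paths: the authority part (container@account) names the filesystem.
    URI src = URI.create("abfs://container1@myaccount.dfs.core.windows.net/dir/file1");
    URI dst = URI.create("abfs://container2@myaccount.dfs.core.windows.net/dir/file1");
    System.out.println(src.getAuthority()); // container1@myaccount.dfs.core.windows.net
    System.out.println(dst.getAuthority()); // container2@myaccount.dfs.core.windows.net
    // Different authorities mean different FileSystem instances, so a
    // single-filesystem rename() cannot move data between them.
    System.out.println(src.getAuthority().equals(dst.getAuthority())); // false
  }
}
```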

I know that we have the azure-storage-blob and azure-storage-common packages available for Java, and the links below point in that direction.
https://learn.microsoft.com/en-us/java/api/com.azure.storage.blob.specialized.blobasyncclientbase.begincopy?view=azure-java-stable
https://stackoverflow.com/questions/61117069/how-to-move-azure-blob-from-one-storage-container-to-another-using-java-rest

Before doing the Azure blob copy, I would like to know whether this can be accomplished by a Hadoop configuration change, or whether any other faster and more optimized utility is available.

Kindly guide on the above request.

Note:
The hdfs dfs -ls / command lists the default-FS storage account, and hdfs dfs -ls abfs://container@storageaccount/ lists the data stored in the respective blob storage. But the same does not work through the FileSystem Java utility.
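Presumably the shell commands work because the cluster's core-site.xml already carries credentials for those accounts. For the FileSystem Java utility to open paths in a second storage account, that account's key must be configured as well. A sketch of the relevant core-site.xml entries for ABFS (the account names and key values below are placeholders):

```xml
<!-- One entry per storage account; names and keys are placeholders. -->
<property>
  <name>fs.azure.account.key.storageaccount1.dfs.core.windows.net</name>
  <value>STORAGE-ACCOUNT-1-KEY</value>
</property>
<property>
  <name>fs.azure.account.key.storageaccount2.dfs.core.windows.net</name>
  <value>STORAGE-ACCOUNT-2-KEY</value>
</property>
```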

Azure HDInsight
An Azure managed cluster service for open-source analytics.

Accepted answer
  1. KranthiPakala-MSFT 46,432 Reputation points Microsoft Employee
    2021-02-18T21:18:05.02+00:00

    Hi anonymous user,

Sorry for the delay in response. Your previous reply was moved to moderation for some reason; I am working with the platform team to restore it.

    Please find below insights on your follow-up query.

1) Why are we hitting wasb://? Can I directly use abfs://?

• I used wasb just as an example; you can use abfs. To further explore WASB vs ABFS, please refer to this doc.

2) copyUtil uses the open (read) and create functions.

i) Will this read/write increase the RUs compared to an Azure blob copy?

• AFAIK, blob charges are based only on data at rest, the quantity and types of operations performed, and the data redundancy option selected. Since blob copy is beyond my area of expertise, I would suggest you open a new thread in the Storage area, where our experts can better assist with this ask.

ii) The entire part-file data will be loaded into memory and then written to the target location; this will hit memory badly.

• There is a limit on buffered data in Hadoop, and it is 4 MB. So it will read 4 MB of data at a time and then write it to the destination blob.

    Azure Blob Copy: https://learn.microsoft.com/en-us/java/api/overview/azure/storage-blob-readme?view=azure-java-stable#copy-a-blob
Distcp Java Utility: https://hadoop.apache.org/docs/r3.0.0/api/org/apache/hadoop/tools/DistCp.html
Could you suggest which one is the better choice for performance and cost effectiveness?

• W.r.t. performance, I think the DistCp Java utility is better, but as Azure Blob copy is beyond my area of expertise, I would recommend checking with the storage experts in the Storage area about Azure blob copy.

If you are interested in concrete information about DistCp Java utility performance or cost, please reach out in the Cloudera community.

    Hope this helps.


1 additional answer

  1. KranthiPakala-MSFT 46,432 Reputation points Microsoft Employee
    2021-02-12T19:37:28.23+00:00

    Hi anonymous user,

    Thank you for your patience.

    After having conversation with product team, here are the suggestions.

Yes, you are correct: the FileSystem utility works with a single container only. Hence, you will need to write your own code to copy a blob from one container/storage account to another. Please check the attached BlbCopyUtil.java (I renamed it to BlbCopyUtil.txt to be able to upload it here). Below is the code:

    package org.apache.hadoop.io;

    import java.io.IOException;
    import java.net.URISyntaxException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlobCopyUtil {

      public static void main(String[] args)
          throws IllegalArgumentException, IOException, URISyntaxException {
        copyUtil(
            new Path(
                "wasb://test@hadoopazurejihdistorage.blob.core.windows.net/file1"),
            new Path(
                "wasb://test@kranthihdistorage.blob.core.windows.net/file5"));
      }

      public static void copyUtil(Path src, Path dest)
          throws IOException, URISyntaxException {
        Configuration conf = new Configuration();
        // Each fully qualified path resolves to its own FileSystem instance
        // (one per container/storage account).
        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem destFs = dest.getFileSystem(conf);
        // try-with-resources closes (and flushes) both streams even if the
        // copy fails part-way through.
        try (
            FSDataInputStream in = srcFs
                .open(Path.getPathWithoutSchemeAndAuthority(src));
            FSDataOutputStream out = destFs
                .create(Path.getPathWithoutSchemeAndAuthority(dest))) {
          // 4 MB copy buffer: data is streamed, not loaded entirely into memory.
          IOUtils.copyBytes(in, out, 4 * 1024 * 1024);
        }
      }
    }
    

    You can do the same using a shell command:

    hdfs dfs -cp wasb://container1@storage1.blob.core.windows.net/file1 wasb://container2@storage2.blob.core.windows.net/file2  
    

    If you want to copy files in bulk, please use the DistCp tool.

    hadoop distcp wasb://container1@storage1.blob.core.windows.net/srcDir wasb://container2@storage2.blob.core.windows.net/destDir  
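    For larger copies, DistCp's parallelism can also be tuned. A hedged example (the flag values here are illustrative, not recommendations):

    ```
    hadoop distcp -m 20 -update wasb://container1@storage1.blob.core.windows.net/srcDir wasb://container2@storage2.blob.core.windows.net/destDir
    ```

    Here -m caps the number of map tasks, and -update copies only files that are missing or differ at the destination.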
    

    Hope this helps

    ----------

    Thank you
    Please consider clicking "Accept Answer" and "Upvote" on the post that helped you, as it can be beneficial to other community members.
