What is the best way to copy data from my on-prem Hadoop cluster to the Azure HDInsight cluster?

Richmond Yu 1 Reputation point
2022-05-16T20:57:25.917+00:00

Hi experts,
What is the best way to copy data from my on-prem Hadoop cluster to the Azure HDInsight cluster?
We recently deployed a new HDInsight cluster, and now I would like to copy some data from my on-prem cluster to HDInsight.

Thanks,

Azure HDInsight
An Azure managed cluster service for open-source analytics.

1 answer

  1. PRADEEPCHEEKATLA-MSFT 81,151 Reputation points Microsoft Employee
    2022-05-17T05:26:10.363+00:00

    Hello @Richmond Yu ,

    Thanks for the question and for using the Microsoft Q&A platform.

    Azure HDInsight follows a strong separation of compute and storage, so the recommendation is to store your data in Azure Storage blobs, in Azure Data Lake Store, or in a combination of the two. Both provide an HDFS-compatible file system that persists data even if the cluster is deleted.

    Note: Given that you should not use the local virtual machine storage for your data, where should you store the data managed by your HDInsight cluster?

    Although an on-premises installation of Hadoop uses HDFS on the cluster's own disks for storage, in Azure you should use external storage services rather than the local disk storage provided by the virtual machines in the cluster.

    The benefits of this approach are:

    • The data is persistent, even after you delete your HDInsight cluster. This means it will also be available without any data transfer effort should you deploy a new cluster to perform additional processing.
    • The costs for storing your data are predominantly driven by the volume of data stored and transferred, which can be significantly less than the costs for running a cluster.
    • The data is available for multiple clusters to act upon.

    Hadoop supports a notion of the default file system. The default file system implies a default scheme and authority. It can also be used to resolve relative paths. During the HDInsight cluster creation process, you can specify a blob container in Azure Storage as the default file system, or with HDInsight 3.5 or later, you can select either Azure Storage or Azure Data Lake Store as the default file system.
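To make the "default scheme and authority" idea concrete, here is a minimal Python sketch of how a path without a scheme could be qualified against a default file system URI. It only mimics the behaviour described above and is not Hadoop's actual implementation. The function name `qualify`, the container `mycontainer`, the account `myaccount`, and the working directory `/user/alice` are all hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def qualify(path, default_fs, working_dir="/user/alice"):
    """Return a fully qualified URI for `path`.

    - Paths that already carry a scheme are returned unchanged.
    - Absolute paths get the default scheme and authority prepended.
    - Relative paths are first joined to `working_dir` (hypothetical value).
    """
    if urlsplit(path).scheme:
        return path  # already fully qualified, e.g. hdfs://... or wasbs://...
    fs = urlsplit(default_fs)
    if not path.startswith("/"):
        path = posixpath.join(working_dir, path)
    return f"{fs.scheme}://{fs.netloc}{posixpath.normpath(path)}"


# Hypothetical default file system: a blob container in Azure Storage.
DEFAULT_FS = "wasbs://mycontainer@myaccount.blob.core.windows.net"

print(qualify("/example/data", DEFAULT_FS))
print(qualify("logs/2022", DEFAULT_FS))
print(qualify("hdfs://onprem-namenode:8020/data", DEFAULT_FS))
```

The first two calls resolve to locations inside the default container, while the third is left alone because it names its own scheme and authority.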

    In addition to this default file system, you can add additional Azure Storage Accounts or Data Lake Store instances during the cluster creation process or after a cluster has been created.


    To copy data from an on-premises cluster to an Azure Storage account, you can refer to the articles below:
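One commonly used option for this kind of copy is Hadoop DistCp, run from the on-premises cluster with the Azure Storage container as the target. The Python sketch below only assembles such a command line; it is an illustration under assumptions, not a recommendation from the answer above. It assumes the on-premises cluster has the Azure Storage connector available and the storage account credentials configured, and the host name `onprem-namenode`, the container `mycontainer`, the account `myaccount`, and the helper `build_distcp_command` are all hypothetical.

```python
import shlex


def build_distcp_command(source_uri, target_uri, max_maps=None, update=False):
    """Assemble a `hadoop distcp` command as an argument list.

    -update  copies only files that are missing or differ at the target
    -m N     caps the number of simultaneous map tasks used for the copy
    """
    cmd = ["hadoop", "distcp"]
    if update:
        cmd.append("-update")
    if max_maps is not None:
        cmd += ["-m", str(max_maps)]
    cmd += [source_uri, target_uri]
    return cmd


# Hypothetical source and target locations.
cmd = build_distcp_command(
    "hdfs://onprem-namenode:8020/data/sales",
    "wasbs://mycontainer@myaccount.blob.core.windows.net/data/sales",
    max_maps=20,
    update=True,
)

# Print the command for review; on a node of the on-premises cluster it
# could then be executed, e.g. with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Building the command as a list and printing it first makes it easy to check the source and target URIs before any data is moved. For very large data sets or limited network bandwidth, other transfer options may suit better.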

    Hope this helps. Please let us know if you have any further queries.
