Hello @Richmond Yu ,
Thanks for the question and using MS Q&A platform.
Azure HDInsight follows a strong separation of compute & storage - as such the recommendation is to store your data either in Azure Storage blobs and Azure Data Lake Store, or a combination of the two. Both provide an HDFS compatible file system that persists data even if the cluster is deleted.
Note: Given that you should not use the local virtual machine storage for your data, where should you store the data managed by your HDInsight cluster?
Although an on-premises installation of Hadoop uses the HDFS for storage on the cluster, in Azure you should use external storage services, and not the local disk storage provided by the virtual machines in the cluster.
The benefit of this approach is:
- The data is persistent, even after you delete your HDInsight cluster. This means it will also be available without any data transfer effort should you deploy a new cluster to perform additional processing.
- The costs for storing your data are predominalty driven by the volume of data stored and tranferred, which can be signficantly less than the costs for running a cluster.
- The data is available for multiple clusters to act upon.
Hadoop supports a notion of the default file system. The default file system implies a default scheme and authority. It can also be used to resolve relative paths. During the HDInsight cluster creation process, you can specify a blob container in Azure Storage as the default file system, or with HDInsight 3.5 or later, you can select either Azure Storage or Azure Data Lake Store as the default file system.
In addition to this default file system, you can add additional Azure Storage Accounts or Data Lake Store instances during the cluster creation process or after a cluster has been created.
To copy data from on-premise cluster to Azure Storage account you can refer to the below articles:
- Copy data from the HDFS server using Azure Data Factory or Synapse Analytics
- Use Azure Data Factory to migrate data from an on-premises Hadoop cluster to Azure Storage
- Migrate on-premises Hadoop data to Azure Data Lake Storage Gen2 with WANdisco LiveData Platform for Azure
Hope this will help. Please let us know if any further queries.
------------------------------
- Please don't forget to click on or upvote button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
- Want a reminder to come back and check responses? Here is how to subscribe to a notification
- If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators