Access Azure Blob Stores from HDInsight
Small Bites of Big Data
Edit Mar 6, 2014: This is no longer necessary for HDInsight - you specify the storage accounts when you create the cluster and the rest happens auto-magically. See https://blogs.msdn.com/b/cindygross/archive/2013/11/25/your-first-hdinsight-cluster-step-by-step.aspx or https://blogs.msdn.com/b/cindygross/archive/2013/12/06/sample-powershell-script-hdinsight-custom-create.aspx.
One of the great enhancements in Microsoft's HDInsight distribution of Hadoop is the ability to store and access Hadoop data in an Azure Blob Store. We do this via the HDFS API extension called Azure Storage Vault (ASV). Because the data lives in persistent blob storage rather than on the cluster, it survives even after you spin down an HDInsight cluster and can be shared across multiple programs or clusters. Blob stores can be replicated for redundancy and are highly available. When you need the data from Hadoop again, you simply point a cluster at the existing blobs.
Azure Blob Storage
Let's start with how your data is stored. A storage account is created in the Azure portal and has access keys associated with it. All access to your Azure blob data is done via storage accounts. Within a storage account you need to create at least one container, though you can have many. Files (blobs) are put in the container(s). For more information on how to create and use storage accounts and containers see: https://www.windowsazure.com/en-us/develop/net/how-to-guides/blob-storage/. Any storage accounts associated with HDInsight should be in the same data center as the cluster and must not be in an affinity group.
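Throughout the rest of this post, blob data is addressed with the ASV scheme. The general URI pattern looks like the following, where the container, storage account, and path pieces are placeholders for your own names:

asv://YOURContainer@YOURStorageAccount.blob.core.windows.net/path/to/blob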
You can create a container from the Azure portal or from any of the many Azure storage utilities available such as CloudXplorer. In the Azure portal you click on the Storage Account then go to the CONTAINERS tab. Next click on ADD CONTAINER at the very bottom of the screen. Enter a name for your container, choose the ACCESS property, and click on the checkmark.
HDInsight Service Preview
When you create your HDInsight Service cluster on Azure you associate your cluster with an existing Azure storage account in the same data center. In the current interface the QUICK CREATE doesn't allow you to choose a default container on that storage account so it creates a container with the same name as the cluster. If you choose CUSTOM CREATE you have the option to choose the default container from existing containers associated with the storage account you choose. This is all done in the Azure management portal: https://manage.windowsazure.com/.
You can then add additional storage accounts to the cluster by updating C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml on the head node. This is only necessary if those additional accounts have private containers (this is a property set in the Azure portal for each container within a storage account). Public containers and public blobs can be accessed without the id/key being stored in the configuration file. You choose the public/private setting when you create the container and can later edit it in the "Edit container metadata" dialog on the Azure portal.
The key storage properties in the default core-site.xml on HDInsight Service Preview are:
<property>
<name>fs.default.name</name>
<!-- cluster variant -->
<value>asv://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net</value>
<description>The name of the default file system. Either the
literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>
<property>
<name>dfs.namenode.rpc-address</name>
<value>hdfs://namenodehost:9000</value>
</property>
<property>
<name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
<value>YOURActualStorageKeyValue</value>
</property>
To add another storage account you will need the Windows Azure storage account information from https://manage.windowsazure.com. Log in to your Azure subscription and pick STORAGE from the left menu. Click on the account you want to use, then at the very bottom click on the "MANAGE KEYS" button. Copy and paste the PRIMARY ACCESS KEY (you can use the secondary key if you prefer) into the new property values we'll discuss below.
Create a Remote Desktop (RDP) connection to the head node of your HDInsight Service cluster. You can do this by clicking on the CONNECT button at the bottom of the screen when your HDInsight Preview cluster is highlighted. You can choose to save the .RDP file and edit it before you connect (right-click on the .RDP file in Explorer and choose Edit). You may want to enable access to your local drives from the head node: on the "Local Resources" tab, under "Local devices and resources", click the "More" button and select your drives. Then go back to the General tab and save the settings. Connect to the head node (either choose Open after you click CONNECT or use the saved .RDP file).
On the head node make a copy of C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml in case you have to revert back to the original. Next open core-site.xml in Notepad or your favorite editor.
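For example, from a command prompt on the head node (the .bak extension is just a suggested name for the backup copy):

REM keep an untouched copy of the original configuration, then open the live file
copy C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml.bak
notepad C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml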
Add your second Azure storage account by adding another property:
<property>
<name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
<value>YOUR_SECOND_ActualStorageKeyValue</value>
</property>
Save core-site.xml.
Repeat for each storage account you need to access from this cluster.
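Once core-site.xml is saved you can check that the cluster can reach the newly added account. The container and account names below are placeholders for your own, following the same pattern used in the examples later in this post:

REM YOURContainer and YOUR_SECOND_StorageAccount are placeholders
hadoop fs -ls asv://YOURContainer@YOUR_SECOND_StorageAccount.blob.core.windows.net/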
HDInsight Server Preview
The on-premises HDInsight Server preview, available from https://microsoft.com/bigdata, gives you a single-node "OneBox" install for testing basic functionality. You can put it on your local machine, on a Hyper-V virtual machine, or in a Windows Azure IaaS virtual machine. You can also point this OneBox install to ASV. Using an IaaS VM in the same data center as your storage account will give you better performance, though the OneBox preview is meant purely for basic functional testing and not for high performance since it is limited to a single node. The steps are slightly different for on-premises because the installation directory and the default properties in core-site.xml are different.
Make a backup copy of C:\Hadoop\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml from your local installation (local could be on a VM).
Edit core-site.xml:
1) Change the default file system from local HDFS to remote ASV
<property>
<name>fs.default.name</name>
<!-- cluster variant -->
<value>hdfs://localhost:8020</value>
<description>The name of the default file system. Either the
literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>
to:
<property>
<name>fs.default.name</name>
<!-- cluster variant -->
<value>asv://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net</value>
<description>The name of the default file system. Either the
literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>
2) Add the namenode property (do not change any values)
<property>
<name>dfs.namenode.rpc-address</name>
<value>hdfs://namenodehost:9000</value>
</property>
3) Add the information that associates the key value with your default storage account
<property>
<name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
<value>YOURActualStorageKeyValue</value>
</property>
4) Add any additional storage accounts you plan to access
<property>
<name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
<value>YOUR_SECOND_ActualStorageKeyValue</value>
</property>
Save core-site.xml.
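As a quick sanity check, listing the root of the default file system should now show the contents of your default blob container rather than local HDFS, since ASV is now the default:

hadoop fs -ls /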
Files
Upload one or more files to your container(s). You can use many methods for loading the data, including Hadoop file system commands such as copyFromLocal or put, third-party tools like CloudXplorer, JavaScript, or whatever method fits your needs. For example, I can upload all files in a data directory (for simplicity this sample refers to C:, which is local to the head node) using the Hadoop put command:
hadoop fs -put c:\data\ asv://data@sqlcatwomanblog.blob.core.windows.net/
Or upload a single file:
hadoop fs -put c:\data\bacon.txt asv://data@sqlcatwomanblog.blob.core.windows.net/bacon.txt
To view the files in a linked non-default container or a public container use this syntax from a Hadoop Command Line prompt (fs=file system, ls=list):
hadoop fs -ls asv://data@sqlcatwomanblog.blob.core.windows.net/
Found 1 items
-rwxrwxrwx 1 124 2013-04-24 20:12 /bacon.txt
In this case the container named data on the private storage account sqlcatwomanblog has one file called bacon.txt.
For the default container the syntax does not require the container and account information. Since the default storage is ASV rather than HDFS (even for HDInsight Server in this case because we changed it in core-site.xml) you can even leave out the ASV reference.
hadoop fs -ls asv:///bacon.txt
hadoop fs -ls /bacon.txt
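You can read a blob back the same way you would read any other file the cluster can see; for example, with the bacon.txt file used above (the local target path is just an illustration):

REM print the blob contents to the console
hadoop fs -cat /bacon.txt
REM copy the blob down to a local path of your choosing
hadoop fs -get /bacon.txt c:\data\bacon_copy.txt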
More Information
- HDInsight Documentation Portal https://www.windowsazure.com/en-us/manage/services/hdinsight/
- Using Windows Azure Blob Storage with HDInsight https://www.windowsazure.com/en-us/manage/services/hdinsight/howto-blob-store/
- Hadoop Shell Commands https://hadoop.apache.org/docs/r0.18.1/hdfs_shell.html
- Updated HDInsight on Azure ASV paths for multiple storage accounts https://dennyglee.com/2013/03/25/updated-hdinsight-on-azure-asv-paths-for-multiple-storage-accounts/
I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon.
Note: the Preview, CTP, and TAP programs are available for a limited time. Details of the usage and the availability of the pre-release versions may change rapidly.
Comments
- Anonymous, October 08, 2013
  Note: ASV is being replaced with WASB. For HDInsight 2.0 both work in the syntax, but start replacing ASV with WASB in your settings and code.
- Anonymous, December 11, 2013
  Dear Cindy, thanks for the post. It was really useful. But using hadoop fs -lsr wasb:// lists all the contents of the container. Instead it would be nice to have a step by step display of contents:
  /user/Folder 1/SubFolder 1/file1.jpg
  /user/Folder 1/SubFolder 1/file2.jpg
  /user/Folder 1/SubFolder 1/file3.jpg
  /user/Folder 2/SubFolder 1/file1.jpg
  /user/Folder 2/SubFolder 1/file2.jpg
  /user/Folder 2/SubFolder 1/file3.jpg
  hadoop fs -lsr wasb:///user/ gives all the above files. It would be nice if we get Folder 1 and Folder 2 rather than all. Is this option possible currently?
- Anonymous, January 14, 2014
  To see how to enable Azure Storage (WASB, previously ASV) for Hortonworks HDP on Azure VMs (IaaS) see my colleague Alexei's blog: alexeikh.wordpress.com/.../expanding-hdp-hadoop-file-system-to-azure-blob-storage
- Anonymous, April 15, 2014
  Hi, how do you connect to an HDInsight cluster by using the web and not Remote, the "Manage" option? The "Manage" button has been removed or is invisible from the cluster page. The only options I see are "Delete", "Connect" (after enabling remote), and "Disable Remote"; no "Manage". Any idea? Many thanks in advance. Oscar
- Anonymous, November 05, 2014
  Hi, how can we access the HDInsight data from our on-premise applications, and retrieve files like images in the application? Regards, Nirav
- Anonymous, November 06, 2014
  By default, HDInsight clusters simply point to data sitting on an Azure storage account. As long as it's not in some proprietary format like ORCFile or HBase the data is accessible just like any other files on the storage accounts. Look up WASB to learn more on the concept of how Microsoft and Hortonworks built the WASB extension on top of HDFS to allow the separation of storage and compute on Azure.