How to use HDInsight from Linux
HDInsight is very easy to use from PowerShell, but how would you create and delete a cluster from Linux? How would you submit a job and get the result?
Here is a simple sample, with pointers to further documentation.
1. Create a cluster
You can create a cluster with the Windows Azure Command Line Interface (CLI).
To install the CLI, go to https://windowsazure.com and open the downloads page. At the bottom of that page there are two links: one for the CLI itself and one for its documentation.
Once it is installed, you get an azure command with many options.
The following bash script will create a cluster:
#!/bin/bash
# create an HDInsight cluster
# more information at https://www.windowsazure.com/en-us/documentation/articles/hdinsight-administer-use-command-line/
defaultStorageAccount='monstockageazure'
storageAccount2='wasbshared'
clusterName='monclusterhadoop'
clusterContainerName='monclusterhadoop2'
clusterVersion='2.1'
clusterAdmin='cornac'
clusterConfigFile='./hdinsightCluster.config'
subscription='demos874F33876Y'
clusterPassword='YHqj6sq#ap9'
defaultStorageAccountKey='9O5uEqY1MsT6LIKifmXL0bQgrQElbslvu4N6mX58mSpPa4sPtYPTL5YjvLvcQAItuw87BdLulZWnGJWZ/VCd6Q=='
storageAccount2Key='7on846mc+5u9AItkVIEYz1OXwJZ86gN7o7ExURXO3qWJy+jNO56EtfUmRur+/qKkFGc4drA4GvBmhYGiBMlj3g=='
azure account set $subscription
azure hdinsight cluster config create $clusterConfigFile
azure hdinsight cluster config set $clusterConfigFile --clusterName $clusterName --nodes 3 --location "North Europe" --storageAccountName "$defaultStorageAccount.blob.core.windows.net" --storageAccountKey "$defaultStorageAccountKey" --storageContainer "$clusterName" --username "$clusterAdmin" --clusterPassword "$clusterPassword"
azure hdinsight cluster config storage add $clusterConfigFile --storageAccountName "$storageAccount2.blob.core.windows.net" --storageAccountKey "$storageAccount2Key"
azure hdinsight cluster create --config $clusterConfigFile
2. Submit a job
HDInsight exposes WebHCat (formerly named Templeton), an Apache REST API that lets you submit jobs. It is documented at https://cwiki.apache.org/confluence/display/Hive/WebHCat.
There are many ways to call a REST API from Linux. The one I chose for this post is Python. For this sample, install the “requests” module:
pip install requests
Then you can run this script (02_submit_hive_job.py)
import requests  # https://pypi.python.org/pypi/requests

clusterName = 'monclusterhadoop'
clusterAdmin = 'cornac'
clusterPassword = 'YHqj6sq#ap9'

# get WebHCat status (checks that the endpoint and the credentials work)
webHCatUrl = 'https://' + clusterName + '.azurehdinsight.net/templeton/v1/status'
r = requests.get(webHCatUrl, auth=(clusterAdmin, clusterPassword))
# print(...) with a single argument works in both Python 2 and Python 3
print(r.status_code)
print(r.json())

# submit a hive job:
# SELECT * FROM hivesampletable limit 10
# https://docs.hortonworks.com/HDPDocuments/HDP1/HDP-Win-1.3.0/ds_HCatalog/hive.html
webHCatUrl = 'https://' + clusterName + '.azurehdinsight.net/templeton/v1/hive'
hive_params = {'user.name': clusterAdmin,
               'execute': 'SELECT * FROM hivesampletable limit 10',
               'statusdir': '/wasbwork/hive_from_python'}  # folder that receives the job output
r = requests.post(webHCatUrl, auth=(clusterAdmin, clusterPassword), data=hive_params)
print(r.status_code)
print(r.json())  # contains the id of the submitted job
with the following command line:
python 02_submit_hive_job.py
In my case, I got the following result:
benjguin@benjguinu2:~/dev/hdinsight_from_linux$ python 02_submit_hive_job.py
200
{u'status': u'ok', u'version': u'v1'}
200
{u'id': u'job_201402171346_0002'}
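The two calls in the script follow the same pattern: a URL built from the cluster name and a WebHCat resource, plus a few form fields for the Hive job. As a minimal sketch, this can be factored into two small helpers. The function names webhcat_url and hive_job_params are hypothetical and only serve this illustration; the URL layout and the field names are the ones used in the script above.

```python
def webhcat_url(cluster_name, resource):
    """Return the WebHCat URL of a resource such as 'status' or 'hive'."""
    return 'https://{0}.azurehdinsight.net/templeton/v1/{1}'.format(
        cluster_name, resource)


def hive_job_params(user, query, status_dir):
    """Return the form fields posted to the 'hive' resource."""
    return {
        'user.name': user,        # user the job runs as
        'execute': query,         # HiveQL statement to run
        'statusdir': status_dir,  # folder that receives stdout and the other output files
    }
```

With these helpers, the submission becomes requests.post(webhcat_url(clusterName, 'hive'), auth=(clusterAdmin, clusterPassword), data=hive_job_params(clusterAdmin, 'SELECT * FROM hivesampletable limit 10', '/wasbwork/hive_from_python')).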
You can also get the status of the job, submit Pig jobs, or submit Hive jobs from scripts that you uploaded to Windows Azure Storage Blob. Here is a link to the documentation by Hortonworks:
https://docs.hortonworks.com/HDPDocuments/HDP1/HDP-Win-1.3.0/ds_HCatalog/hive.html
That page has a table of contents on the left.
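Getting the status of a job usually means polling until the job is done. The sketch below shows one possible way to do it and rests on several assumptions that are not part of this post: that the status resource is templeton/v1/queue/<job id> (newer WebHCat versions name it jobs/<job id>), and that the JSON response has a 'status' object with a boolean 'jobComplete' field. Please check both against the WebHCat reference for your cluster version. The function names are hypothetical.

```python
import time


def job_complete(response):
    """Assumption: the response has a 'status' object with a boolean 'jobComplete'."""
    return bool((response.get('status') or {}).get('jobComplete'))


def wait_for_job(fetch_status, job_id, is_done=job_complete,
                 poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Call fetch_status(job_id) until is_done says so.

    Returns the last response, or None if max_polls is reached first.
    """
    for _ in range(max_polls):
        response = fetch_status(job_id)
        if is_done(response):
            return response
        sleep(poll_seconds)
    return None


def make_webhcat_fetcher(cluster_name, user, password, resource='queue'):
    """Build a fetch_status function for a cluster.

    Assumption: the job status lives at templeton/v1/<resource>/<job id>.
    """
    import requests  # imported here so the polling logic has no dependency

    def fetch_status(job_id):
        url = 'https://{0}.azurehdinsight.net/templeton/v1/{1}/{2}'.format(
            cluster_name, resource, job_id)
        r = requests.get(url, auth=(user, password), params={'user.name': user})
        r.raise_for_status()
        return r.json()

    return fetch_status
```

For the job submitted above, this would be called as wait_for_job(make_webhcat_fetcher(clusterName, clusterAdmin, clusterPassword), 'job_201402171346_0002'). Passing the fetch function as a parameter keeps the polling loop independent of the exact REST resource.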
3. Get the result
In the Python script, we asked for the result to be written to /wasbwork/hive_from_python, so it is stored in Windows Azure Storage Blob, or wasb (in HDInsight, wasb is the default file system over HDFS, which is also available at hdfs://namenodehost:9000/(…)). Once the job is finished (a script can check this with the same REST API), the output files, including stdout, are in that folder of the cluster's container.
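The link between the statusdir given to WebHCat and the blob to download is simple: the container is the cluster's default container, and the blob name is the statusdir path without its leading slash, followed by the file name. A small sketch of that mapping follows; the function names are hypothetical, and the wasb:// URI form is the usual container@account notation rather than something this post uses directly.

```python
def status_blob_name(status_dir, file_name='stdout'):
    """Map a statusdir such as '/wasbwork/hive_from_python' to a blob name."""
    return status_dir.strip('/') + '/' + file_name


def wasb_uri(container, storage_account, path):
    """Return the wasb:// URI of a path in a given container and storage account."""
    return 'wasb://{0}@{1}.blob.core.windows.net/{2}'.format(
        container, storage_account, path.lstrip('/'))
```

For this post's job, status_blob_name('/wasbwork/hive_from_python') gives 'wasbwork/hive_from_python/stdout', to be looked up in the monclusterhadoop container.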
You can download the result with the azure CLI and display it with this bash script:
#!/bin/bash
defaultStorageAccount='monstockageazure'
clusterName='monclusterhadoop'
defaultStorageAccountKey='9O5uEqY1MsT6LIKifmXL0bQgrQElbslvu4N6mX58mSpPa4sPtYPTL5YjvLvcQAItuw87BdLulZWnGJWZ/VCd6Q=='
export AZURE_STORAGE_ACCOUNT="$defaultStorageAccount"
export AZURE_STORAGE_ACCESS_KEY="$defaultStorageAccountKey"
azure storage blob download $clusterName wasbwork/hive_from_python/stdout
cat wasbwork/hive_from_python/stdout
In my case, this gave the following result:
benjguin@benjguinu2:~/dev/hdinsight_from_linux$ ./03_get_result.sh
info: Executing command storage blob download
+ Download blob wasbwork/hive_from_python/stdout in container monclusterhadoop to wasbwork/hive_from_python/stdout
Percentage: 100.0% (809.00B/809.00B) Average Speed: 809.00B/S Elapsed Time: 00:00:00
+ Getting Storage blob information
info: File saved as wasbwork/hive_from_python/stdout
info: storage blob download command OK
8 18:54:20 en-US Android Samsung SCH-i500 California United States 13.9204007 0 0
23 19:19:44 en-US Android HTC Incredible Pennsylvania United States NULL 0 0
23 19:19:46 en-US Android HTC Incredible Pennsylvania United States 1.4757422 0 1
23 19:19:47 en-US Android HTC Incredible Pennsylvania United States 0.245968 0 2
28 01:37:50 en-US Android Motorola Droid X Colorado United States 20.3095339 1 1
28 00:53:31 en-US Android Motorola Droid X Colorado United States 16.2981668 0 0
28 00:53:50 en-US Android Motorola Droid X Colorado United States 1.7715228 0 1
28 16:44:21 en-US Android Motorola Droid X Utah United States 11.6755987 2 1
28 16:43:41 en-US Android Motorola Droid X Utah United States 36.9446892 2 0
28 01:37:19 en-US Android Motorola Droid X Colorado United States 28.9811416 1 0
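If you want to process these rows instead of just displaying them, the downloaded file can be parsed in a few lines of Python. This is only a sketch, and it assumes that the fields in the stdout file are separated by tab characters (the usual Hive output format; the separators are not visible in the listing above) and that missing values appear as the string NULL. The function name is hypothetical.

```python
def parse_hive_stdout(text):
    """Split Hive output into rows of fields, turning 'NULL' into None.

    Assumption: one row per line, fields separated by tabs.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        rows.append([None if field == 'NULL' else field
                     for field in line.split('\t')])
    return rows
```

A typical use is rows = parse_hive_stdout(open('wasbwork/hive_from_python/stdout').read()), after which each row is a list of strings.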
4. Remove the cluster
In order to remove the cluster, the azure CLI will also help:
#!/bin/bash
clusterName='monclusterhadoop'
azure hdinsight cluster delete $clusterName
This produces the following sample output:
benjguin@benjguinu2:~/dev/hdinsight_from_linux$ ./04_removeCluster.sh
info: Executing command hdinsight cluster delete
+ Removing HDInsight Cluster
info: hdinsight cluster delete command OK
benjguin@benjguinu2:~/dev/hdinsight_from_linux$
Conclusion
This post only shows a few simple examples. The goal is to show the principles that can be used. The azure CLI is used to manage the cluster itself, and may also be used to interact with Windows Azure Storage blobs. Submitting jobs can be done with WebHCat REST calls.
Benjamin (@benjguin)