How to install Hadoop on Windows Azure Linux virtual machines
Windows Azure HDInsight is the simplest way to get a Hadoop cluster up and running quickly in a Windows Azure environment. Among its many advantages, this service lets you use Windows Azure blob storage (ASV, short for Azure Storage Vault) exactly like HDFS (Hadoop Distributed File System). In its public beta version, Windows Azure HDInsight uses ASV as its default file system. I put some screenshots of what you can easily do in a matter of minutes with such a cluster in this post (in French): TUTO - Hadoop arrive dans le portail Windows Azure. Lançons des jobs Java, PIG et HIVE pour voir ! (Tutorial: Hadoop arrives in the Windows Azure portal. Let's launch Java, PIG and HIVE jobs to see!).
Still, you may want to install a custom distribution, run custom components which are not available in the HDInsight distribution, or have Hadoop running on Linux instead of Windows. This post shows how to install a Hadoop distribution on Linux virtual machines in Windows Azure.
While it is possible to install several distributions such as MapR, CDH (Cloudera) or HDP (Hortonworks) on different Linux OSs such as CentOS, SUSE, or Ubuntu, this post takes HDP 1.2 on CentOS as an example. The documentation I follow to install the cluster is Hortonworks’ HDP 1.2.2 installation guide with Ambari on CentOS.
In this post, I assume you already have a Windows Azure account. The windowsazure.com web site can provide you with the information you need to get one. I also assume readers are fairly advanced users, so I don’t give all the details when documentation already exists elsewhere.
Scope
This blog post shows one way to install a Linux Hadoop cluster. It may not follow all the best practices, particularly in terms of security. The goal of the post is to show how this kind of environment can be hosted in Windows Azure. I chose to use Windows (DNS server, web browser, scripting environment, …) because this is what is simple for me, but it is possible to install the cluster without using Windows at all, and I give some hints on how to do that. Also, I use both the portal and PowerShell scripting because I find it easier to understand that way, but I’m pretty confident everything could be done with scripting alone.
Choosing the Windows Azure environment for the Linux cluster
There are several ways to have a local network in Windows Azure. One of them is to create a virtual network.
I also want to be able to use a browser against the cluster and have a DNS server I can easily set up and manage. In this scenario, I install a Windows Server machine which plays those roles, because I’m more of a Windows guy! Note that you may prefer to install a Linux-based DNS server in the virtual network and browse the cluster through an SSH tunnel.
The following table contains the machines I want to instantiate and their roles:
Server name | Server role | Subnet |
n124dns | DNS server and browser (Windows Server) | Subnet-1 |
n124m | master node | Subnet-2 |
n124w1 to n124w3 | worker nodes | Subnet-3 |
In this sample, all machines are in the following DNS domain: n124.benjguin.com
In the end, I’ll have these machines:
n124dns will be a Windows Server machine (it could also be a Linux machine).
All the others will be Linux machines.
Configure the Virtual Network
You can find details on Windows Azure virtual networks here: https://www.windowsazure.com/en-us/manage/services/networking/.
I create a virtual network as explained at https://www.windowsazure.com/en-us/manage/services/networking/create-a-virtual-network/.
The parameters I enter are detailed in the XML file below (this file can be obtained by exporting the virtual network configuration from the Windows Azure Management portal).
<NetworkConfiguration xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.microsoft.com/ServiceHosting/2011/07/NetworkConfiguration">
<VirtualNetworkConfiguration>
<Dns>
<DnsServers>
<DnsServer name="n124dns" IPAddress="10.124.1.4" />
</DnsServers>
</Dns>
<VirtualNetworkSites>
<VirtualNetworkSite name="n124" AffinityGroup="NorthEurope2Aff">
<AddressSpace>
<AddressPrefix>10.0.0.0/8</AddressPrefix>
</AddressSpace>
<Subnets>
<Subnet name="Subnet-1">
<AddressPrefix>10.124.1.0/24</AddressPrefix>
</Subnet>
<Subnet name="Subnet-2">
<AddressPrefix>10.124.2.0/24</AddressPrefix>
</Subnet>
<Subnet name="Subnet-3">
<AddressPrefix>10.124.3.0/24</AddressPrefix>
</Subnet>
</Subnets>
<DnsServersRef>
<DnsServerRef name="n124dns" />
</DnsServersRef>
</VirtualNetworkSite>
</VirtualNetworkSites>
</VirtualNetworkConfiguration>
</NetworkConfiguration>
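The address plan in the XML above can be sanity-checked before creating any VM. The following is only a minimal sketch, assuming Python 3.7+ and its standard ipaddress module; the function name check_plan is mine, and the values are copied by hand from the configuration rather than parsed from the exported file.

```python
import ipaddress

# Values copied by hand from the network configuration above.
address_space = ipaddress.ip_network("10.0.0.0/8")
subnets = {
    "Subnet-1": ipaddress.ip_network("10.124.1.0/24"),
    "Subnet-2": ipaddress.ip_network("10.124.2.0/24"),
    "Subnet-3": ipaddress.ip_network("10.124.3.0/24"),
}
dns_server = ipaddress.ip_address("10.124.1.4")


def check_plan(space, nets, dns):
    """Return a list of problems found in the address plan (empty if none)."""
    problems = []
    names = list(nets)
    # Every subnet must sit inside the virtual network address space.
    for name in names:
        if not nets[name].subnet_of(space):
            problems.append(f"{name} is outside {space}")
    # No two subnets may overlap.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if nets[a].overlaps(nets[b]):
                problems.append(f"{a} overlaps {b}")
    # The DNS server address must belong to one of the subnets.
    if not any(dns in net for net in nets.values()):
        problems.append(f"DNS server {dns} is in no subnet")
    return problems


problems = check_plan(address_space, subnets, dns_server)
dns_subnet = next(name for name, net in subnets.items() if dns_server in net)
print(problems or "address plan looks consistent")
print("DNS server subnet:", dns_subnet)
```

With the values above, the DNS server address falls in Subnet-1, which matches the table of machines earlier in the post.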
From the portal, the virtual network looks like this:
Install the DNS Server
I install the DNS server as a Windows Server 2012 VM. It could also be a Linux machine, but I find it simpler to use a Windows Server, which also serves as a desktop environment for administrative tasks such as browsing the Ambari Server or Hadoop dashboards from a machine that has local access to the whole cluster.
I create a new Windows Server 2012 VM. For that, I go to the Windows Azure Management portal, and select NEW at the bottom left of the web page. Then,
Then I can connect to this machine from the portal:
From the Server Manager, which can be started with this icon at the bottom left of the desktop, I do the following:
I ignore the following warning:
After the installation, I can use the DNS configuration console, which can be started this way:
I create a zone called n124.benjguin.com (benjguin.com is a domain I own) and its associated reverse lookup zone. Here is how it looks:
NB: you may also want to turn off the IE Enhanced Security Configuration here:
Create a Linux image
The HDP documentation explains how to install on RPM-based Linux OSs. CentOS is one of them, so let’s install CentOS virtual machines.
For that, I go to the Windows Azure Management portal, and select NEW at the bottom left of the web page. Then,
Yes
I’ll follow the Hortonworks’ documentation for a CentOS 6 OS. In particular, chapter 1.5 explains how to prepare the environment. I’ll give URLs in the documentation so that you can have the context, as well as the main steps I follow in my environment.
ssh-keygen
(keep defaults)
sudo su
cd
mkdir .ssh
cat /home/benjguin/.ssh/id_rsa.pub >> .ssh/authorized_keys
chmod 700 .ssh
chmod 640 .ssh/authorized_keys
# cf https://stackoverflow.com/questions/7268788/centos-6-sshd-server-refused-our-key
restorecon -R -v /root/.ssh
CTRL-D
ssh root@localhost should work without asking for a password
sudo -s
setenforce 0
vi /etc/selinux/config
sudo -s
chkconfig iptables off
/etc/init.d/iptables stop
sudo -s
vi /etc/yum/pluginconf.d/refresh-packagekit.conf
sudo shutdown -h now
(…)
Now, I have my own Linux image. I’ll use it to create the different Linux VMs I need in my cluster.
Instantiate Linux virtual machines
sudo -s
vi /etc/hosts
sudo -s
fdisk -l
grep SCSI /var/log/messages
fdisk /dev/sdc
mkfs.ext3 /dev/sdc1
mkdir /mnt/datadrive
mount /dev/sdc1 /mnt/datadrive
vi /etc/fstab
add the following line at the end of the file:
/dev/sdc1 /mnt/datadrive ext3 defaults 1 2
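Since the same fstab line has to be added on every node, it can help to make that edit idempotent. Here is a minimal sketch of the idea; the helper name ensure_fstab_entry and the sample existing content are my own assumptions, and the code works on a string rather than on the real /etc/fstab.

```python
# The entry added at the end of /etc/fstab in the steps above.
FSTAB_ENTRY = "/dev/sdc1 /mnt/datadrive ext3 defaults 1 2"


def ensure_fstab_entry(fstab_text, entry=FSTAB_ENTRY):
    """Return fstab_text with entry appended, unless its mount point is already listed."""
    mount_point = entry.split()[1]
    for line in fstab_text.splitlines():
        fields = line.split()
        # Ignore blank lines and comments; the second field is the mount point.
        if len(fields) >= 2 and not fields[0].startswith("#") and fields[1] == mount_point:
            return fstab_text
    if fstab_text and not fstab_text.endswith("\n"):
        fstab_text += "\n"
    return fstab_text + entry + "\n"


# Hypothetical existing content, for illustration only.
sample = "/dev/sda1 / ext4 defaults 1 1\n"
once = ensure_fstab_entry(sample)
twice = ensure_fstab_entry(once)  # a second run changes nothing
print(once, end="")
```

Running the helper twice yields the same text, so a script built on it could be replayed on a node without duplicating the mount line.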
The other machines in the cluster can be created through the portal too, but that can also be done with a script. There are two main flavors of automation tools in Windows Azure: the Windows Azure PowerShell module, which can be used from Windows machines, and the Command Line Interface (CLI for short), which is based on Node.js and can be used from Windows, Mac and Linux. Both can be downloaded from https://www.windowsazure.com/en-us/downloads/.
As I’m a Windows guy, I will use PowerShell here. The Windows Azure documentation explains how to get started with the Windows Azure PowerShell cmdlets.
Here is the script I use to create n124w1, n124w2 and n124w3:
Import-Module azure
#select default subscription and storage account
$subscription = 'Azure bengui'
Set-AzureSubscription -SubscriptionName $subscription -CurrentStorageAccount 'northeurope2affstorage'
Set-AzureSubscription -DefaultSubscription $subscription
#$adminPassword="******obfuscated*****"
#region secret
#endregion
#create an empty collection of VMConfigs
$vms = @()
#loop on the three VM we want to create
for($i=1; $i -le 3; $i++)
{
    Write-Host "creating n124w${i}OS"
    $sshPort = 52200 + $i
    #create a new VM Config
    $newVM = `
        New-AzureVMConfig -ImageName myCentOSImage -InstanceSize Medium -Name "n124w$i" `
            -AvailabilitySetName "n124wAvailabilitySet" -DiskLabel "n124w${i}os" `
            -HostCaching ReadWrite -Label "n124w$i" |
        Add-AzureProvisioningConfig -Linux -LinuxUser "benjguin" -Password $adminPassword |
        Add-AzureDataDisk -CreateNew -DiskLabel n124w${i}data1 -DiskSizeInGB 100 -LUN 0 |
        Add-AzureEndpoint -LocalPort 22 -Name "SSH$i" -Protocol tcp -PublicPort $sshPort |
        Set-AzureSubnet 'Subnet-3'
    #add the VM config to the collection
    $vms += ,$newVM
}
#show the collection
$vms | format-table
#create the VM
New-AzureVM -ServiceName 'n124' -VMs $vms -VNetName 'n124'
This generates the following output:
creating n124w1OS
creating n124w2OS
creating n124w3OS
AvailabilitySetName ConfigurationSets DataVirtualHardDisks Label OSVirtualHardDisk RoleName RoleSize RoleType
------------------- ----------------- -------------------- ----- ----------------- -------- -------- --------
n124wAvailabilitySet {n124w1, Microsoft... {} bjEyNHcx Microsoft.Samples.... n124w1 Medium PersistentVMRole
n124wAvailabilitySet {n124w2, Microsoft... {} bjEyNHcy Microsoft.Samples.... n124w2 Medium PersistentVMRole
n124wAvailabilitySet {n124w3, Microsoft... {} bjEyNHcz Microsoft.Samples.... n124w3 Medium PersistentVMRole
WARNING: VNetName, DnsSettings, DeploymentLabel or DeploymentName Name can only be specified on new deployments.
OperationDescription OperationId OperationStatus
-------------------- ----------- ---------------
New-AzureVM - Create VM n124w1 652a8988-df51-4777-94ae-bc7764296b0b Succeeded
New-AzureVM - Create VM n124w2 5b165d43-dcc7-4f5d-a1fc-ae77c6fa62dd Succeeded
New-AzureVM - Create VM n124w3 70750433-9f19-493c-9fdd-a8aad69fa193 Succeeded
and that can be seen in the Windows Azure management portal:
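A side note on the Label column of the output above: the values look cryptic, but they appear to be the Base64 encoding of the labels passed in the script. The short sketch below, which only uses Python's standard base64 module, decodes the three values copied from that output.

```python
import base64

# Label values copied from the PowerShell output above.
labels = ["bjEyNHcx", "bjEyNHcy", "bjEyNHcz"]

# Decode each Base64 string back to ASCII text.
decoded = [base64.b64decode(label).decode("ascii") for label in labels]
print(decoded)  # ['n124w1', 'n124w2', 'n124w3']
```

The decoded values match the -Label "n124w$i" arguments used in the creation loop.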
The DNS server can be manually updated in order to obtain the following configuration:
Then, from n124m, I can connect to other machines like this:
From there, I can update /etc/hosts so that hostname and hostname -f return the right names.
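As an illustration of what such /etc/hosts entries can look like, here is a small sketch that formats them with the fully qualified name first and the short name as an alias, which is the usual layout for getting a fully qualified answer from hostname -f. The helper name hosts_line and, above all, the IP addresses are assumptions of mine: the post does not list the addresses Windows Azure assigned to the nodes.

```python
DOMAIN = "n124.benjguin.com"  # DNS domain used in this post


def hosts_line(ip, short_name, domain=DOMAIN):
    """Format one /etc/hosts entry: IP, fully qualified name, then the short alias."""
    return f"{ip}\t{short_name}.{domain}\t{short_name}"


# Hypothetical addresses, for illustration only.
example_ips = {"n124m": "10.124.2.4", "n124w1": "10.124.3.4"}
lines = [hosts_line(ip, name) for name, ip in example_ips.items()]
print("\n".join(lines))
```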
I can also format the data disks.
These are the same steps as before for n124w*. To summarize, here are a few screenshots for n124w2:
mkfs.ext3 /dev/sdc1
mkdir /mnt/datadrive
mount /dev/sdc1 /mnt/datadrive
vi /etc/fstab
add the following line at the end of the file:
/dev/sdc1 /mnt/datadrive ext3 defaults 1 2
(same for other n124w* machines).
Now I have a whole set of machines in the same network, and they are ready to have Hadoop installed on them.
Install Hadoop distribution
I can now start at step 2 of the installation in Hortonworks’ documentation.
connect to n124m
sudo su
rpm -Uvh https://public-repo-1.hortonworks.com/ambari/centos6/1.x/GA/ambari-1.x-1.el6.noarch.rpm
yum install epel-release
y
yum install ambari-server
y
(…)
ambari-server setup
(I use default answers during the setup)
ambari-server start
I am now at step 3 of the Hortonworks’ documentation
The documentation (https://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.2.2/bk_using_Ambari_book/content/ambari-chap3-1.html) says to connect to http://{main.install.hostname}:8080. I do that from the Windows Server machine, which is in the same local network as the Linux machines.
NB: in order to connect to n124dns, one simple way is to select this VM in the Windows Azure management portal and click Connect:
In a remote desktop on the n124dns machine, I do the following:
I connect with admin/admin
The next screen asks for the .ssh/id_rsa key. The simplest way is to copy it from an SSH session (benjguin@n124.cloudapp.net:22 in my case) and paste it into the browser.
In the Customize Services web page, I have
I choose to remove the /mnt/resource folders because they correspond to a disk that lives with the VM and is not persisted in Windows Azure storage. This disk is destroyed when the VM is destroyed. HDFS would support it, but I want to be able to stop my whole cluster without losing HDFS data. So I change to the following:
I do the same in the MapReduce tab.
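The change made in both tabs amounts to dropping every directory that lives under /mnt/resource from the proposed lists. The sketch below expresses that rule in code; the helper name drop_ephemeral and the sample directory list are hypothetical, since the actual values proposed by Ambari are not reproduced in this post.

```python
EPHEMERAL_PREFIX = "/mnt/resource"  # disk that is not persisted with the VM


def drop_ephemeral(dir_list, prefix=EPHEMERAL_PREFIX):
    """Remove entries under the ephemeral disk from a comma-separated directory list."""
    kept = []
    for path in dir_list.split(","):
        path = path.strip()
        # Keep the path unless it is the prefix itself or a folder below it.
        if path and path != prefix and not path.startswith(prefix + "/"):
            kept.append(path)
    return ",".join(kept)


# Hypothetical proposed value, for illustration only.
proposed = "/mnt/resource/hadoop/hdfs/data,/mnt/datadrive/hadoop/hdfs/data"
print(drop_ephemeral(proposed))
```

Only the folder on the attached data disk (/mnt/datadrive) remains, which is what allows the cluster to be stopped without losing HDFS data.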
I also enter the required passwords and click Next
The details are the following:
Admin Name : admin
Cluster Name : n124hdp
Total Hosts : 4 (4 new)
Local Repository : No
Services
HDFS
- NameNode : n124m.n124.benjguin.com
- SecondaryNameNode : n124w1.n124.benjguin.com
- DataNodes : 3 hosts
MapReduce
- JobTracker : n124m.n124.benjguin.com
- TaskTrackers : 3 hosts
Nagios
- Server : n124m.n124.benjguin.com
- Administrator : nagiosadmin / (web@benjguin.com)
Ganglia
- Server : n124m.n124.benjguin.com
Hive + HCatalog
- Hive Metastore : n124m.n124.benjguin.com
- Database : MySQL (New Database)
HBase
- Master : n124m.n124.benjguin.com
- Region Servers : 3 hosts
Oozie
- Server : n124m.n124.benjguin.com
ZooKeeper
- Servers : 3 hosts
…
Run
Let’s now test HDFS, PIG and HIVE by ourselves in this cluster.
I open a new SSH connection to the master node (n124m, available at n124.cloudapp.net:22)
Let’s copy the /etc/passwd file to the /hwork/in HDFS folder
[benjguin@n124m ~]$ cp test.pig /tmp
[benjguin@n124m ~]$ cd /tmp
[benjguin@n124m tmp]$ cat test.pig
A = load '/hwork/in' using PigStorage(':');
B = foreach A generate \$0 as id;
store B into '/hwork/out1';
[benjguin@n124m tmp]$ sudo -u hdfs pig test.pig
2013-04-05 10:54:08,586 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.1.23 (rexported) compiled Mar 28 2013, 12:20:36
2013-04-05 10:54:08,587 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/pig_1365159248577.log
2013-04-05 10:54:09,141 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://n124m.n124.benjguin.com:8020
2013-04-05 10:54:09,318 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: n124m.n124.benjguin.com:50300
2013-04-05 10:54:10,104 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2013-04-05 10:54:10,294 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2013-04-05 10:54:10,323 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2013-04-05 10:54:10,323 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2013-04-05 10:54:10,431 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2013-04-05 10:54:10,469 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2013-04-05 10:54:10,471 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job5927476079760658662.jar
2013-04-05 10:54:14,037 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job5927476079760658662.jar created
2013-04-05 10:54:14,061 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2013-04-05 10:54:14,101 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2013-04-05 10:54:14,603 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-04-05 10:54:15,036 [Thread-7] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-04-05 10:54:15,036 [Thread-7] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2013-04-05 10:54:15,052 [Thread-7] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
2013-04-05 10:54:15,053 [Thread-7] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
2013-04-05 10:54:15,053 [Thread-7] INFO org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
2013-04-05 10:54:15,056 [Thread-7] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2013-04-05 10:54:16,311 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201304050850_0005
2013-04-05 10:54:16,311 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://n124m.n124.benjguin.com:50030/jobdetails.jsp?jobid=job_201304050850_0005
2013-04-05 10:54:26,873 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2013-04-05 10:54:30,936 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-04-05 10:54:30,938 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.1.2.23 0.10.1.23 hdfs 2013-04-05 10:54:10 2013-04-05 10:54:30 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201304050850_0005 1 0 3 3 3 0 0 0 A,B MAP_ONLY /hwork/out1,
Input(s):
Successfully read 45 records (2588 bytes) from: "/hwork/in"
Output(s):
Successfully stored 45 records (304 bytes) in: "/hwork/out1"
Counters:
Total records written : 45
Total bytes written : 304
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201304050850_0005
2013-04-05 10:54:30,954 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
[benjguin@n124m tmp]$
so PIG works!
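For readers who do not know Pig, the test.pig script above loads colon-separated lines and keeps only the first field of each one. A rough local equivalent, without Hadoop, could look like the following sketch; the function name first_fields and the two sample lines are mine and only mimic the /etc/passwd format.

```python
def first_fields(lines, delimiter=":"):
    """Return the first delimiter-separated field of each line."""
    return [line.rstrip("\n").split(delimiter)[0] for line in lines]


# Hypothetical lines in /etc/passwd format, for illustration only.
sample = [
    "root:x:0:0:root:/root:/bin/bash\n",
    "hdfs:x:496:498:Hadoop HDFS:/var/lib/hadoop-hdfs:/bin/bash\n",
]
ids = first_fields(sample)
print(ids)  # ['root', 'hdfs']
```

The Pig version does the same thing as a map-only job, which is consistent with the job statistics above (1 map, 0 reduces).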
let’s now do the same thing with HIVE:
that works too!
Automate the cluster [de]provisioning
Once I have the VMs set up, I want to be able to stop paying for compute by shutting down and removing the VMs (while still keeping the virtual hard disks, or VHDs).
I also want to be able to restart the whole cluster quite quickly.
As seen before in this post, it is possible to automate this with scripts. Here are the scripts I use to stop and start this cluster.
NB: I took the OS disk names and data disk names from the Windows Azure portal.
To shut down the cluster, I use:
#region init
Import-Module 'c:\Program Files (x86)\Microsoft SDKs\Windows Azure\PowerShell\Azure\Azure.psd1'
$subscription = 'Azure bengui'
Set-AzureSubscription -SubscriptionName $subscription -CurrentStorageAccount 'northeurope2affstorage'
Set-AzureSubscription -DefaultSubscription $subscription
$cloudSvcName = 'n124'
#endregion
#region shutdown and delete
echo 'will shut down and remove the following'
#$vms = Get-AzureVM -ServiceName $cloudSvcName | where { !($_.name -eq 'n124dns') }
$vms = Get-AzureVM -ServiceName $cloudSvcName
$vms | select name
$vms | Stop-AzureVM
$vms | Remove-AzureVM
#endregion
To start up the cluster, I use:
Import-Module azure
$subscription = 'Azure bengui'
Set-AzureSubscription -SubscriptionName $subscription -CurrentStorageAccount 'northeurope2affstorage'
Set-AzureSubscription -DefaultSubscription $subscription
$cloudSvcName = 'n124'
$vNetName = 'n124'
$vms = @()
$vmName='n124dns'
$n124dns = New-AzureVMConfig -DiskName 'n124dns-n124dns-0-20130117132807' -InstanceSize Small -Name $vmName -Label $vmName |
Set-AzureSubnet 'Subnet-1' |
Add-AzureEndpoint -LocalPort 3389 -Name 'RDP' -Protocol tcp -PublicPort 3389
$vms += ,$n124dns
$vmName='n124m'
$vm1 = New-AzureVMConfig -DiskName 'n124-n124m-2013-04-03' -InstanceSize Medium -Name $vmName -Label $vmName |
Add-AzureDataDisk -DiskName 'n124-n124m-0-201304031626300873' -Import -LUN 0 |
Add-AzureEndpoint -LocalPort 22 -Name 'SSH' -Protocol tcp -PublicPort 22 |
Set-AzureSubnet 'Subnet-2'
$vms += ,$vm1
$osDiskNames = @('n124-n124w1-0-201304041529430938', `
'n124-n124w2-0-201304041530370691', `
'n124-n124w3-0-201304041532060509')
$dataDiskNames = @('n124-n124w1-0-201304041529470297', `
'n124-n124w2-0-201304041530410363', `
'n124-n124w3-0-201304041532090540')
#loop on the three VM we want to create
for($i=1; $i -le 3; $i++)
{
    Write-Host "creating n124w${i}OS"
    $sshPort = 52200 + $i
    $j = $i - 1
    #create a new VM Config
    $newVM = `
        New-AzureVMConfig -DiskName $osDiskNames[$j] -InstanceSize Medium -Name "n124w$i" `
            -AvailabilitySetName "n124wAvailabilitySet" -Label "n124w$i" |
        Add-AzureDataDisk -DiskName $dataDiskNames[$j] -Import -LUN 0 |
        Add-AzureEndpoint -LocalPort 22 -Name "SSH$i" -Protocol tcp -PublicPort $sshPort |
        Set-AzureSubnet 'Subnet-3'
    #add the VM config to the collection
    $vms += ,$newVM
}
#show the collection
$vms | format-table
#create the VM
New-AzureVM -ServiceName 'n124' -VMs $vms -VNetName 'n124'
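The worker loop in the start-up script relies on two conventions: worker number i gets public SSH port 52200 + i, and it reuses the disks stored at index i - 1 of the two arrays. The sketch below restates that pairing and adds a guard against attaching one worker's disk to another; the function name worker_plan and the guard are my own additions, not part of the PowerShell script.

```python
# Disk names copied from the start-up script above.
os_disks = [
    "n124-n124w1-0-201304041529430938",
    "n124-n124w2-0-201304041530370691",
    "n124-n124w3-0-201304041532060509",
]
data_disks = [
    "n124-n124w1-0-201304041529470297",
    "n124-n124w2-0-201304041530410363",
    "n124-n124w3-0-201304041532090540",
]


def worker_plan(os_disks, data_disks, base_port=52200):
    """Pair each worker with its disks and its public SSH port."""
    plan = []
    for i, (os_disk, data_disk) in enumerate(zip(os_disks, data_disks), start=1):
        name = f"n124w{i}"
        # Guard (my addition): each disk name should mention its own worker.
        if name not in os_disk or name not in data_disk:
            raise ValueError(f"disk names do not match {name}")
        plan.append({
            "name": name,
            "ssh_port": base_port + i,
            "os_disk": os_disk,
            "data_disk": data_disk,
        })
    return plan


plan = worker_plan(os_disks, data_disks)
for entry in plan:
    print(entry["name"], entry["ssh_port"], entry["os_disk"], entry["data_disk"])
```

With this mapping, n124w3 would be reachable from outside on port 52203 of the cloud service, as in the endpoint definitions of the script.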
So I now have a Linux cluster with Hadoop installed, and I can start and stop it (without losing HDFS data) with scripts.
Benjamin
Comments
- Anonymous, April 08, 2013: Excellent post Ben!
- Anonymous, March 04, 2014: Please also see Niall's Blog post: Allocating Static IP Addresses to your VMs, at http://3-4.fr/1bG87uF