Gluster on Azure - Part 1
How do you store and share large files, and I mean really large files, on Linux VMs on Azure? To make it concrete: I have several large files, each between 500GB and 1TB+, with a total size of 100TB+. I want to be able to share these files across multiple Linux VMs in Azure and access them using the standard I/O calls of Linux.
Azure has several storage solutions, including Azure Blob storage and Azure File Share. Azure Blob storage has a maximum file size of 1TB if you store the files as page blobs, or 200GB if you store them as block blobs, with a total maximum size of several hundred terabytes. It’s important to note that when you attach a disk to a VM, the disk is actually created in Azure Blob storage (keep this in mind because it will come in handy later in this article). Unfortunately, once you attach a disk to a Linux VM, you can’t attach the same disk to another VM.
Azure File Share is a little better, as it does in fact let you share files across multiple VMs using the standard SMB 2.1 protocol. Microsoft Azure virtual machines and cloud services can share file data across application components via mounted shares, and on-premises applications can access file data in a share via the File storage API. Since a File storage share is a standard SMB 2.1 file share, applications running in Azure can access data in the share via file I/O APIs. It’s important to note that the maximum file size on the share is 1TB and the total maximum share size is 5TB.
So here’s the problem again: how do you share a file that’s 1TB+ across multiple VMs running Linux?
The Gluster File System (GlusterFS) is an open source distributed file system that can scale out in building-block fashion to store multiple petabytes of data. Throughout this article I will walk you through step-by-step instructions for setting up GlusterFS on Azure.
In this article I will set up one storage account and three Linux VMs running Ubuntu 14.04. Two of the Ubuntu VMs will host my Gluster file system, and the third VM will act as the client that accesses the files on the Gluster servers. Of course you’re not limited to this configuration and can add or remove machines, both as Gluster servers and as clients.
A sample Gluster Topology in Azure
1. The first step in the process is to create a storage account. From the Azure Management Portal, click New, Data Services, Storage, Quick Create. Type in the name of your storage account and select the location and the other settings.
Create a New Storage account
2. We need to create several Linux VMs (I chose Ubuntu 14.04) by selecting: New, Compute, Virtual Machines, From Gallery. Once the wizard starts, select Ubuntu from the left menu and then choose Ubuntu Server 14.04 LTS. Click on the Next arrow to go to the next page.
3. Here you enter the standard VM settings, including the VM name, username, and password. Take note of how you’ll be connecting to the VM (username and password vs. certificate), because this is the authentication method you will use when connecting to the machine over SSH. Once you’re done with this configuration, click on the Next arrow.
4. On this page the important piece is to select the storage account that we created in Step 1 and finish the wizard.
Select Storage Account in VM wizard
5. Repeat the above 4 steps for all your VMs.
6. Once finished, go to the dashboard page for each server. On this page, take note of the server ports and configuration on the right-hand side. At the bottom of the page, select Attach Disk and enter the disk size. I chose 750GB for testing, but you can attach any disk up to 1TB. The size restriction exists because the disks are stored as blobs in Azure Blob storage, which limits file sizes to 1TB. Repeat this step for all your other servers.
VM Info and Disk
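Since each data disk tops out at 1TB, it helps to work out up front how many disks a given dataset will need across the Gluster servers. Here is a small sketch of that arithmetic; disks_needed is a hypothetical helper, sizes are in GB, and 1TB is taken as 1024GB.

```shell
# Hypothetical helper: number of data disks needed to hold a dataset,
# rounding up. Both arguments are sizes in GB.
disks_needed() {
    total_gb="$1"
    disk_gb="$2"
    echo $(( (total_gb + disk_gb - 1) / disk_gb ))
}

disks_needed 102400 750    # 100TB on 750GB disks, prints 137
disks_needed 102400 1024   # 100TB on 1TB disks, prints 100
```

The result is raw capacity only; it ignores file system overhead and any copies Gluster keeps for redundancy.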
7. I use PuTTY to connect to the Linux VMs, and for simplicity I entered all the connection info (found on the right-hand side of the Dashboard page of each VM) into PuTTY. Connect and log in to your first VM server.
[Setting up the newly attached disk]
8. In order to initialize and mount the new hard drive, you must first find the drive name. The Azure documentation says to run
# sudo grep SCSI /var/log/messages
or if you’re running Ubuntu run:
# sudo grep SCSI /var/log/syslog
Unfortunately neither of these worked for me. But finding the newly added drive is fairly easy. The drive that you attached is normally the last drive in the /dev directory that starts with “sd” followed by a letter (not a letter and a number but just a letter). For me, it was sdc. So in order to initialize the drive we need to run the following:
NOTE: You have to run the following commands as root or a superuser (run sudo su, or add sudo to the beginning of each command).
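As a rough sketch of this step (not necessarily the exact commands I ran), the script below lists the whole disks under /dev that have no partitions yet and then creates a single partition spanning the chosen disk. It assumes parted is installed and that the new disk is /dev/sdc, as it was for me. The helper names run and unpartitioned_disks and the DRY_RUN switch are made up for illustration. With DRY_RUN left at its default of 1, the partitioning command is only printed, not executed.

```shell
# Sketch only. DRY_RUN=1 (the default) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: run a command, or just show it in dry-run mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: read device names (one per line) on stdin and print
# the whole disks (sdb, sdc, ...) that have no numbered partitions.
unpartitioned_disks() {
    awk '
        /^sd[a-z]+$/       { disks[$0] = 1 }
        /^sd[a-z]+[0-9]+$/ { p = $0; sub(/[0-9]+$/, "", p); used[p] = 1 }
        END { for (d in disks) if (!(d in used)) print d }
    ' | sort
}

# Show candidate disks on this machine (may print nothing).
ls /dev | unpartitioned_disks

# Assumption: the new disk is /dev/sdc. Override with DISK=/dev/sdX.
DISK="${DISK:-/dev/sdc}"

# Create an MBR label and one primary partition covering the whole disk.
run parted --script "$DISK" mklabel msdos mkpart primary ext4 1MiB 100%
```

Double-check the device name before running this with DRY_RUN=0 as root, because partitioning the wrong disk destroys its data. An interactive fdisk session on the same device would achieve the same result.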
9. Next we need to format the disk.
[root]# mkfs.ext4 -L dataDisk /dev/sdc1
10. We need to create a mount point and mount our formatted drive.
[root]# mkdir -p /mnt/sharedDisk
[root]# mount /dev/sdc1 /mnt/sharedDisk
And that’s it for the disks. Again make sure you do the above steps on all your servers.
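A mount done by hand does not survive a reboot. One common way to make it permanent is an /etc/fstab entry keyed on the partition’s UUID. The sketch below only builds and prints such a line. fstab_line is a hypothetical helper and the UUID shown is a placeholder; the ext4 type and the /mnt/sharedDisk path follow the steps above.

```shell
# Hypothetical helper: print an /etc/fstab line for an ext4 data disk.
#   $1 = file system UUID, $2 = mount point
fstab_line() {
    uuid="$1"
    mount_point="$2"
    printf 'UUID=%s\t%s\text4\tdefaults\t0\t2\n' "$uuid" "$mount_point"
}

# Placeholder UUID for illustration only.
fstab_line "00000000-0000-0000-0000-000000000000" /mnt/sharedDisk
```

On a real server the UUID can be read with blkid -s UUID -o value /dev/sdc1, and the printed line would be appended to /etc/fstab as root. Running mount -a afterwards is a quick way to confirm the entry parses before the next reboot.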
[Installing GlusterFS]
11. Before installing Gluster we need to set up its repository. For the latest Gluster version and package information, check out the Gluster site. Since I am using Ubuntu, I can use the Ubuntu package manager. The first step is to add the Gluster PPA. This way I don’t need to worry about manually downloading and building the source code. This also makes updating Gluster a lot easier.
add-apt-repository ppa:semiosis/ubuntu-glusterfs-3.5
apt-get update
Finally, install the packages:
apt-get install glusterfs-server
To ensure that Gluster has been installed properly run the following:
glusterfs --version
Repeat the above steps on all your servers that will be part of the Gluster cluster.
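Because every server in the cluster should end up on the same Gluster release, it can be handy to script the version check. Here is a minimal sketch, assuming the first line of the glusterfs --version output looks like "glusterfs 3.5.2 built on ..." (that format is my assumption); gluster_version and is_series are hypothetical helpers.

```shell
# Hypothetical helper: read `glusterfs --version` output on stdin and print
# the version number, assumed to be the second word of the first line.
gluster_version() {
    awk 'NR == 1 { print $2 }'
}

# Hypothetical helper: succeed if version $1 belongs to release series $2,
# e.g. is_series 3.5.2 3.5
is_series() {
    case "$1" in
        "$2".*) return 0 ;;
        *)      return 1 ;;
    esac
}

if command -v glusterfs >/dev/null 2>&1; then
    v="$(glusterfs --version | gluster_version)"
    if is_series "$v" 3.5; then
        echo "OK: glusterfs $v"
    else
        echo "Unexpected glusterfs version: $v"
    fi
else
    echo "glusterfs is not installed on this machine"
fi
```

Run it on each server after the install; the 3.5 series matches the PPA added in step 11.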