Create a cluster with Data Lake Storage Gen2 using Azure CLI

To create an HDInsight cluster that uses Data Lake Storage Gen2 for storage, follow these steps.

Prerequisites

  • If you're unfamiliar with Azure Data Lake Storage Gen2, check out the overview section.
  • If you don't already have an Azure account, sign up for a free account before continuing.
  • To run the CLI script examples, you have three options:
    • Use Azure Cloud Shell from the Azure portal (see next section).
    • Use the embedded Azure Cloud Shell via the "Try It" button, located in the top-right corner of each code block.
    • Install the latest version of the Azure CLI (2.0.13 or later) if you prefer to use a local CLI console. Sign in to Azure by running az login with an account that is associated with the Azure subscription in which you want to deploy the user-assigned managed identity.
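If you use a local console, a quick way to confirm the 2.0.13 minimum is to compare version strings. The following is a minimal sketch, not part of the original article: the helper names version_at_least and check_az_version are hypothetical, and it assumes that sort -V (GNU version sort) and the az version command are available in your environment.

```shell
# Hypothetical helper: succeeds when version $2 is at least version $1.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_at_least() {
    required="$1"
    actual="$2"
    # The older of the two versions sorts first. If that is the required
    # version, the actual version is equal to it or newer.
    oldest="$(printf '%s\n%s\n' "$required" "$actual" | sort -V | head -n 1)"
    [ "$oldest" = "$required" ]
}

# Hypothetical helper: checks the installed Azure CLI against 2.0.13.
# Assumes `az version` exists; very old CLI releases only have `az --version`.
check_az_version() {
    installed="$(az version --query '"azure-cli"' --output tsv)" || return 1
    version_at_least "2.0.13" "$installed"
}
```

After defining the functions, run check_az_version; a zero exit status means the installed CLI meets the minimum.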

Azure Cloud Shell

Azure hosts Azure Cloud Shell, an interactive shell environment that you can use through your browser. You can use either Bash or PowerShell with Cloud Shell to work with Azure services. You can use the Cloud Shell preinstalled commands to run the code in this article, without having to install anything on your local environment.

To start Azure Cloud Shell:

  • Select Try It in the upper-right corner of a code or command block. Selecting Try It doesn't automatically copy the code or command to Cloud Shell.
  • Go to https://shell.azure.com, or select the Launch Cloud Shell button to open Cloud Shell in your browser.
  • Select the Cloud Shell button on the menu bar at the upper right in the Azure portal.

To use Azure Cloud Shell:

  1. Start Cloud Shell.

  2. Select the Copy button on a code block (or command block) to copy the code or command.

  3. Paste the code or command into the Cloud Shell session by pressing Ctrl+Shift+V on Windows and Linux, or Cmd+Shift+V on macOS.

  4. Press Enter to run the code or command.

Warning

Billing for HDInsight clusters is prorated per minute, whether you use them or not. Be sure to delete your cluster after you finish using it. See how to delete an HDInsight cluster.

Download the sample template file and the sample parameters file. Before you use the template and the Azure CLI code snippet below, replace the following placeholders with their correct values:

Placeholder              Description
<SUBSCRIPTION_ID>        The ID of your Azure subscription.
<RESOURCEGROUPNAME>      The resource group where you want the new cluster and storage account created.
<MANAGEDIDENTITYNAME>    The name of the managed identity that will be given permissions on your storage account with Azure Data Lake Storage Gen2.
<STORAGEACCOUNTNAME>     The new storage account with Azure Data Lake Storage Gen2 that will be created.
<FILESYSTEMNAME>         The name of the file system that this cluster should use in the storage account.
<CLUSTERNAME>            The name of your HDInsight cluster.
<PASSWORD>               Your chosen password for signing in to the cluster using SSH and the Ambari dashboard.

The code snippet below performs the following initial steps:

  1. Signs in to your Azure account.
  2. Sets the active subscription where the create operations will be done.
  3. Creates a new resource group for the new deployment activities.
  4. Creates a user-assigned managed identity.
  5. Adds an extension to the Azure CLI to use features for Data Lake Storage Gen2.
  6. Creates a new storage account with Data Lake Storage Gen2 by using the --hierarchical-namespace true flag.

# Sign in and select the subscription to deploy into
az login
az account set --subscription <SUBSCRIPTION_ID>

# Create resource group
az group create --name <RESOURCEGROUPNAME> --location eastus

# Create managed identity
az identity create -g <RESOURCEGROUPNAME> -n <MANAGEDIDENTITYNAME>

# Add the CLI extension that provides Data Lake Storage Gen2 features
az extension add --name storage-preview

# Create a storage account with the hierarchical namespace enabled
az storage account create --name <STORAGEACCOUNTNAME> \
    --resource-group <RESOURCEGROUPNAME> \
    --location eastus --sku Standard_LRS \
    --kind StorageV2 --hierarchical-namespace true
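
Before you continue, it can help to confirm that the new account really has the hierarchical namespace turned on. The following is a minimal sketch, not part of the original article: the function name show_hns_enabled is hypothetical, and it reads the isHnsEnabled property that az storage account show returns for the account.

```shell
# Hypothetical helper: prints the isHnsEnabled property of a storage account.
# A true value means Data Lake Storage Gen2 (hierarchical namespace) is enabled.
#   $1 = storage account name, $2 = resource group name
show_hns_enabled() {
    az storage account show \
        --name "$1" \
        --resource-group "$2" \
        --query isHnsEnabled \
        --output tsv
}
```

Call it with the values you used for <STORAGEACCOUNTNAME> and <RESOURCEGROUPNAME>.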

Next, sign in to the portal. Add the new user-assigned managed identity to the Storage Blob Data Owner role on the storage account. This step is described in step 3 under Using the Azure portal.

Important

Ensure that the user-assigned identity has the Storage Blob Data Owner role on your storage account. Otherwise, cluster creation fails.
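
The article assigns the role in the portal. If you would rather stay in the CLI, the role assignment can be sketched as follows. This is an alternative under stated assumptions, not the article's method: the function name assign_blob_data_owner is hypothetical, and the signed-in account must itself be allowed to create role assignments on the storage account (for example, as Owner or User Access Administrator).

```shell
# Hypothetical helper: grants the Storage Blob Data Owner role on a storage
# account to a user-assigned managed identity.
#   $1 = resource group, $2 = managed identity name, $3 = storage account name
assign_blob_data_owner() {
    resource_group="$1"
    identity_name="$2"
    storage_account="$3"

    # Look up the identity's principal (object) ID.
    principal_id="$(az identity show -g "$resource_group" -n "$identity_name" \
        --query principalId --output tsv)" || return 1

    # Look up the storage account's resource ID to use as the role scope.
    storage_id="$(az storage account show -g "$resource_group" -n "$storage_account" \
        --query id --output tsv)" || return 1

    az role assignment create \
        --assignee-object-id "$principal_id" \
        --assignee-principal-type ServicePrincipal \
        --role "Storage Blob Data Owner" \
        --scope "$storage_id"
}
```

Call it with the values you used for <RESOURCEGROUPNAME>, <MANAGEDIDENTITYNAME>, and <STORAGEACCOUNTNAME>.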

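After the role is assigned to the user-assigned managed identity, deploy the template with the following command:

az deployment group create --name HDInsightADLSGen2Deployment \
    --resource-group <RESOURCEGROUPNAME> \
    --template-file hdinsight-adls-gen2-template.json \
    --parameters parameters.json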
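
To see whether the deployment finished, you can query its provisioning state. The following is a minimal sketch, not part of the original article: the function name show_deployment_state is hypothetical, and the deployment name is the one passed to --name in the command above.

```shell
# Hypothetical helper: prints the provisioning state of a resource group
# deployment (for example, Running, Succeeded, or Failed).
#   $1 = resource group name, $2 = deployment name
show_deployment_state() {
    az deployment group show \
        --resource-group "$1" \
        --name "$2" \
        --query properties.provisioningState \
        --output tsv
}
```

For this article, the second argument would be HDInsightADLSGen2Deployment.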

Clean up resources

After you complete the article, you may want to delete the cluster. With HDInsight, your data is stored in Azure Storage, so you can safely delete a cluster when it isn't in use. You're also charged for an HDInsight cluster, even when it's not in use. Since the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they aren't in use.

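Enter all or some of the following commands to remove resources. The commands assume that the shell variables $clusterName, $resourceGroupName, $AZURE_STORAGE_ACCOUNT, and $AZURE_STORAGE_CONTAINER are set to the names of the resources you created: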

# Remove cluster
az hdinsight delete \
    --name $clusterName \
    --resource-group $resourceGroupName

# Remove storage container
az storage container delete \
    --account-name $AZURE_STORAGE_ACCOUNT \
    --name $AZURE_STORAGE_CONTAINER

# Remove storage account
az storage account delete \
    --name $AZURE_STORAGE_ACCOUNT \
    --resource-group $resourceGroupName

# Remove resource group
az group delete \
    --name $resourceGroupName
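
Deleting the resource group removes everything inside it, including the cluster and the storage account, so the last command alone is enough when the group holds nothing else you need. The following is a minimal sketch of a non-interactive variant, not part of the original article: the function name delete_group_and_check is hypothetical, and it uses --yes to skip the confirmation prompt.

```shell
# Hypothetical helper: deletes a resource group without prompting, then
# prints whether the group still exists.
#   $1 = resource group name
delete_group_and_check() {
    az group delete --name "$1" --yes || return 1
    az group exists --name "$1"
}
```

Use this only when you're sure that nothing else in the resource group needs to be kept.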

Troubleshoot

If you run into issues with creating HDInsight clusters, see access control requirements.

Next steps

You've successfully created an HDInsight cluster. Now learn how to work with your cluster.

Apache Spark clusters

Apache Hadoop clusters

Apache HBase clusters