Cluster and storage initialization


From: Developing big data solutions on Microsoft Azure HDInsight

HDInsight stores its data in Azure blob storage. This enables you to manage the location of your data, and to retain the data if you need to delete and recreate a cluster. It also gives you flexibility: you can store data for different applications in different storage accounts, create a cluster only when required, process the data in the relevant storage account, and then delete the cluster afterwards to minimize running costs. To understand how you can do this, see the following sections of this topic:

  • Storage accounts and containers
  • Deleting and recreating clusters

For information about creating a cluster using scripts or code, see Custom cluster management clients.

Storage accounts and containers

When you create a cluster, unless you specify otherwise, HDInsight creates a new storage account in the same datacenter as the cluster and uses the default container in this account to store its data. This account is automatically linked to the cluster (linked storage accounts are sometimes referred to as connected, configured, or managed storage accounts). The default container in this storage account is used by HDInsight as the root of the virtual HDFS store. When you access data using just the relative path (such as /myfiles/thisone.txt) HDInsight uses the default container.
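As a rough illustration of how a relative path maps onto the default container, the following Python sketch expands a relative path into the full wasbs:// form described later in this topic. The function name resolve_path and the container and account names are hypothetical, chosen only for this example; HDInsight performs the equivalent resolution internally, and this is not its implementation.

```python
def resolve_path(path, default_container, default_account):
    """Return a fully qualified wasbs:// URI for a path.

    Paths that already carry a scheme are returned unchanged.
    Relative paths are rooted in the default container, which
    acts as the root of the virtual HDFS store.
    """
    if "://" in path:
        return path
    root = f"wasbs://{default_container}@{default_account}.blob.core.windows.net"
    return root + "/" + path.lstrip("/")


# Hypothetical container and account names, for illustration only.
print(resolve_path("/myfiles/thisone.txt", "mycontainer", "myaccount"))
```

With these placeholder names, /myfiles/thisone.txt resolves to a location inside mycontainer in the myaccount storage account.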

You can add additional storage accounts during the cluster creation process. You can specify existing storage accounts, or you can allow HDInsight to create new storage accounts. These storage accounts are also linked to the cluster. HDInsight stores the credentials required to access all of the linked storage accounts in its configuration. Linked storage accounts are not deleted when you delete the cluster, which means that the data in them is retained and can be accessed afterwards.

You are not limited to linked storage accounts, however. You can process data from, and store the results in, any blob storage container in any Azure storage account by specifying the account name and the storage key when you submit a job. This provides flexibility, and can help to improve the security and manageability of your solution. For example, you can store parts of your data in separate storage accounts to help protect and isolate sensitive information, or use different storage accounts to stage data as part of your ingestion process.
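One way to supply the account name and key for a non-linked account is as a Hadoop configuration property. The sketch below assumes the property name used by the Hadoop Azure blob storage driver (fs.azure.account.key.[storage-account-name].blob.core.windows.net) and assumes that your job submission mechanism accepts -D style arguments; how the setting is actually passed depends on the job type and client you use. The function names and the account name are hypothetical, and the key is a placeholder.

```python
def storage_key_property(account_name, account_key):
    """Return the (name, value) pair that identifies one storage account key."""
    name = f"fs.azure.account.key.{account_name}.blob.core.windows.net"
    return name, account_key


def job_arguments(accounts):
    """Build -D style arguments for every non-linked account a job needs.

    `accounts` maps storage account names to their keys.
    """
    args = []
    for account_name, account_key in accounts.items():
        name, value = storage_key_property(account_name, account_key)
        args += ["-D", f"{name}={value}"]
    return args


# Placeholder values only; never hard-code real keys in scripts.
print(job_arguments({"stagingaccount": "<storage-key>"}))
```

Because the key travels with the job rather than living in the cluster configuration, only jobs that are given the key can reach the data in that account.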

You can also reduce runtime costs by creating the storage account and loading the data before you create the cluster. Additionally, using non-linked storage accounts can help to maximize security by isolating data for different users or tenants and allowing each one to manage their own storage account and upload the data to it themselves, before you process the data in your HDInsight cluster.

Considerations

Keep in mind the following when deciding how and when you will create storage accounts for a cluster:

  • The main advantage of allowing HDInsight to create one or more storage accounts that are automatically linked to the cluster is that you do not need to specify the storage account credentials (the account name and key) when a query or transformation process running on your HDInsight cluster accesses the data. HDInsight automatically stores the required credentials within its configuration. However, you will still need to obtain the storage key when you want to upload data to the storage account or access the results from outside the cluster.
  • The main advantage of using non-linked storage accounts and containers is the flexibility this provides in choosing the storage account to use with each job. However, you must specify the target storage account name and key within your query or transformation when you access data stored in accounts that are not linked to the cluster.
  • You can specify the storage accounts that are linked to the cluster only when you create the cluster. You cannot add or remove linked accounts after a cluster has been created. If you need more than one storage account to be linked to your cluster, you must specify them all as part of the cluster creation operation.
  • You can create the storage accounts before or after you create the cluster. Typically you will use this capability to minimize cluster runtime cost by creating the storage accounts (or using existing storage accounts) and loading the data before you create the cluster.
  • If you store parts of your data in different storage accounts, perhaps to separate sensitive data such as personally identifiable information (PII) and account information from non-sensitive data, you can create a cluster that uses just a subset of these as the linked accounts. This allows you to isolate and protect parts of the data while avoiding the need to specify storage account credentials in queries and transformations. Be aware, however, that code running in HDInsight will have full access to all of the data in a linked account because the account name and key are stored in the cluster configuration.
  • If you do not specify the storage account and path to the data when you submit a job, HDInsight will use the default container. If you intend to use accounts and containers other than the default, or delete and then recreate a cluster over the same data, specify the full path of the account and container in all queries and transformation processes that you will execute on your HDInsight cluster. This ensures that each job accesses the correct container, and prevents errors if you subsequently delete and recreate the cluster with different default containers. The full path and name of a container is in the form wasbs://[container-name]@[storage-account-name].blob.core.windows.net.
  • Any storage accounts associated with an HDInsight cluster should be in the same datacenter as the cluster, and must not be in an affinity group. Using a container in a storage account in a different datacenter will result in delays as data is transmitted between datacenters, and you will be billed for these data transfers.
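The considerations above about full paths and linked accounts can be sketched as a small check that splits a full container path into its parts and reports whether a job would need to supply the storage key itself. This is a minimal Python sketch: the function names, the account names, and the idea of tracking linked accounts in a set are assumptions made for illustration, not part of HDInsight.

```python
from urllib.parse import urlsplit

BLOB_SUFFIX = ".blob.core.windows.net"


def parse_container_uri(uri):
    """Split a full wasbs:// path into (container, account, path)."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    if parts.scheme != "wasbs" or not parts.username or not host.endswith(BLOB_SUFFIX):
        raise ValueError(f"not a full container path: {uri!r}")
    return parts.username, host[: -len(BLOB_SUFFIX)], parts.path


def needs_explicit_key(uri, linked_accounts):
    """Return True when the account in `uri` is not linked to the cluster,
    so the job must specify the account name and key itself."""
    _, account, _ = parse_container_uri(uri)
    return account not in linked_accounts


# Hypothetical account names, for illustration only.
linked = {"clusterdata"}
print(needs_explicit_key("wasbs://logs@clusterdata.blob.core.windows.net/in", linked))
print(needs_explicit_key("wasbs://pii@secureaccount.blob.core.windows.net/in", linked))
```

A check like this could run before job submission to catch paths that point at a non-linked account for which no key has been provided.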

Deleting and recreating clusters

When you delete a cluster, the data for that cluster is retained in the associated Azure blob storage containers. When you subsequently create a new cluster you can specify the original default container as the default container for the new cluster, and all of the data will be available for processing. You can specify multiple existing storage accounts and containers when you create the cluster, which means that you can create a cluster over just the data that you want to process.

However, some metadata for the cluster is stored in an Azure SQL Database. This includes the definitions of any Hive tables you created with the EXTERNAL option, and HCatalog metadata that maps data files to schemas (the data for Hive tables is in blob storage). By default this database is created and populated with the cluster metadata automatically when a new cluster is created.

When the cluster is deleted, the database is also deleted. To avoid this, you can specify an existing database to hold the cluster metadata when you create the cluster. This database is not deleted when the cluster is deleted. When you recreate the cluster you can specify the same database, and all of the metadata it contains will be available to the new cluster. The Hive tables (and any indexes or other features they contain) and the HCatalog information will be available and accessible in the new cluster.

Considerations

Keep in mind the following when deleting and recreating your HDInsight clusters:

  • If you want to retain the schema definitions of Hive tables and the HCatalog metadata, you must specify an existing SQL Database instance when you create the cluster for the first time. If you allow HDInsight to create the database, it will be deleted when you delete the cluster.
  • The data for Hive tables you create in the cluster is retained only if you specify the EXTERNAL option when you create the tables.
  • You can back up and restore a SQL Database instance, and export or import the data, using the tools provided by the Azure management portal or through scripting using the REST interface for SQL Database.
  • Ensure you set the required configuration properties for a cluster when you create it. You can change some properties at runtime for individual jobs (see Configuring and debugging solutions for details), but you cannot change the properties of an existing cluster. See Custom cluster management clients for information about automating the creation of clusters and setting cluster properties.
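To illustrate the EXTERNAL option in combination with a full container path, the following Python sketch generates a HiveQL statement for an external table whose data stays in a named container. The function name, table name, columns, file format, and storage location are all hypothetical, and the sketch does no quoting or validation of the values it is given.

```python
def external_table_ddl(table, columns, location):
    """Build a HiveQL CREATE EXTERNAL TABLE statement.

    `columns` is a list of (name, hive_type) pairs and `location`
    is the full wasbs:// path of the folder that holds the data.
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols})\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}';"
    )


# Hypothetical table and storage location, for illustration only.
print(external_table_ddl(
    "weblogs",
    [("log_time", "STRING"), ("bytes_sent", "INT")],
    "wasbs://logs@myaccount.blob.core.windows.net/weblogs",
))
```

Using the full path in the LOCATION clause means the table definition does not depend on which container happens to be the default when the cluster is recreated.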
