Custom cluster management clients


From: Developing big data solutions on Microsoft Azure HDInsight

Options for provisioning (creating) and deleting an HDInsight cluster include:

The correct approach to cluster provisioning depends on the specific business requirements and constraints, but the following table describes typical approaches in relation to the common big data use cases and models discussed in this guide.

Use case

Considerations

Iterative data exploration

Creating and deleting the cluster manually through the Azure management portal when required may be acceptable for data exploration scenarios where data processing and analysis are performed interactively, on an occasional basis, by a dedicated team of data analysts. However, if the analysis is more frequent, the analysts might benefit from creating a simple script or command line utility to automate the process of creating and deleting the cluster.

Data warehouse on demand

Data warehouses built on HDInsight are usually based on Hive tables, and the cluster must be running to service Hive queries. If the data warehouse is queried directly by users and applications, you may need to keep the cluster running continually. However, if the data warehouse is used only as a data source for analytical data models (for example, in SQL Server Analysis Services or PowerPivot workbooks) or for cached reports, you can create the cluster on demand, process the new data, refresh the dependent data models and reports, and then delete the cluster.

ETL automation

When HDInsight is used to filter and shape data in an ETL process, the destination of the transformed data is usually another data store such as a SQL Server database. Depending on the frequency of the ETL cycle, you may choose to include provisioning and deletion of the cluster in the ETL process itself. In this case, cluster creation and deletion are likely to be automated along with data ingestion, job execution, and the data transfer tasks of the ETL workflow.

BI integration

In a managed BI solution, where HDInsight is used primarily as a means of preparing big data for inclusion in an existing enterprise BI data warehouse or data models, the cluster provisioning requirements are likely to be similar to those of the data warehouse on demand and ETL automation models. If the HDInsight cluster must support self-service BI that includes direct big data processing by business users, you may need to consider keeping the cluster online continually.
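The ETL automation row above describes folding cluster creation and deletion into the workflow itself. The following is a minimal Python sketch of that shape, not an HDInsight API: `run_with_transient_cluster` and every function passed to it are hypothetical stand-ins that only record what a real workflow would do. The point it illustrates is that deletion sits in a `finally` block, so the cluster is removed even when a step fails.

```python
from typing import Callable, Iterable, List


def run_with_transient_cluster(
    create_cluster: Callable[[], str],
    delete_cluster: Callable[[str], None],
    steps: Iterable[Callable[[str], None]],
) -> None:
    """Create a cluster, run every step against it, then delete it.

    The delete call sits in a finally block, so the cluster is removed
    even when one of the steps raises an exception.
    """
    cluster_name = create_cluster()
    try:
        for step in steps:
            step(cluster_name)
    finally:
        delete_cluster(cluster_name)


# Hypothetical stand-ins: they record events instead of calling Azure.
events: List[str] = []


def fake_create() -> str:
    events.append("create")
    return "etl-cluster"


def fake_delete(name: str) -> None:
    events.append(f"delete {name}")


def ingest(name: str) -> None:
    events.append(f"ingest on {name}")


def run_job(name: str) -> None:
    events.append(f"job on {name}")


def export(name: str) -> None:
    events.append(f"export from {name}")


run_with_transient_cluster(fake_create, fake_delete, [ingest, run_job, export])
print(events)
```

In a real solution the stand-ins would be replaced by whichever provisioning tool you choose, such as a PowerShell script or a client built with the HDInsight SDK.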

Considerations

When planning how you will create a cluster for your solution, also consider the following points:

  • As part of the cluster provisioning process you may also need to create or manage storage accounts. Often you will do this only once, and reuse the same storage account each time you run your automated solution. For more information, see Cluster and storage initialization.
  • You should set all the properties for your cluster when you create it, using the techniques described in this section of the guide. This ensures that the configuration is fixed in the cluster definition, and will be reapplied to any virtual servers that make up the cluster if they are automatically restarted after a failure or an upgrade. Virtual server management within the datacenter may occur at any time, and you cannot control this. If you edit the configuration files directly, any changes will be lost when a server restarts. However, you can change some cluster properties for individual jobs; see Configuring and debugging solutions for details.
  • Be careful how and when you delete a cluster as part of an automated solution. You may need to implement a task that backs up the data and/or the metadata first. Ensure tools that allow users to delete clusters perform user authentication and authorization to protect against accidental and malicious use.
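The last point above, about guarding deletion, can be sketched as follows. This is an illustrative Python outline under stated assumptions, not part of any HDInsight tool: `delete_cluster_safely`, `NotAuthorizedError`, and the `backup` and `delete` callables are hypothetical names, and the user is assumed to have been authenticated already by the calling application. The sketch checks authorization first, then backs up, and deletes only if the backup did not raise an error.

```python
from typing import Callable, Set


class NotAuthorizedError(Exception):
    """Raised when a user is not allowed to delete clusters."""


def delete_cluster_safely(
    cluster_name: str,
    user: str,
    authorized_users: Set[str],
    backup: Callable[[str], None],
    delete: Callable[[str], None],
) -> None:
    """Delete a cluster only for an authorized user, after a backup.

    Assumes `user` was authenticated by the caller. If `backup` raises,
    the exception propagates and `delete` is never called.
    """
    if user not in authorized_users:
        raise NotAuthorizedError(f"{user} may not delete {cluster_name}")
    backup(cluster_name)  # back up data and/or metadata first
    delete(cluster_name)  # reached only when the backup succeeded
```

Passing the backup and delete operations in as callables keeps the ordering and the authorization check in one place, whichever tool performs the actual work.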

More information

For information about creating end-to-end automated solutions that include automated cluster management stages, see Building end-to-end solutions using HDInsight.

For more details of the tools and technologies available for automating cluster management see Appendix A - Tools and technologies reference.

For information on using PowerShell with HDInsight see HDInsight PowerShell Cmdlets Reference Documentation.

For information on using the HDInsight SDK, see HDInsight SDK Reference Documentation and the incubator projects on the CodePlex website.

The topic Provision HDInsight clusters on the Azure website shows several ways that you can provision a cluster.
