Azure HDInsight On-Demand Linked Service

 

Important

This topic describes the JSON format supported by older versions of Azure PowerShell. If you are using the July 2015 release of Azure PowerShell or later, see Azure HDInsight On-Demand Linked Service for the latest JSON format. You can convert JSON from the old format to the new format by using the JSON Upgrade Tool.

You create an on-demand Azure HDInsight linked service to link an on-demand Azure HDInsight cluster to an Azure data factory. This topic describes the JSON properties for defining an on-demand HDInsight linked service. When you use the Data Factory Editor in the Azure classic portal, the editor provides a JSON template that you can modify and deploy to create a linked service. To create a linked service with the New-AzureDataFactoryLinkedService cmdlet instead, create a JSON file and pass it to the cmdlet.
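As a minimal sketch of preparing the JSON file for the cmdlet, the Python snippet below writes a definition with the same shape as Example 1 later in this topic. The function name, its parameters, and the output file name are hypothetical and chosen only for illustration; the property names and sample values come from the examples in this topic.

```python
import json
from pathlib import Path


def write_linked_service_definition(path, name, storage_linked_service,
                                    cluster_size=4, time_to_live="00:05:00",
                                    jobs_container="adfjobs", version=None):
    """Write an on-demand HDInsight linked service definition to a JSON file.

    Hypothetical helper: the defaults mirror Example 1 in this topic and are
    not service defaults.
    """
    properties = {
        "type": "HDInsightOnDemandLinkedService",
        "clusterSize": cluster_size,
        "jobsContainer": jobs_container,
        "timeToLive": time_to_live,
        "linkedServiceName": storage_linked_service,
    }
    # The version property is optional, so it is emitted only when given.
    if version is not None:
        properties["version"] = version

    definition = {"name": name, "properties": properties}
    Path(path).write_text(json.dumps(definition, indent=4), encoding="utf-8")
    return definition
```

Calling `write_linked_service_definition("HDInsightOnDemandLinkedService.json", "HDInsightOnDemandCluster", "MyBlobStore", version="3.1")` would produce a file equivalent to Example 1, which can then be supplied to the cmdlet.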

The Azure Data Factory service automatically creates the on-demand HDInsight cluster to process data. The cluster is created in the same data center location as the storage account (the linkedServiceName property in the JSON) associated with it. You will not see the cluster in your subscription; the Azure Data Factory service manages it for you. Logs for the jobs that run on the cluster are copied to the storage account associated with the cluster. You can access these logs from the Data Factory portal in the Activity Run Details blade.

Warning

It takes 15+ minutes for the HDInsight cluster to be created on demand. You will be charged only for the time when the HDInsight cluster is up and running jobs.

Important

This topic describes the JSON format for defining an on-demand Azure HDInsight linked service with older versions of the API (prior to 2015-05-01-preview). If you are using a newer version of the API, see Azure HDInsight On-Demand Linked Service for the latest JSON format.

Properties

Property

Description

type

The type property should be set to HDInsightOnDemandLinkedService.

clusterSize

The size of the on-demand cluster, specified as the number of nodes.

jobsContainer

The blob container that holds data used by Pig, Hive, and package jobs, and where the cluster logs are stored.

timeToLive

Specifies how long the on-demand HDInsight cluster stays alive after it finishes processing a slice. If another slice needs to be processed while the cluster is alive, the same cluster processes it.

For example, if processing a slice takes 6 minutes and timeToLive is set to 5 minutes, the cluster stays alive for 5 minutes after the 6 minutes of processing. If another slice becomes ready to run during that time, the same cluster processes it and then stays alive for another 5 minutes after finishing the second slice. In other words, as long as the cluster is never idle for more than timeToLive, it stays alive to process more slices.

Creating an on-demand HDInsight cluster is an expensive operation that can take a while, so use this setting as needed to improve the performance of a data factory by reusing an on-demand HDInsight cluster.

If you set this value to 0, the cluster is deleted as soon as the slice is processed. If you set a high value and the cluster sits idle because there are no slices to process, you incur unnecessary charges for running it. Set the value to suit your workload.

The timeToLive setting crosses pipeline boundaries. If activities in two pipelines point to the same on-demand HDInsight linked service, they may use the same cluster instance; that is, slices processed by the two pipelines may share one on-demand HDInsight cluster.

version

Version of the HDInsight cluster. This property is optional.

linkedServiceName

The blob store to be used by the on-demand cluster for storing and processing data.

additionalLinkedServiceNames

Specifies additional storage accounts for the HDInsight linked service so that the Data Factory service can register them on your behalf.
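The cluster reuse behavior described for timeToLive can be illustrated with a small simulation. The Python sketch below is a simplified model, not the service's scheduling logic: it assumes slices are processed one at a time, ignores the time needed to create a cluster, and assumes the "hh:mm:ss" format used by the timeToLive values in the examples. All function names are hypothetical.

```python
from datetime import timedelta


def parse_time_to_live(value):
    """Parse an "hh:mm:ss" string such as "00:05:00" (format assumed from the examples)."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def count_cluster_creations(slices, time_to_live):
    """Count how many clusters a sequence of slices would need.

    slices: list of (ready_at, duration) timedelta pairs, sorted by ready_at
    and measured from a common starting point.
    """
    ttl = parse_time_to_live(time_to_live)
    creations = 0
    cluster_free_at = None  # when the current cluster finished its last slice

    for ready_at, duration in slices:
        if cluster_free_at is None or ready_at - cluster_free_at > ttl:
            # No cluster yet, or it sat idle longer than timeToLive and was deleted.
            creations += 1
            start = ready_at
        else:
            # The cluster is still alive, so the slice reuses it.
            start = max(ready_at, cluster_free_at)
        cluster_free_at = start + duration

    return creations


if __name__ == "__main__":
    minutes = lambda n: timedelta(minutes=n)
    # Three 6-minute slices that become ready at 0, 10, and 30 minutes.
    slices = [(minutes(0), minutes(6)), (minutes(10), minutes(6)), (minutes(30), minutes(6))]
    print(count_cluster_creations(slices, "00:05:00"))  # 2
    print(count_cluster_creations(slices, "00:00:00"))  # 3
    print(count_cluster_creations(slices, "01:30:00"))  # 1
```

With a 5-minute timeToLive, the second slice arrives 4 minutes after the first finishes and reuses the cluster, while the third arrives after a 14-minute idle gap and needs a new one. A longer timeToLive avoids the extra creation at the cost of paying for the idle time in between.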

You can also specify the following properties for creating/configuring the on-demand HDInsight cluster.

Property

Description

coreConfiguration

Specifies the core configuration parameters (as in core-site.xml) for the HDInsight cluster to be created.

hBaseConfiguration

Specifies the HBase configuration parameters (hbase-site.xml) for the HDInsight cluster.

hdfsConfiguration

Specifies the HDFS configuration parameters (hdfs-site.xml) for the HDInsight cluster.

hiveConfiguration

Specifies the Hive configuration parameters (hive-site.xml) for the HDInsight cluster.

mapReduceConfiguration

Specifies the MapReduce configuration parameters (mapred-site.xml) for the HDInsight cluster.

oozieConfiguration

Specifies the Oozie configuration parameters (oozie-site.xml) for the HDInsight cluster.

stormConfiguration

Specifies the Storm configuration parameters (storm-site.xml) for the HDInsight cluster.

yarnConfiguration

Specifies the YARN configuration parameters (yarn-site.xml) for the HDInsight cluster.

See Example 3 – with cluster creation parameters for a definition that uses some of these properties.

Examples

Example 1

{
    "name": "HDInsightOnDemandCluster",
    "properties": 
    {
        "type": "HDInsightOnDemandLinkedService",
        "clusterSize": 4,
        "jobsContainer": "adfjobs",
        "timeToLive": "00:05:00",
        "version": "3.1",
        "linkedServiceName": "MyBlobStore"
    }
}

Example 2 – with additionalLinkedServiceNames

{
    "name": "MyHDInsightOnDemandLinkedService",
    "properties":
    {
        "type": "HDInsightOnDemandLinkedService",
        "clusterSize": 1,
        "timeToLive": "00:01:00",
        "linkedServiceName": "LinkedService-SampleData",
        "additionalLinkedServiceNames": [ "otherLinkedServiceName1", "otherLinkedServiceName2" ] 
    }
}

Example 3 – with cluster creation parameters

{
    "name": "ASOSHDInsightCluster1",
    "properties": 
    {
        "type": "HDInsightOnDemandLinkedService",
        "clusterSize": 16,
        "jobsContainer": "adfjobs",
        "timeToLive": "01:30:00",
        "version": "3.1",
        "linkedServiceName": "adfods1",
        "coreConfiguration":
        {
            "templeton.mapper.memory.mb": "5000"
        },
        "hiveConfiguration":
        {
            "templeton.mapper.memory.mb": "5000"
        },
        "mapReduceConfiguration":
        {
            "mapreduce.reduce.java.opts": "-Xmx4000m",
            "mapreduce.map.java.opts": "-Xmx4000m",
            "mapreduce.map.memory.mb": "5000",
            "mapreduce.reduce.memory.mb": "5000",
            "mapreduce.job.reduce.slowstart.completedmaps": "0.8"
        },
        "yarnConfiguration":
        {
            "yarn.app.mapreduce.am.resource.mb": "5000",
            "mapreduce.map.memory.mb": "5000"
        },
        "additionalLinkedServiceNames": [ "datafeeds", "adobedatafeed" ]
    }
}
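To review which Hadoop configuration files a definition such as Example 3 affects, the following Python sketch groups the configuration sections by the site file named for each property in the table above. The function and constant names are hypothetical, and the snippet only inspects the JSON; it does not contact any Azure service.

```python
import json

# Mapping taken from the cluster configuration property table in this topic.
SITE_FILES = {
    "coreConfiguration": "core-site.xml",
    "hBaseConfiguration": "hbase-site.xml",
    "hdfsConfiguration": "hdfs-site.xml",
    "hiveConfiguration": "hive-site.xml",
    "mapReduceConfiguration": "mapred-site.xml",
    "oozieConfiguration": "oozie-site.xml",
    "stormConfiguration": "storm-site.xml",
    "yarnConfiguration": "yarn-site.xml",
}


def settings_by_site_file(definition_json):
    """Return {site file name: settings} for the configuration sections present."""
    properties = json.loads(definition_json)["properties"]
    return {
        SITE_FILES[name]: dict(values)
        for name, values in properties.items()
        if name in SITE_FILES
    }


if __name__ == "__main__":
    # A shortened version of Example 3, used only for illustration.
    sample = """
    {
        "name": "ASOSHDInsightCluster1",
        "properties":
        {
            "type": "HDInsightOnDemandLinkedService",
            "clusterSize": 16,
            "linkedServiceName": "adfods1",
            "coreConfiguration": { "templeton.mapper.memory.mb": "5000" },
            "yarnConfiguration": { "yarn.app.mapreduce.am.resource.mb": "5000" }
        }
    }
    """
    for site_file, settings in settings_by_site_file(sample).items():
        print(site_file, settings)
```

Properties that are not configuration sections, such as clusterSize or linkedServiceName, are skipped, so the result lists only the settings that would be applied to the cluster's site files.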