I'm getting an error when trying to run the demo Spark word count in Azure Data Factory using HDInsight and a Spark activity. All services were successfully created and tested, but when the Spark pipeline is triggered, the following error is displayed:
Operation on target Wordcount failed: Failed to create the on demand HDI cluster. Cluster or linked service name: 'linkedService1', error: 'The storage connection string is invalid. Value cannot be null.
Parameter name: connectionString'
Looking at the HDInsight linked service shown below, there is no connection string.
Here are some details on my data factory setup. It's complicated by policy constraints:
Blob storage was created with public access enabled only for selected virtual networks and IPs.
So I created a private endpoint connection for use by the data factory, since I cannot open a network range in the storage account for the data factory.
Created a self-hosted integration runtime in the data factory, as this is required for private endpoints.
The HDInsight linked service uses the self-hosted integration runtime to connect to blob storage. I tested the connection to the storage account and folder; both succeed.
The Spark pipeline execution fails immediately with the error shown above.
Any ideas on how to proceed?
The linked service JSON has no property named connectionString:
{
    "name": "linkedService1",
    "type": "Microsoft.DataFactory/factories/linkedservices",
    "properties": {
        "annotations": [],
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterType": "hadoop",
            "clusterSize": 4,
            "timeToLive": "00:15:00",
            "version": "4.0",
            "hostSubscriptionId": "xxxxxxx",
            "clusterResourceGroup": "scott-poc-rg",
            "servicePrincipalId": "yyyyyyy",
            "osType": "Linux",
            "tenant": "zzzzzzzzz",
            "clusterNamePrefix": "",
            "clusterUserName": "scotts",
            "clusterPassword": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "AzureKeyVault1",
                    "type": "LinkedServiceReference"
                },
                "secretName": "dfServicePrincipalKey"
            },
            "clusterSshUserName": "scotts",
            "clusterSshPassword": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "AzureKeyVault1",
                    "type": "LinkedServiceReference"
                },
                "secretName": "dfServicePrincipalKey"
            },
            "servicePrincipalKey": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "AzureKeyVault1",
                    "type": "LinkedServiceReference"
                },
                "secretName": "dfServicePrincipalKey"
            },
            "additionalLinkedServiceNames": [],
            "linkedServiceName": {
                "referenceName": "AzureBlobStorage1",
                "type": "LinkedServiceReference"
            },
            "headNodeSize": "",
            "dataNodeSize": "",
            "zookeeperNodeSize": ""
        },
        "connectVia": {
            "referenceName": "integrationRuntime1-selfhosted",
            "type": "IntegrationRuntimeReference"
        }
    }
}
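As far as I can tell, the connectionString the error mentions belongs to the referenced storage linked service (AzureBlobStorage1 above), not to the HDInsightOnDemand linked service itself. For comparison, here is a minimal sketch of what a connection-string-based AzureBlobStorage definition would look like; the placeholder account name and key are hypothetical and would normally come from Key Vault rather than be inlined:

```json
{
    "name": "AzureBlobStorage1",
    "type": "Microsoft.DataFactory/factories/linkedservices",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<account-key>;EndpointSuffix=core.windows.net"
        },
        "connectVia": {
            "referenceName": "integrationRuntime1-selfhosted",
            "type": "IntegrationRuntimeReference"
        }
    }
}
```

My actual AzureBlobStorage1 linked service does not use a connection string, which may be relevant to the error.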
Lastly, my Spark job connects to storage as shown below:
lines = spark.read.text("wasbs://adfturorial@aaaaaac.blob.core.windows.net/spark/inputfiles/minecraftstory.txt").rdd.map(lambda r: r[0])
It's also worth noting that we have a storage policy that only allows HTTPS access. I'm not sure whether that is supported by wasbs?
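For reference, my understanding is that `wasbs` is the TLS variant of the `wasb` scheme, so the Hadoop Azure driver talks to the HTTPS blob endpoint. A small sketch that just pulls apart the URI used in the Spark job above (purely illustrative, no Azure access involved):

```python
from urllib.parse import urlparse

# The wasbs:// URI from the Spark job; "wasbs" is the TLS/HTTPS
# variant of "wasb", i.e. wasbs://<container>@<account-host>/<path>.
uri = "wasbs://adfturorial@aaaaaac.blob.core.windows.net/spark/inputfiles/minecraftstory.txt"

parsed = urlparse(uri)
container, account_host = parsed.netloc.split("@", 1)

print(parsed.scheme)   # wasbs
print(container)       # adfturorial
print(account_host)    # aaaaaac.blob.core.windows.net
print(parsed.path)     # /spark/inputfiles/minecraftstory.txt
```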