Invalid configuration value detected for fs.azure.account.key using Azure Databricks autoloader

Thomas Bailey 11 Reputation points
2022-07-14T12:46:28.457+00:00

The following is true of my setup:

  1. The cluster has its Spark config set to apply the data lake's endpoint and account key (a sketch of this pattern follows the list).
  2. I have pre-deployed the system topics & queue (via IaC ARM template deployments from YAML pipelines), and they are successfully receiving events. The example queue here is named 'queue1'.
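
For context, the account-key side of point 1 follows the standard pattern below. This is a sketch only, shown as notebook-level spark.conf.set calls rather than the cluster UI's key-value lines; the secret scope and key names are placeholders:

# Account-key auth for ADLS Gen2, with the key held in a secret scope
spark.conf.set(
    "fs.azure.account.key.datalake.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope-name>", key="<storage-account-key-name>"))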

The following masked and anonymised PySpark code fails with:

Error while reading file abfss:REDACTED_LOCAL_PART@datalake.dfs.core.windows.net/<folder_name>/2022/07/14/<file_name>.json
Invalid configuration value detected for fs.azure.account.key.
Caused by: Invalid configuration value detected for fs.azure.account.key

The schema variable is defined in preceding code and returns a valid struct-based schema.
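
For illustration only, schema is a StructType along these lines; the field names here are placeholders, not the real ones:

from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema; the real fields depend on the incoming JSON
schema = StructType([
    StructField("id", StringType(), True),
    StructField("eventTime", StringType(), True),
    StructField("payload", StringType(), True),
])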

#cloudFiles config
cloudFiles_cfg = {
  "cloudFiles.subscriptionId": "61******-****-****-****-***********7",
  "cloudFiles.tenantId": "14******-****-****-****-***********f",
  "cloudFiles.clientId": "07******-****-****-****-***********e",
  "cloudFiles.clientSecret": "***************************",
  "cloudFiles.resourceGroup": "rg-datahub",
  "cloudFiles.connectionString": "BlobEndpoint=https://datalake.blob.core.windows.net/;QueueEndpoint=https://datalake.queue.core.windows.net/;FileEndpoint=https://datalake.file.core.windows.net/;TableEndpoint=https://datalake.table.core.windows.net/;SharedAccessSignature=sv=2021-06-08&ss=bfqt&srt=sco&sp=rwdlacupx&se=2032-07-14T20:01:35Z&st=2022-07-14T12:01:35Z&spr=https&sig=***************************************",
  "cloudFiles.storageAccount": "datalake",
  "cloudFiles.format": "json",
  "cloudFiles.useNotifications": "true",
  "cloudFiles.queueName": "queue1",
}

incoming = (spark.readStream
              .format("cloudFiles")
              .options(**cloudFiles_cfg)
              .schema(schema)
              .load()
           )
display(incoming)
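
As an aside (unrelated to the failure), the client secret and the SAS token in cloudFiles.connectionString would normally come from a secret scope rather than being inlined; for example, assuming a scope and key name:

# Hypothetical secret-scope lookup instead of an inline secret
cloudFiles_cfg["cloudFiles.clientSecret"] = dbutils.secrets.get(
    scope="<scope-name>", key="<client-secret-key-name>")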

On executing the stream, the following behaviour occurs:

  1. The stream initialises successfully.
  2. If the queue is empty, the stream continues to poll happily, returning blank results.
  3. As soon as a message is added to the queue, the stream processes it but fails with the error above.
  4. Even after the stream has failed, the message is nonetheless dequeued.

I'm looking for reasons why this error occurs and for potential resolutions.


1 answer

Thomas Bailey 11 Reputation points
2022-07-14T13:00:17.86+00:00

OK, a few minutes later I found the answer (I keep doing this).

It turns out an account key alone is not sufficient for the abfss protocol here, so I've added the following OAuth (service principal) configs:

# Authenticate to ADLS Gen2 over abfss with a service principal via OAuth 2.0
spark.conf.set(
    "fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net",
    "OAuth")
spark.conf.set(
    "fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(
    "fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net",
    "<application-id>")
# The client secret is read from a Databricks secret scope rather than inlined
spark.conf.set(
    "fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"))
spark.conf.set(
    "fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net",
    "https://login.microsoftonline.com/<directory-id>/oauth2/token")

This returns data now.
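
For anyone landing here: before re-running the stream, a quick sanity check that the OAuth config is being picked up is to list the landing folder directly over abfss (the container name below is a placeholder):

# Should succeed without the fs.azure.account.key error if OAuth is configured
display(dbutils.fs.ls("abfss://<container>@datalake.dfs.core.windows.net/"))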