Azure Databricks - Access Azure Data Lake Storage Gen2 using OAuth 2.0 with an Azure service principal

Gopinath Rajee 656 Reputation points
2022-05-14T20:32:47.183+00:00

All,

I tried setting the connection details at the cluster level based on the following link, and it works. But it has me specify the secret directly. Am I missing something? How can I make this work without having to specify the secret in plain text?

The "Replace" list in the docs also has an extra line that doesn't correspond to anything in the snippet below and needs to be removed:

<service-credential-key-name> with the name of the key containing the client secret.

spark.hadoop.fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net <service-credential>
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token


1 answer

  1. MartinJaffer-MSFT 26,236 Reputation points
    2022-05-16T19:33:04.927+00:00

    Hello @Gopinath Rajee ,
    Thanks for the question and using MS Q&A platform.

    As I understand it, the ask is to get the credentials out of the cluster configuration and store them securely elsewhere: the configuration should point to the credentials without exposing them. This applies specifically to the RDD option, where the connection details go in the cluster configuration rather than in the notebook as with the other options.
    Please look at how to retrieve Spark configuration properties from a secret. You will first need to set up the secrets, as described later in this post.

    Do note that referencing secrets in Spark configuration is in Public Preview and available in Databricks Runtime 6.4 Extended Support and above. Link
    Please read the details, as there are still security concerns with this method: any notebook attached to the cluster can retrieve the secret, because notebooks can read configuration properties, and the value is not redacted there.
    As such, I highly recommend specifying your connection a different way instead of this RDD-style cluster config; see the notebook sketch near the end of this answer.

    spark.<property-name> {{secrets/<scope-name>/<secret-name>}}  
    

    There should be no space between the two { characters; the Q&A editor forces some formatting here.
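
    For example, the client-secret property from the cluster config in the question could point at a secret instead of the literal value. This is a sketch, assuming a secret scope named adls-scope holding a key named sp-secret; both names are placeholders for whatever you created:

    spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net {{secrets/adls-scope/sp-secret}}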

    Info on secrets in general:

    Link to Secret Management in Databricks.

    In Databricks, the mechanism for this is called secret scopes. There are two options for where to store the secrets: Azure Key Vault-backed secret scopes and Databricks-backed secret scopes.
    In both cases, the code to fetch the secret is the same: dbutils.secrets.get(scope = "myScopeName", key = "mySecretName")
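
    As a minimal sketch of using this in a notebook (the scope name adls-scope and key name sp-secret are assumptions; substitute the names you created):

    # List the secret scopes and keys visible to this workspace
    display(dbutils.secrets.listScopes())
    display(dbutils.secrets.list("adls-scope"))

    # Fetch the client secret; printing it in a notebook only shows [REDACTED]
    client_secret = dbutils.secrets.get(scope="adls-scope", key="sp-secret")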

    Link to secret workflow.
    Link to workflow specific to ADLS Gen2 and OAuth2.
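
    To illustrate the notebook-based approach recommended above, here is a minimal sketch that sets the same OAuth properties per session instead of in the cluster config. All angle-bracket placeholders and the scope/key names are assumptions; note that the spark.hadoop. prefix is dropped when setting session configuration for the DataFrame API:

    storage_account = "<storage-account-name>"

    # Fetch the service principal's client secret from a secret scope
    client_secret = dbutils.secrets.get(scope="adls-scope", key="sp-secret")

    spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", "<application-id>")
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
                   "https://login.microsoftonline.com/<directory-id>/oauth2/token")

    # DataFrame reads then authenticate via the service principal, e.g.:
    df = spark.read.text(f"abfss://<container>@{storage_account}.dfs.core.windows.net/<path-to-file>")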

    Please do let me know if you have any queries.

    Thanks
    Martin


