Whitelist Databricks to read and write to an Azure Storage account

Chand, Anupam SBOBNG-ITA/RX 471 Reputation points
2022-10-07T07:03:51.827+00:00

We have a Databricks workspace that is currently integrated with Azure Data Lake Gen1. The workspace is not VNet-injected. The Data Lake has IP whitelisting enabled, and we have enabled the exception that allows trusted Azure services to access it.
[screenshot: 248370-image.png]

We are now migrating from Data Lake Gen1 to Azure Data Lake Storage Gen2, which is essentially a storage account with hierarchical namespace enabled. We would like similar whitelisting on the new storage account. However, we do not see a comparable option to whitelist Azure services on the storage account. There is another option available, but Databricks is not on its trusted services list. We tried enabling it, but we are still unable to integrate our Databricks workspace with the new storage account.
[screenshot: 248355-image.png]

We also tried selecting resource instances as shown below, but even then we were not able to connect to the new data lake from our ADB workspace.
[screenshot: 248377-image.png]

Does this mean our only option is to recreate our Databricks workspace with VNet injection, or to add a NAT gateway so we have an IP we can whitelist? That seems like a very big exercise, since we cannot modify our existing workspace, and the process is not documented anywhere.

Can you please let us know the whitelisting settings required to integrate a non-VNet-injected Databricks workspace with a storage account that has IP whitelisting enabled?
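
For reference, a minimal sketch of the kind of notebook access we are trying to get working is below. It uses OAuth with a service principal; the storage account, container, secret scope, and tenant values are placeholders.

```python
# Minimal sketch: accessing ADLS Gen2 (abfss://) from a Databricks notebook
# with a service principal via OAuth. All names below are placeholders.
storage_account = "<storage-account>"   # placeholder
container = "<container>"               # placeholder
tenant_id = "<tenant-id>"               # placeholder

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="<secret-scope>", key="<sp-client-id>"),
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="<secret-scope>", key="<sp-client-secret>"),
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)

# Simple connectivity check: list the container root. If the storage firewall
# blocks the cluster's egress IP, this fails with a 403 AuthorizationFailure
# even though the credentials are valid.
dbutils.fs.ls(f"abfss://{container}@{storage_account}.dfs.core.windows.net/")
```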

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Storage
Globally unique resources that provide access to data management services and serve as the parent namespace for the services.
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Answer accepted by question author
  1. SaiKishor-MSFT 17,336 Reputation points Moderator
    2022-10-07T20:09:47.637+00:00

    @Chand, Anupam SBOBNG-ITA/RX Thanks for reaching out to Microsoft Q&A. I understand that you want to allow access to your Azure Storage account from Databricks, but you see that it is not part of the trusted services list.

    Please refer to this similar thread- https://stackoverflow.com/questions/54018584/azure-databricks-accessing-blob-storage-behind-firewall

    Yes, Azure Databricks does not count as a trusted Microsoft service; you can see the list of supported trusted Microsoft services in the storage account firewall documentation.

    Here are two suggestions:

    1. Find the Azure datacenter IP ranges, scope them to the region where your Azure Databricks workspace is located, and whitelist those IPs in the storage account firewall (see the Python sketch after this list).
    2. Deploy Azure Databricks in your own Azure virtual network (VNet injection), then whitelist the VNet address range in the storage account firewall. You can refer to Configure Azure Storage firewalls and virtual networks. You can also use an NSG to restrict inbound and outbound traffic for this VNet. Note: this requires deploying Azure Databricks into your own VNet.
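
    For suggestion 1, the IP rules can be added in the portal or programmatically. Below is a minimal Python sketch using the azure-mgmt-storage SDK; the subscription, resource group, storage account name, and IP ranges are placeholders, and model names may differ slightly between SDK versions.

    ```python
    # Sketch: whitelist IP ranges on a storage account firewall with the
    # azure-mgmt-storage SDK. All values below are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient
    from azure.mgmt.storage.models import (
        StorageAccountUpdateParameters,
        NetworkRuleSet,
        IPRule,
    )

    subscription_id = "<subscription-id>"                 # placeholder
    resource_group = "<resource-group>"                   # placeholder
    account_name = "<storage-account>"                    # placeholder
    allowed_ips = ["<ip-or-cidr-1>", "<ip-or-cidr-2>"]    # placeholder ranges

    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

    # Deny by default, keep the trusted-services bypass, and add the IP rules.
    client.storage_accounts.update(
        resource_group,
        account_name,
        StorageAccountUpdateParameters(
            network_rule_set=NetworkRuleSet(
                default_action="Deny",
                bypass="AzureServices",
                ip_rules=[IPRule(ip_address_or_range=ip) for ip in allowed_ips],
            )
        ),
    )
    ```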

    Hope this helps. Please let us know if you have any more questions and we will be glad to assist you further. Thank you!

    Remember:

    Please accept an answer if correct. Original posters help the community find answers faster by identifying the correct answer. Here is how.

    Want a reminder to come back and check responses? Here is how to subscribe to a notification.

    1 person found this answer helpful.

1 additional answer

  1. vb 0 Reputation points
    2023-11-29T15:02:39.3233333+00:00

    Hi SaiKishor,

    We tried whitelisting the Azure Databricks IP addresses in our blob storage account firewall, as per your suggestion 1, but Databricks still fails to connect to the storage account, raising the error message below:

    "This request is not authorized to perform this operation.", 403, GET, https://<storage_account>.dfs.core.windows.net/<container>?upn=false&resource=filesystem&maxResults=5000&timeout=90&recursive=false, AuthorizationFailure, "This request is not authorized to perform this operation. RequestId:xxx Time:xxx"

    We are confident this problem is caused by the storage firewall, because Databricks connects successfully with the same credentials as soon as the firewall is disabled on the storage account.

    We started with the list of IP addresses below, for the Azure region of our Databricks service (UK South):

    https://learn.microsoft.com/en-us/azure/databricks/resources/supported-regions

    That didn't work, so we added all Databricks IPs listed here:

    https://www.microsoft.com/en-us/download/details.aspx?id=56519 (from your link above), which also failed.

    Is this definitely the complete list of Azure Databricks IPs to whitelist? If so, any ideas on what else could be happening here?

    We also used a local Python script to connect to the same storage account with the same credentials. It failed with the same error message when the local IP address was not whitelisted, and worked once the local IP was added to the firewall. This gives us further confidence that this is a firewall problem affecting Databricks connectivity.
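
    For completeness, the local test is roughly the sketch below, assuming the azure-storage-file-datalake SDK and a service principal; the account, container, and credential values are placeholders. It issues the same filesystem listing as the failing request in the error message above.

    ```python
    # Sketch of a local connectivity test against the ADLS Gen2 endpoint.
    # Account, container, and service principal values are placeholders.
    from azure.identity import ClientSecretCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    credential = ClientSecretCredential(
        tenant_id="<tenant-id>",          # placeholder
        client_id="<client-id>",          # placeholder
        client_secret="<client-secret>",  # placeholder
    )

    service = DataLakeServiceClient(
        account_url="https://<storage_account>.dfs.core.windows.net",
        credential=credential,
    )

    # Same operation as the failing GET .../<container>?resource=filesystem
    # request: list the top-level paths in the container.
    filesystem = service.get_file_system_client("<container>")
    for path in filesystem.get_paths(recursive=False):
        print(path.name)
    # From a non-whitelisted IP this raises a 403 AuthorizationFailure;
    # from a whitelisted IP it succeeds.
    ```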

    Many thanks in advance!

