Connect to Azure Data Lake Storage Gen2 and Blob Storage
Note
This article describes legacy patterns for configuring access to Azure Data Lake Storage Gen2. Databricks recommends using Unity Catalog to configure access to Azure Data Lake Storage Gen2 and volumes for direct interaction with files. See Connect to cloud object storage and services using Unity Catalog.
This article explains how to connect to Azure Data Lake Storage Gen2 and Blob Storage from Azure Databricks.
Connect to Azure Data Lake Storage Gen2 or Blob Storage using Azure credentials
The following credentials can be used to access Azure Data Lake Storage Gen2 or Blob Storage:
OAuth 2.0 with a Microsoft Entra ID service principal: Databricks recommends using Microsoft Entra ID service principals to connect to Azure Data Lake Storage Gen2. To create a Microsoft Entra ID service principal and provide it access to Azure storage accounts, see Access storage using a service principal & Microsoft Entra ID (Azure Active Directory).
To create a Microsoft Entra ID service principal, you must have the Application Administrator role or the Application.ReadWrite.All permission in Microsoft Entra ID. To assign roles on a storage account, you must be an Owner or a user with the User Access Administrator Azure RBAC role on the storage account.
Important
Blob storage does not support Microsoft Entra ID service principals.
Shared access signatures (SAS): You can use storage SAS tokens to access Azure storage. With SAS, you can restrict access to a storage account using temporary tokens with fine-grained access control.
You can only grant a SAS token permissions that you yourself have on the storage account, container, or file.
Account keys: You can use storage account access keys to manage access to Azure Storage. Storage account access keys provide full access to the configuration of a storage account, as well as the data. Databricks recommends using a Microsoft Entra ID service principal or a SAS token to connect to Azure storage instead of account keys.
To view an account’s access keys, you must have the Owner, Contributor, or Storage Account Key Operator Service role on the storage account.
Databricks recommends using secret scopes for storing all credentials. You can grant users, service principals, and groups in your workspace access to read the secret scope. This protects the Azure credentials while allowing users to access Azure storage. To create a secret scope, see Manage secret scopes.
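As a minimal sketch of reading credentials from a secret scope, the helper below collects several secrets in one call. It assumes a getter that behaves like `dbutils.secrets.get(scope=..., key=...)`, which the Databricks runtime provides in notebooks. The helper name `read_storage_secrets` and the key names are hypothetical. The getter is passed in as an argument so the function can also be exercised outside a notebook.

```python
from typing import Callable, Dict


def read_storage_secrets(
    get_secret: Callable[..., str],
    scope: str,
    keys: Dict[str, str],
) -> Dict[str, str]:
    """Read several secrets from a single secret scope.

    get_secret is expected to accept scope= and key= keyword arguments,
    like dbutils.secrets.get. keys maps a local name (for example
    "client_secret") to the name of the secret key inside the scope.
    """
    return {name: get_secret(scope=scope, key=key) for name, key in keys.items()}


# In a Databricks notebook (dbutils is injected by the runtime):
# creds = read_storage_secrets(
#     dbutils.secrets.get,
#     scope="<scope>",
#     keys={"client_secret": "<service-credential-key>"},
# )
```

Keeping the secret values in a dictionary, rather than printing or logging them, preserves the protection that the secret scope is meant to provide.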
Set Spark properties to configure Azure credentials to access Azure storage
You can set Spark properties to configure Azure credentials to access Azure storage. The credentials can be scoped to either a cluster or a notebook. Use both cluster access control and notebook access control together to protect access to Azure storage. See Compute permissions and Collaborate using Databricks notebooks.
Note
Microsoft Entra ID service principals can also be used to access Azure storage from a SQL warehouse. See Enable data access configuration.
To set Spark properties, use the following snippet in a cluster’s Spark configuration or a notebook:
Azure service principal
Use the following format to set the cluster Spark configuration:
Replace the following placeholders:
<storage-account> with the Azure Storage account name.
<scope> with the Azure Databricks secret scope name.
<storage-account-access-key> with the name of the key containing the Azure storage account access key.
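The Spark properties for each credential type can be sketched as plain dictionaries. This is an illustrative sketch, not a definitive configuration: the property names and provider classes are those of the Hadoop ABFS driver, while the helper names (`service_principal_conf`, `sas_conf`, `account_key_conf`, `cluster_conf_lines`) are hypothetical. The cluster form assumes the `spark.hadoop.` prefix and the `{{secrets/<scope>/<key>}}` secret reference syntax used in cluster Spark configuration.

```python
from typing import Dict, List

# Assumption: the storage account is reached through the ABFS (dfs) endpoint.
ABFS_SUFFIX = "dfs.core.windows.net"


def service_principal_conf(
    storage_account: str, application_id: str, client_secret: str, directory_id: str
) -> Dict[str, str]:
    """OAuth 2.0 properties for a Microsoft Entra ID service principal."""
    host = f"{storage_account}.{ABFS_SUFFIX}"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": application_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


def sas_conf(storage_account: str, sas_token: str) -> Dict[str, str]:
    """Properties for a fixed shared access signature (SAS) token."""
    host = f"{storage_account}.{ABFS_SUFFIX}"
    return {
        f"fs.azure.account.auth.type.{host}": "SAS",
        f"fs.azure.sas.token.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
        f"fs.azure.sas.fixed.token.{host}": sas_token,
    }


def account_key_conf(storage_account: str, access_key: str) -> Dict[str, str]:
    """Single property for a storage account access key."""
    return {f"fs.azure.account.key.{storage_account}.{ABFS_SUFFIX}": access_key}


def cluster_conf_lines(props: Dict[str, str]) -> List[str]:
    """Render properties as 'spark.hadoop.<name> <value>' lines for a cluster."""
    return [f"spark.hadoop.{name} {value}" for name, value in props.items()]


# Notebook scope (spark and dbutils are injected by the Databricks runtime):
# key = dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>")
# for name, value in account_key_conf("<storage-account>", key).items():
#     spark.conf.set(name, value)
#
# Cluster scope: pass a secret reference instead of the secret value.
# cluster_conf_lines(account_key_conf(
#     "<storage-account>", "{{secrets/<scope>/<storage-account-access-key>}}"))
```

Building the properties as data keeps the notebook and cluster forms in sync: the same dictionary feeds either `spark.conf.set` or the rendered cluster configuration lines.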
Access Azure storage
Once you have properly configured credentials to access your Azure storage container, you can interact with resources in the storage account using URIs. Databricks recommends using the abfss driver for greater security.
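As a short illustration of the URI form, an `abfss` URI names the container, the storage account, and a path inside the container. The helper name `abfss_uri` and the sample values are hypothetical; the sketch assumes the standard `dfs.core.windows.net` endpoint.

```python
def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Build abfss://<container>@<storage-account>.dfs.core.windows.net/<path>."""
    # Strip leading slashes so the result has exactly one slash after the host.
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path.lstrip('/')}"


# In a Databricks notebook (spark and dbutils are injected by the runtime):
# uri = abfss_uri("<container-name>", "<storage-account>", "<path-to-data>")
# df = spark.read.load(uri)
# dbutils.fs.ls(uri)
```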
If you try accessing a storage container created through the Azure portal, you might receive the following error:
StatusCode=404
StatusDescription=The specified filesystem does not exist.
ErrorCode=FilesystemNotFound
ErrorMessage=The specified filesystem does not exist.
When a hierarchical namespace is enabled, you don’t need to create containers through the Azure portal. If you see this issue, delete the Blob container through the Azure portal. After a few minutes, you can access the container. Alternatively, you can change your abfss URI to use a different container, as long as that container was not created through the Azure portal.