Access Azure Data Lake Storage Gen2 and Blob Storage
Use the Azure Blob Filesystem driver (ABFS) to connect to Azure Blob Storage and Azure Data Lake Storage Gen2 from Azure Databricks. Databricks recommends securing access to Azure storage containers by using Azure service principals set in cluster configurations.
Note
Databricks no longer recommends mounting external data locations to Databricks Filesystem. See Mounting cloud object storage on Azure Databricks.
This article details how to access Azure storage containers using:
- Unity Catalog managed external locations
- Azure service principals
- SAS tokens
- Account keys
You set Spark properties to configure these credentials for a compute environment, either (a brief sketch of both options follows this list):
- Scoped to an Azure Databricks cluster
- Scoped to an Azure Databricks notebook
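As a rough sketch of the difference (the property name and placeholder values below are illustrative only, using the account key property covered later in this article): a cluster-scoped credential is entered as a key/value pair in the cluster's Spark config field and applies to every notebook attached to that cluster, while a notebook-scoped credential is set on the Spark session from the notebook itself.
# Cluster scope: add the property to the cluster's Spark config field
# (key and value separated by a space), for example:
#   fs.azure.account.key.<storage-account>.dfs.core.windows.net <access-key-or-secret-reference>

# Notebook scope: set the same property from a notebook; it applies only to the current session.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<access-key>")  # prefer a secret scope to a literal value; see the recommendation below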
Azure service principals can also be used to access Azure storage from Databricks SQL; see Data access configuration.
Databricks recommends using secret scopes for storing all credentials.
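For example (the scope and key names are placeholders), a credential stored in a secret scope is retrieved at runtime with dbutils.secrets.get instead of being pasted into notebook code:
# Fetch the stored credential at runtime; Databricks redacts secret values printed in notebook output.
service_credential = dbutils.secrets.get(scope="<scope>", key="<service-credential-key>")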
Deprecated patterns for storing and accessing data from Azure Databricks
The following are deprecated storage patterns:
- Databricks no longer recommends Azure Active Directory credential passthrough. See Access Azure Data Lake Storage using Azure Active Directory credential passthrough (legacy).
- The legacy Windows Azure Storage Blob driver (WASB) has been deprecated. ABFS has numerous benefits over WASB. See Azure documentation on ABFS. For documentation for working with the legacy WASB driver, see Connect to Azure Blob Storage with WASB (legacy).
- Azure has announced the pending retirement of Azure Data Lake Storage Gen1. Azure Databricks recommends migrating all Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2. If you have not yet migrated, see Accessing Azure Data Lake Storage Gen1 from Azure Databricks.
Access Azure Data Lake Storage Gen2 with Unity Catalog external locations
Note
Azure Data Lake Storage Gen2 is the only Azure storage type supported by Unity Catalog.
Unity Catalog manages access to data in Azure Data Lake Storage Gen2 using external locations. Administrators primarily use external locations to configure Unity Catalog external tables, but can also delegate access to users or groups using the available privileges (READ FILES, WRITE FILES, and CREATE TABLE).
Use the fully qualified ABFS URI to access data secured with Unity Catalog. Because permissions are managed by Unity Catalog, you do not need to pass any additional options or configurations for authentication.
Warning
Unity Catalog ignores Spark configuration settings when accessing data managed by external locations.
Examples of reading:
dbutils.fs.ls("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data")
spark.read.format("parquet").load("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data")
spark.sql("SELECT * FROM parquet.`abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data`")
Examples of writing:
dbutils.fs.mv("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data", "abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/new-location")
df.write.format("parquet").save("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/new-location")
Examples of creating external tables:
df.write.option("path", "abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/table").saveAsTable("my_table")
spark.sql("""
CREATE TABLE my_table
LOCATION "abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/table"
AS (SELECT *
FROM parquet.`abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data`)
""")
Direct access using ABFS URI for Blob Storage or Azure Data Lake Storage Gen2
If you have properly configured credentials to access your Azure storage container, you can interact with resources in the storage account using URIs. Databricks recommends using the abfss driver for greater security.
spark.read.load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
CREATE TABLE <database-name>.<table-name>;
COPY INTO <database-name>.<table-name>
FROM 'abfss://container@storageAccount.dfs.core.windows.net/path/to/folder'
FILEFORMAT = CSV
COPY_OPTIONS ('mergeSchema' = 'true');
Access Azure Data Lake Storage Gen2 or Blob Storage using OAuth 2.0 with an Azure service principal
You can securely access data in an Azure storage account using OAuth 2.0 with an Azure Active Directory (Azure AD) application service principal for authentication; see Access storage with Azure Active Directory.
# Retrieve the client secret for the service principal from a Databricks secret scope
service_credential = dbutils.secrets.get(scope="<scope>", key="<service-credential-key>")

# Configure OAuth 2.0 client-credentials authentication for the storage account
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
Replace
- <scope> with the Databricks secret scope name.
- <service-credential-key> with the name of the key containing the client secret.
- <storage-account> with the name of the Azure storage account.
- <application-id> with the Application (client) ID for the Azure Active Directory application.
- <directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
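Once these properties are set, you read and write data with the standard ABFS URI and no additional authentication options. For example, a minimal read (the container name and path are placeholders):
# Read Parquet data from the container using the service principal credentials configured above
df = spark.read.format("parquet").load(
    "abfss://<container-name>@<storage-account>.dfs.core.windows.net/<path-to-data>")
display(df)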
Access Azure Data Lake Storage Gen2 or Blob Storage using a SAS token
You can use storage shared access signatures (SAS) to access an Azure Data Lake Storage Gen2 storage account directly. With SAS, you can restrict access to a storage account using temporary tokens with fine-grained access control.
You can configure SAS tokens for multiple storage accounts in the same Spark session.
Note
SAS support is available in Databricks Runtime 7.5 and above.
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "<token>")
Access Azure Data Lake Storage Gen2 or Blob Storage using the account key
You can use storage account access keys to manage access to Azure Storage.
spark.conf.set(
"fs.azure.account.key.<storage-account>.dfs.core.windows.net",
dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
Replace
- <storage-account> with the Azure Storage account name.
- <scope> with the Azure Databricks secret scope name.
- <storage-account-access-key> with the name of the key containing the Azure storage account access key.
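After the key is configured, a quick way to confirm access is to list the container root (the container name is a placeholder):
# List the top-level contents of the container to verify that the access key works
dbutils.fs.ls("abfss://<container-name>@<storage-account>.dfs.core.windows.net/")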
Example notebook
ADLS Gen2 OAuth 2.0 with Azure service principals notebook
Azure Data Lake Storage Gen2 FAQs and known issues