Access Azure Data Lake Storage Gen2 and Blob Storage

Use the Azure Blob Filesystem driver (ABFS) to connect to Azure Blob Storage and Azure Data Lake Storage Gen2 from Azure Databricks. Databricks recommends securing access to Azure storage containers by using Azure service principals set in cluster configurations.

This article details how to access Azure storage containers using:

  • Unity Catalog managed external locations
  • Azure service principals
  • SAS tokens
  • Account keys

For Azure service principals, SAS tokens, and account keys, you set Spark properties to configure the credentials for a compute environment, either:

  • Scoped to an Azure Databricks cluster
  • Scoped to an Azure Databricks notebook
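
As a rough sketch of how the two scopes differ: in a notebook, each property is passed to spark.conf.set, while a cluster-scoped setup lists the same properties in the cluster's Spark config, one key and value per line. The helper names below (secret_ref, cluster_spark_config) are hypothetical. The {{secrets/<scope>/<key>}} reference form and the optional spark.hadoop. prefix are assumptions about the cluster configuration format, so confirm them for your workspace before relying on them.

```python
def secret_ref(scope, key):
    """Build a secret reference string (assumed form: {{secrets/<scope>/<key>}})."""
    return "{{secrets/" + scope + "/" + key + "}}"


def cluster_spark_config(properties, prefix=""):
    """Render properties as 'key value' lines, one per line.

    `prefix` is an assumption: some setups prepend 'spark.hadoop.' so that
    Spark copies the property into the Hadoop configuration.
    """
    return "\n".join(f"{prefix}{key} {value}" for key, value in properties.items())


# Placeholder values; replace them as described in the sections of this article.
account = "<storage-account>"
properties = {
    f"fs.azure.account.key.{account}.dfs.core.windows.net": secret_ref(
        "<scope>", "<storage-account-access-key>"
    ),
}

# Text intended for the cluster's Spark config field.
print(cluster_spark_config(properties))

# Notebook-scoped equivalent (runs only where `spark` and `dbutils` exist):
# spark.conf.set(
#     f"fs.azure.account.key.{account}.dfs.core.windows.net",
#     dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))
```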

Azure service principals can also be used to access Azure storage from Databricks SQL; see Configure access to cloud storage.

Databricks recommends using secret scopes for storing all credentials.

Deprecated patterns for storing and accessing data from Azure Databricks

The following are deprecated storage patterns:

Access Azure Data Lake Storage Gen2 with Unity Catalog external locations

Note

Azure Data Lake Storage Gen2 is the only Azure storage type supported by Unity Catalog.

Unity Catalog manages access to data in Azure Data Lake Storage Gen2 using external locations. Administrators primarily use external locations to configure Unity Catalog external tables, but can also delegate access to users or groups using the available privileges (READ FILES, WRITE FILES, and CREATE TABLE).
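
To illustrate delegating access, the sketch below builds GRANT statements for the three privileges named above. The function name grant_statement, the external location name my_location, and the group name data-engineers are hypothetical, and the GRANT ... ON EXTERNAL LOCATION form is assumed from Unity Catalog SQL; check it against the SQL reference for your runtime.

```python
# Privileges listed in this article for external locations.
ALLOWED_PRIVILEGES = ("READ FILES", "WRITE FILES", "CREATE TABLE")


def grant_statement(privilege, location, principal):
    """Return a GRANT statement for an external location (assumed syntax)."""
    if privilege not in ALLOWED_PRIVILEGES:
        raise ValueError(f"unsupported privilege: {privilege!r}")
    return f"GRANT {privilege} ON EXTERNAL LOCATION `{location}` TO `{principal}`"


# Hypothetical location and group names.
statement = grant_statement("READ FILES", "my_location", "data-engineers")
print(statement)

# In a notebook attached to Unity Catalog-enabled compute, run it with:
# spark.sql(statement)
```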

Use the fully qualified ABFS URI to access data secured with Unity Catalog. Because permissions are managed by Unity Catalog, you do not need to pass any additional options or configurations for authentication.

Warning

Unity Catalog ignores Spark configuration settings when accessing data managed by external locations.

Examples of reading:

dbutils.fs.ls("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data")

spark.read.format("parquet").load("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data")

spark.sql("SELECT * FROM parquet.`abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data`")

Examples of writing:

dbutils.fs.mv("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data", "abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/new-location")

df.write.format("parquet").save("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/new-location")

Examples of creating external tables:

df.write.option("path", "abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/table").saveAsTable("my_table")

spark.sql("""
  CREATE TABLE my_table
  LOCATION "abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/table"
  AS (SELECT *
    FROM parquet.`abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data`)
""")

Direct access using ABFS URI for Blob Storage or Azure Data Lake Storage Gen2

If you have properly configured credentials to access your Azure storage container, you can interact with resources in the storage account using URIs. Databricks recommends using the abfss scheme, which connects over TLS, for greater security.

spark.read.load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")

dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")

CREATE TABLE <database-name>.<table-name>;

COPY INTO <database-name>.<table-name>
FROM 'abfss://container@storageAccount.dfs.core.windows.net/path/to/folder'
FILEFORMAT = CSV
COPY_OPTIONS ('mergeSchema' = 'true');
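
The URIs in the examples above all follow the same pattern: scheme, container, storage account, the dfs.core.windows.net endpoint, and a path. A small helper can assemble them and avoid typos; the name abfss_uri is hypothetical and this is only a minimal sketch.

```python
def abfss_uri(container, storage_account, path=""):
    """Build a fully qualified ABFS URI over TLS for the given container and path."""
    if not container or not storage_account:
        raise ValueError("container and storage_account are required")
    # Drop leading slashes so the result has exactly one separator after the host.
    return (
        f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
        f"{path.lstrip('/')}"
    )


uri = abfss_uri("container", "storageAccount", "path/to/folder")
print(uri)

# In a notebook with credentials already configured:
# spark.read.load(uri)
# dbutils.fs.ls(uri)
```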

Access Azure Data Lake Storage Gen2 or Blob Storage using OAuth 2.0 with an Azure service principal

You can securely access data in an Azure storage account using OAuth 2.0 with an Azure Active Directory (Azure AD) application service principal for authentication; see Access storage with Azure Active Directory.

service_credential = dbutils.secrets.get(scope="<scope>",key="<service-credential-key>")

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

Replace

  • <scope> with the Databricks secret scope name.
  • <service-credential-key> with the name of the key containing the client secret.
  • <storage-account> with the name of the Azure storage account.
  • <application-id> with the Application (client) ID for the Azure Active Directory application.
  • <directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
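
If you configure several storage accounts, the five properties above can be generated from the replaced values instead of being written out by hand. The sketch below does this; oauth_properties and apply_properties are hypothetical helper names, not part of any Databricks API, and the property keys are copied from the example above.

```python
def oauth_properties(storage_account, application_id, directory_id, client_secret):
    """Return the OAuth 2.0 Spark properties for one storage account."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": application_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


def apply_properties(spark, properties):
    """Set each property on the session; `spark` is any object exposing conf.set."""
    for key, value in properties.items():
        spark.conf.set(key, value)


# Notebook usage (runs only where `spark` and `dbutils` exist):
# secret = dbutils.secrets.get(scope="<scope>", key="<service-credential-key>")
# apply_properties(spark, oauth_properties(
#     "<storage-account>", "<application-id>", "<directory-id>", secret))
```

Keeping the client secret as a function argument means it is still read from a secret scope at call time rather than stored in the notebook.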

Access Azure Data Lake Storage Gen2 or Blob Storage using a SAS token

You can use storage shared access signatures (SAS) to access an Azure Data Lake Storage Gen2 storage account directly. With SAS, you can restrict access to a storage account using temporary tokens with fine-grained access control.

You can configure SAS tokens for multiple storage accounts in the same Spark session.

Note

SAS support is available in Databricks Runtime 7.5 and above.

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", "<token>")
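
Because each property name includes the storage account, tokens for several accounts can coexist in one Spark session. The sketch below builds the properties for multiple accounts at once; sas_properties and the account names are hypothetical, and in a notebook the tokens would typically be read from a secret scope, in line with the recommendation earlier in this article.

```python
def sas_properties(tokens_by_account):
    """Return the SAS Spark properties for every storage account in the mapping."""
    properties = {}
    for account, token in tokens_by_account.items():
        host = f"{account}.dfs.core.windows.net"
        properties[f"fs.azure.account.auth.type.{host}"] = "SAS"
        properties[f"fs.azure.sas.token.provider.type.{host}"] = (
            "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider"
        )
        properties[f"fs.azure.sas.fixed.token.{host}"] = token
    return properties


# Hypothetical account names with placeholder tokens.
properties = sas_properties({"accountone": "<token-1>", "accounttwo": "<token-2>"})
for key in properties:
    print(key)  # print keys only, so tokens never reach notebook output

# Notebook usage (runs only where `spark` exists):
# for key, value in properties.items():
#     spark.conf.set(key, value)
```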

Access Azure Data Lake Storage Gen2 or Blob Storage using the account key

You can use storage account access keys to manage access to Azure Storage.

spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))

Replace

  • <storage-account> with the Azure Storage account name.
  • <scope> with the Azure Databricks secret scope name.
  • <storage-account-access-key> with the name of the key containing the Azure storage account access key.

Example notebook

ADLS Gen2 OAuth 2.0 with Azure service principals notebook

Azure Data Lake Storage Gen2 FAQs and known issues

See Azure Data Lake Storage Gen2 FAQ.