Connect to cloud object storage using Unity Catalog

This article gives an overview of the cloud storage connection configurations that are required to work with data using Unity Catalog.

Databricks recommends using Unity Catalog to manage access to all data stored in cloud object storage. Unity Catalog provides a suite of tools to configure secure connections to cloud object storage. These connections provide the access you need to complete the following actions:

  • Ingest raw data into a lakehouse.
  • Create and read managed tables in secure cloud storage.
  • Register or create external tables containing tabular data.
  • Read and write unstructured data.

Warning

Do not give end users storage-level access to Unity Catalog managed tables or volumes. This compromises data security and governance.

Granting users direct storage-level access to external location storage in Azure Data Lake Storage Gen2 does not honor any permissions granted or audits maintained by Unity Catalog. Direct access will bypass auditing, lineage, and other security and monitoring features of Unity Catalog, including access control and permissions. You are responsible for managing direct storage access through Azure Data Lake Storage Gen2 and ensuring that users have the appropriate permissions granted via Fabric.

Avoid all scenarios that grant direct storage-level write access to containers or buckets that store Databricks managed tables. Modifying, deleting, or evolving objects directly through storage, when those objects were originally managed by Unity Catalog, can result in data corruption.

Note

If your workspace was created before November 9, 2023, it might not be enabled for Unity Catalog. An account admin must enable Unity Catalog for your workspace. See Enable a workspace for Unity Catalog.

How does Unity Catalog connect object storage to Azure Databricks?

Azure Databricks supports both Azure Data Lake Storage Gen2 containers and Cloudflare R2 buckets as cloud storage locations for data and AI assets registered in Unity Catalog. R2 is intended primarily for use cases in which you want to avoid data egress fees, such as Delta Sharing across clouds and regions. For more information, see Use Cloudflare R2 replicas or migrate storage to R2.

To manage access to the underlying cloud storage that holds tables and volumes, Unity Catalog uses the following object types:

  • A storage credential represents an authentication and authorization mechanism for accessing data stored on your cloud tenant, using an Azure managed identity or service principal for Azure Data Lake Storage Gen2 containers or an R2 API token for Cloudflare R2 buckets. Each storage credential is subject to Unity Catalog access-control policies that control which users and groups can access the credential. If a user does not have access to a storage credential in Unity Catalog, the request fails and Unity Catalog does not attempt to authenticate to your cloud tenant on the user’s behalf. Permission to create storage credentials should only be granted to users who need to define external locations. See Create a storage credential for connecting to Azure Data Lake Storage Gen2 and Create a storage credential for connecting to Cloudflare R2.

  • An external location is an object that combines a cloud storage path with a storage credential that authorizes access to the cloud storage path. Each external location is subject to Unity Catalog access-control policies that control which users and groups can access the external location. If a user does not have access to an external location in Unity Catalog, the request fails and Unity Catalog does not attempt to authenticate to your cloud tenant on the user’s behalf. Permission to create and use external locations should only be granted to users who need to create external tables, external volumes, or managed storage locations. See Create an external location to connect cloud storage to Azure Databricks.

    External locations are used both for external data assets, like external tables and external volumes, and for managed data assets, like managed tables and managed volumes. For more information about the difference, see What is a table? and What are Unity Catalog volumes?.

    When an external location is used for storing managed tables and managed volumes, it is called a managed storage location. Managed storage locations can exist at the metastore, catalog, or schema level. Databricks recommends configuring managed storage locations at the catalog level. If you need more granular isolation, you can specify managed storage locations at the schema level. Workspaces that were automatically enabled for Unity Catalog have no metastore-level managed storage location by default, but you can specify one at the metastore level to provide a default location when no catalog-level storage is defined. Workspaces that were manually enabled for Unity Catalog receive a metastore-level managed storage location by default. See Specify a managed storage location in Unity Catalog and Unity Catalog best practices.
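The way the three levels of managed storage relate to each other can be sketched in a few lines of Python. This is an illustrative model only, not Unity Catalog code: it assumes that the most specific level with a defined location is the one used (schema first, then catalog, then metastore). The function name and the storage paths are hypothetical.

```python
from typing import Optional


def resolve_managed_location(
    schema_location: Optional[str],
    catalog_location: Optional[str],
    metastore_location: Optional[str],
) -> str:
    """Return the most specific managed storage location that is defined."""
    # Check from the most granular level (schema) to the least (metastore).
    for location in (schema_location, catalog_location, metastore_location):
        if location:
            return location
    raise ValueError("No managed storage location is defined at any level")


# Hypothetical ADLS Gen2 paths, for illustration only.
CATALOG_PATH = "abfss://managed@examplestorage.dfs.core.windows.net/finance"
SCHEMA_PATH = "abfss://managed@examplestorage.dfs.core.windows.net/finance/payroll"

# Catalog-level storage only: the schema falls back to the catalog location.
inherited = resolve_managed_location(None, CATALOG_PATH, None)

# Schema-level storage takes precedence over the catalog location.
isolated = resolve_managed_location(SCHEMA_PATH, CATALOG_PATH, None)
```

In this model, a workspace with no metastore-level storage and no catalog-level or schema-level storage has nowhere to place managed data, which is why the sketch raises an error in that case.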

Volumes are the securable object that most Azure Databricks users should use to interact directly with non-tabular data in cloud object storage. See What are Unity Catalog volumes?.

Note

While Unity Catalog supports path-based access to external tables and external volumes using cloud storage URIs, Databricks recommends that users read and write all Unity Catalog tables using table names and access data in volumes using /Volumes paths.
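As a small illustration of this recommendation, the following Python sketch builds a three-level table name and a /Volumes path instead of a cloud storage URI. The helper functions and the catalog, schema, table, and volume names are hypothetical; the sketch assumes the usual /Volumes/<catalog>/<schema>/<volume>/<path> layout and simple identifiers that need no quoting.

```python
import posixpath


def table_name(catalog: str, schema: str, table: str) -> str:
    """Build a three-level Unity Catalog table name."""
    return f"{catalog}.{schema}.{table}"


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes path for a file or directory inside a volume."""
    return posixpath.join("/Volumes", catalog, schema, volume, *parts)


# Hypothetical object names, for illustration only.
orders = table_name("main", "sales", "orders")
raw_file = volume_path("main", "sales", "landing", "2024", "orders.csv")

# In a Databricks notebook, where `spark` is predefined, these could be used as:
#   df = spark.read.table(orders)
#   raw = spark.read.format("csv").option("header", "true").load(raw_file)
```

Referring to data this way keeps code independent of the underlying container and storage account, so the storage location can change without touching the code that reads it.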

Best practices for cloud storage with Unity Catalog

Azure Databricks requires using Azure Data Lake Storage Gen2 as the Azure storage service for data that is processed in Azure Databricks using Unity Catalog governance. Azure Data Lake Storage Gen2 enables you to separate storage and compute costs and take advantage of the fine-grained access control provided by Unity Catalog. If data is stored in OneLake (the Microsoft Fabric data lake) and processed by Databricks (bypassing Unity Catalog), you will incur bundled storage and compute costs. This can lead to costs that are approximately 3x higher for reads and 1.6x higher for writes compared to Azure Data Lake Storage Gen2 for storing, reading, and writing data. Azure Blob Storage is also incompatible with Unity Catalog.

| Feature | Azure Blob Storage | Azure Data Lake Storage Gen2 | OneLake |
|---|---|---|---|
| Supported by Unity Catalog | No | Yes | No |
| Requires additional Fabric capacity purchase | No | No | Yes |
| Supported operations from external engines | Read, Write | Read, Write | Read (reads incur 3x the cost compared to reading data from Azure Data Lake Storage Gen2). Writes are not supported. For details, see the OneLake documentation. |
| Deployment | Regional | Regional | Global |
| Authentication | Entra ID, Shared Access Signature | Entra ID, Shared Access Signature | Entra ID |
| Storage events | Yes | Yes | No |
| Soft delete | Yes | Yes | Yes |
| Access control | RBAC | RBAC, ABAC, ACL | RBAC (Table/folder only, shortcut ACLs not supported) |
| Encryption keys | Yes | Yes | No |
| Access tiers | Online archive | Hot, cool, cold, archive | Hot only |
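To make the cost comparison concrete, here is a rough worked example in Python. It is a back-of-the-envelope sketch, not a pricing calculator: it assumes the approximate multipliers quoted above mean "3 times" and "1.6 times" the Azure Data Lake Storage Gen2 cost, and the baseline amounts are made-up numbers in an arbitrary currency unit.

```python
# Approximate multipliers quoted in this article, interpreted here as
# "N times the Azure Data Lake Storage Gen2 cost" (an assumption).
ONELAKE_READ_MULTIPLIER = 3.0
ONELAKE_WRITE_MULTIPLIER = 1.6


def estimate_onelake_cost(adls_read_cost: float, adls_write_cost: float) -> float:
    """Scale an ADLS Gen2 read/write cost baseline by the quoted multipliers."""
    return (
        adls_read_cost * ONELAKE_READ_MULTIPLIER
        + adls_write_cost * ONELAKE_WRITE_MULTIPLIER
    )


# Hypothetical monthly baseline on ADLS Gen2: 100 units for reads, 50 for writes.
baseline = 100.0 + 50.0
estimate = estimate_onelake_cost(100.0, 50.0)

print(f"ADLS Gen2 baseline: {baseline:.2f}")
print(f"OneLake estimate:   {estimate:.2f}")
```

With these made-up inputs, a baseline of 150 units grows to roughly 380 units, and most of the increase comes from reads.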

Next steps

If you’re just getting started with Unity Catalog as an admin, see Set up and manage Unity Catalog.

If you’re a new user and your workspace is already enabled for Unity Catalog, see Tutorial: Create your first table and grant privileges.