Unity Catalog best practices

This document provides an opinionated perspective on how to best adopt Azure Databricks Unity Catalog and Delta Sharing to meet your data governance needs.

Configure a Unity Catalog metastore

Unity Catalog is a fine-grained governance solution for data and AI on the Databricks Lakehouse. It helps simplify security and governance of your data by providing a central place to administer and audit data access. The following diagram illustrates the main securable objects in Unity Catalog:

Unity Catalog object model diagram

A metastore is the top-level container of objects in Unity Catalog. It stores data assets (tables and views) and the permissions that govern access to them. Databricks account admins can create metastores and assign them to Databricks workspaces to control which workloads use each metastore. You create a single metastore in each region you operate and link it to all workspaces in that region. Therefore, if you have multiple regions using Databricks, you will have multiple metastores. To share data between metastores, see Delta Sharing.

Each metastore is configured with a root storage location, which is used for managed tables. You need to ensure that no users have direct access to this storage location. Giving access to the storage location could allow a user to bypass access controls in a Unity Catalog metastore and disrupt auditability. For these reasons, you should not reuse a container that is your current DBFS root file system or has previously been a DBFS root file system for the root storage location in your Unity Catalog metastore.

See Create a Unity Catalog metastore.

External locations and storage credentials

External locations and storage credentials allow Unity Catalog to read and write data on your cloud tenant on behalf of users.

A storage credential encapsulates a long-term cloud credential that provides access to cloud storage. It can either be an Azure managed identity (strongly recommended) or a service principal. Using an Azure managed identity has the following benefits over using a service principal:

  • You can connect to an Azure Data Lake Storage Gen2 account that is protected by a storage firewall.
  • Managed identities do not require you to maintain credentials or rotate secrets.

An external location is an object that combines a cloud storage path with a storage credential in order to authorize access to the cloud storage path.

Databricks recommends using external locations rather than using storage credentials directly. You should ensure that a limited number of users have direct access to a container that is being used as an external location. This is to limit users from bypassing access control in a Unity Catalog metastore and disrupting auditability. For these reasons, you should not mount storage accounts to DBFS that are being used as external locations.
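As an illustrative sketch, the following SQL defines an external location on top of an existing storage credential and grants a group limited access to it. All names here are placeholders, and the storage credential is assumed to exist already:

```sql
-- Define an external location that pairs a cloud storage path with an
-- existing storage credential (placeholder names throughout).
CREATE EXTERNAL LOCATION IF NOT EXISTS sales_raw_location
  URL 'abfss://raw@mystorageaccount.dfs.core.windows.net/sales'
  WITH (STORAGE CREDENTIAL my_managed_identity_credential);

-- Grant a limited set of privileges on the location to a group,
-- rather than granting on the storage credential directly.
GRANT READ FILES, CREATE EXTERNAL TABLE
  ON EXTERNAL LOCATION sales_raw_location
  TO `data-engineers`;
```

Granting on the external location rather than the credential keeps access scoped to one path, which is the recommendation above.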

Databricks recommends migrating mounts on cloud storage locations to external locations within Unity Catalog using Data Explorer.

See Manage external locations and storage credentials.

Organize your data

Databricks recommends using catalogs to provide segregation across your organization’s information architecture. Often this means that catalogs correspond to a software development environment scope, a team, or a business unit.
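For example, one catalog per environment and business unit might be set up as follows; the catalog and group names are hypothetical:

```sql
-- One catalog per environment/business-unit combination (hypothetical names).
CREATE CATALOG IF NOT EXISTS sales_dev;
CREATE CATALOG IF NOT EXISTS sales_prod;

-- Allow the sales team to see and use the development catalog.
GRANT USE CATALOG ON CATALOG sales_dev TO `sales-team`;
```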

Unity Catalog catalogs

A schema (also called a database) is the second layer of Unity Catalog’s three-level namespace and organizes tables and views. A table can be managed or external.

Managed tables are the default way to create tables in Unity Catalog. These tables are stored in the Unity Catalog root storage location that you configured when you created a metastore. Databricks recommends using managed tables whenever possible to ensure support of Unity Catalog features. All managed tables use Delta Lake.

External tables are tables whose data is stored in a storage location outside of the managed storage location. They aren’t fully managed by Unity Catalog. External tables support Delta Lake and many other data formats, including Parquet, JSON, and CSV. External tables are a good option for providing direct access to raw data.
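The difference can be sketched in SQL as follows. The table and path names are placeholders, and the external path must fall under an external location you have been granted access to:

```sql
-- Managed table: data lives in the metastore's root storage location.
CREATE TABLE my_catalog.my_schema.managed_sales (
  id BIGINT,
  total DECIMAL(10, 2)
);

-- External table: data lives at an explicit path outside managed storage.
CREATE TABLE my_catalog.my_schema.external_sales (
  id BIGINT,
  total DECIMAL(10, 2)
)
USING DELTA
LOCATION 'abfss://raw@mystorageaccount.dfs.core.windows.net/sales/external_sales';
```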

For more information on creating tables, see Create tables.

Manage external locations and external tables

The diagram below represents the filesystem hierarchy of a single cloud storage container.

External locations

Four external locations are created, all using a single storage credential. Users and groups can be granted access to the different external locations within a Unity Catalog metastore. This allows you to give specific groups access to different parts of the cloud storage container.

You can create external tables using a storage location in a Unity Catalog metastore. Those external tables can then be secured independently. Databricks recommends that you create external tables from one storage location within one schema.
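For instance, assuming an external location already covers the path below, an external table can be registered and then secured independently of file-level access to the location. The names are placeholders:

```sql
-- Register an external table under a path covered by an external location.
CREATE TABLE my_catalog.raw_schema.clickstream
USING DELTA
LOCATION 'abfss://raw@mystorageaccount.dfs.core.windows.net/clickstream';

-- Secure the table independently: analysts can query the table without
-- having any direct access to the underlying storage path.
GRANT SELECT ON my_catalog.raw_schema.clickstream TO `analysts`;
```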

Azure Databricks strongly recommends against registering common tables as external tables in more than one metastore due to the risk of consistency issues. For example, a change to the schema in one metastore will not register in the second metastore. Use Delta Sharing to share data between metastores. See Delta Sharing.

Configure access control

Each securable object in Unity Catalog has an owner. The principal that creates an object becomes its initial owner. An object’s owner has all privileges on the object, such as SELECT and MODIFY on a table, as well as the permission to grant privileges on the securable object to other principals. Only owners of a securable object have the permission to grant privileges on that object to other principals. Therefore, it is best practice to configure ownership on all objects to the group responsible for administration of grants on the object. Both the owner and metastore admins can transfer ownership of a securable object to a group. Additionally, if the object is contained within a catalog (like a table or view), the catalog and schema owner can change the ownership of the object.
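For example, ownership of an object can be transferred to the administering group with SQL like the following; the object and group names are placeholders:

```sql
-- Transfer ownership of a table to the group that administers its grants.
ALTER TABLE my_catalog.my_schema.sales OWNER TO `sales-admins`;

-- Schemas and catalogs can be reassigned the same way.
ALTER SCHEMA my_catalog.my_schema OWNER TO `sales-admins`;
ALTER CATALOG my_catalog OWNER TO `sales-admins`;
```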

Securable objects in Unity Catalog are hierarchical and privileges are inherited downward. This means that granting a privilege on a catalog or schema automatically grants the privilege to all current and future objects within the catalog or schema. For more information, see Inheritance model.

In order to read data from a table or view a user must have the following privileges:

  • SELECT on the table or view
  • USE SCHEMA on the schema that owns the table
  • USE CATALOG on the catalog that owns the schema

USE CATALOG enables the grantee to traverse the catalog in order to access its child objects, and USE SCHEMA enables the grantee to traverse the schema in order to access its child objects. For example, to select data from a table, users need the SELECT privilege on that table, the USE CATALOG privilege on its parent catalog, and the USE SCHEMA privilege on its parent schema. You can therefore use these privileges to restrict access to sections of your data namespace to specific groups. A common scenario is to set up a schema per team, where only that team has USE SCHEMA and CREATE on the schema. This means that any tables produced by team members can only be shared within the team.

You can secure access to a table using the following SQL syntax:

GRANT USE CATALOG ON CATALOG <catalog_name> TO <group_name>;
GRANT USE SCHEMA ON SCHEMA <catalog_name>.<schema_name> TO <group_name>;
GRANT SELECT ON <catalog_name>.<schema_name>.<table_name> TO <group_name>;

You can secure access to columns using a dynamic view in a secondary schema as shown in the following SQL syntax:

CREATE VIEW <catalog_name>.<schema_name>.<view_name> AS
SELECT
  id,
  CASE WHEN is_member('<group_name>') THEN email ELSE 'REDACTED' END AS email,
  country,
  product,
  total
FROM
  <catalog_name>.<schema_name>.<table_name>;
GRANT USE CATALOG ON CATALOG <catalog_name> TO <group_name>;
GRANT USE SCHEMA ON SCHEMA <catalog_name>.<schema_name> TO <group_name>;
GRANT SELECT ON <catalog_name>.<schema_name>.<view_name> TO <group_name>;

You can secure access to rows using a dynamic view in a secondary schema as shown in the following SQL syntax:

CREATE VIEW <catalog_name>.<schema_name>.<view_name> AS
SELECT
  *
FROM
  <catalog_name>.<schema_name>.<table_name>
WHERE
  CASE WHEN is_member('managers') THEN TRUE ELSE total <= 1000000 END;
GRANT USE CATALOG ON CATALOG <catalog_name> TO <group_name>;
GRANT USE SCHEMA ON SCHEMA <catalog_name>.<schema_name> TO <group_name>;
GRANT SELECT ON <catalog_name>.<schema_name>.<view_name> TO <group_name>;

See Manage privileges in Unity Catalog.

Manage cluster configurations

Databricks recommends using cluster policies to limit the ability to configure clusters based on a set of rules. Cluster policies let you restrict users to creating only clusters that are Unity Catalog-enabled. Using cluster policies reduces available choices, which greatly simplifies the cluster creation process for users and helps ensure that they can access data seamlessly. Cluster policies also enable you to control cost by limiting the maximum cost per cluster.

To ensure the integrity of access controls and enforce strong isolation guarantees, Unity Catalog imposes security requirements on compute resources. For this reason, Unity Catalog introduces the concept of a cluster’s access mode. Unity Catalog is secure by default; if a cluster is not configured with an appropriate access mode, the cluster can’t access data in Unity Catalog. See Cluster access modes for Unity Catalog.

Databricks recommends using the User Isolation access mode when sharing a cluster and the Single User access mode for automated jobs and machine learning workloads.

The JSON below provides a policy definition for a shared cluster with the User Isolation security mode:

{
  "spark_version": {
    "type": "regex",
    "pattern": "1[0-1]\\.[0-9]*\\.x-scala.*",
    "defaultValue": "10.4.x-scala2.12"
  },
  "access_mode": {
    "type": "fixed",
    "value": "USER_ISOLATION",
    "hidden": true
  }
}

The JSON below provides a policy definition for an automated job cluster with the Single User security mode:

{
  "spark_version": {
    "type": "regex",
    "pattern": "1[0-1]\\.[0-9].*",
    "defaultValue": "10.4.x-scala2.12"
  },
  "access_mode": {
    "type": "fixed",
    "value": "SINGLE_USER",
    "hidden": true
  },
  "single_user_name": {
    "type": "regex",
    "pattern": ".*",
    "hidden": true
  }
}

Audit access

A complete data governance solution requires auditing access to data and providing alerting and monitoring capabilities. Unity Catalog captures an audit log of actions performed against the metastore and these logs are delivered as part of Azure Databricks audit logs.

Make sure you configure audit logging in your Azure Databricks workspaces. After logging is enabled, Azure Databricks automatically starts sending diagnostic logs to the delivery location you specified.

See Monitoring Your Databricks Lakehouse Platform with Audit Logs for details on how to get complete visibility into critical events relating to your Databricks Lakehouse Platform.

Delta Sharing

Delta Sharing is an open protocol developed by Databricks for secure data sharing with other organizations or other departments within your organization, regardless of which computing platforms they use. When Delta Sharing is enabled on a metastore, Unity Catalog runs a Delta Sharing server.

To share data between metastores, you can use Databricks-to-Databricks Delta Sharing. This allows you to register tables from metastores in different regions. These tables appear as read-only objects in the consuming metastore and can be granted access like any other object within Unity Catalog.
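As a sketch, on the provider side a share can be created, populated, and granted to a recipient with SQL like the following. The share, table, and recipient names are hypothetical, and the recipient is assumed to have already been created for the consuming metastore:

```sql
-- Provider metastore: create a share and add a table to it.
CREATE SHARE IF NOT EXISTS sales_share;
ALTER SHARE sales_share ADD TABLE sales_catalog.sales_schema.transactions;

-- Grant access to a pre-created recipient representing the consuming metastore.
GRANT SELECT ON SHARE sales_share TO RECIPIENT other_region_consumer;
```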

When you use Databricks-to-Databricks Delta Sharing to share between metastores, keep in mind that access control is limited to one metastore. If a securable object, such as a table, has grants on it and that resource is shared to an intra-account metastore, the grants from the source do not apply to the destination share. The destination share has to set its own grants.

Learn more