What is Unity Catalog?
This article introduces Unity Catalog, the Azure Databricks data governance solution for the Lakehouse.
Overview of Unity Catalog
In Unity Catalog, admins and data stewards manage users and their access to data centrally across all of the workspaces in an Azure Databricks account. Users in different workspaces can share access to the same data, depending on privileges granted centrally in Unity Catalog.
Key features of Unity Catalog include:
- Define once, secure everywhere: Unity Catalog offers a single place to administer data access policies that apply across all workspaces and personas.
- Standards-compliant security model: Unity Catalog’s security model is based on standard ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, databases (also called schemas), tables, and views.
- Built-in auditing: Unity Catalog automatically captures user-level audit logs that record access to your data.
In Unity Catalog, the hierarchy of primary data objects flows from metastore to table:
- Metastore: The top-level container for metadata. Each metastore exposes a three-level namespace (catalog.schema.table) that organizes your data.
- Catalog: The first layer of the object hierarchy, used to organize your data assets.
- Schema: Also known as databases, schemas are the second layer of the object hierarchy and contain tables and views.
- Table: The lowest level in the object hierarchy. Tables can be external (stored in external locations in your cloud storage of choice) or managed (stored in a storage container in your cloud storage that you create expressly for Azure Databricks). You can also create read-only views from tables.
You reference all data in Unity Catalog using a three-level namespace.
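As a minimal sketch of what a three-level reference looks like, the helper below joins a catalog, schema, and table name into one qualified name. The function names and the `main`, `sales`, and `orders` identifiers are hypothetical and used only for illustration; the backtick quoting follows the usual Databricks SQL convention for identifiers.

```python
def quote_ident(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def full_name(catalog: str, schema: str, table: str) -> str:
    """Join the three namespace levels into catalog.schema.table."""
    return ".".join(quote_ident(part) for part in (catalog, schema, table))


# Hypothetical names, used only for illustration.
query = f"SELECT * FROM {full_name('main', 'sales', 'orders')}"
print(query)
```

Quoting each level separately keeps names that contain dots or hyphens from being misread as extra namespace levels.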
A metastore is the top-level container of objects in Unity Catalog. It stores data assets (tables and views) and the permissions that govern access to them. Azure Databricks account admins can create metastores and assign them to Azure Databricks workspaces to control which workloads use each metastore. For a workspace to use Unity Catalog, it must have a Unity Catalog metastore attached.
Each metastore is configured with a root storage location in an Azure Data Lake Storage Gen2 container in your Azure account. This storage location is used for metadata and managed tables data.
This metastore is distinct from the metastore included in Azure Databricks workspaces created before Unity Catalog was released. If your workspace includes a legacy Hive metastore, the data in that metastore is available in Unity Catalog in a catalog named
A catalog is the first layer of Unity Catalog’s three-level namespace. It’s used to organize your data assets. Users can see all catalogs on which they have been assigned the USAGE data permission.
A schema (also called a database) is the second layer of Unity Catalog’s three-level namespace. A schema organizes tables and views. To access (or list) a table or view in a schema, users must have the USAGE data permission on the schema and its parent catalog, and they must have the SELECT permission on the table or view.
A table resides in the third layer of Unity Catalog’s three-level namespace. It contains rows of data. To create a table, users must have CREATE and USAGE permissions on the schema, and they must have the USAGE permission on its parent catalog. To query a table, users must have the SELECT permission on the table, and they must have the USAGE permission on its parent schema and catalog.
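The query rule above can be read as a simple conjunction of three privileges. The following toy model illustrates that reading. It is a sketch, not how Unity Catalog evaluates permissions internally: it ignores ownership, admin roles, and inherited privileges, and the `Grant` tuple shape and all object names are assumptions made for the example.

```python
from typing import Set, Tuple

# Each grant is (privilege, securable type, securable name). Hypothetical shape.
Grant = Tuple[str, str, str]


def can_query(grants: Set[Grant], catalog: str, schema: str, table: str) -> bool:
    """Return True only if all three privileges described above are present."""
    required = {
        ("USAGE", "CATALOG", catalog),
        ("USAGE", "SCHEMA", f"{catalog}.{schema}"),
        ("SELECT", "TABLE", f"{catalog}.{schema}.{table}"),
    }
    # Subset test: every required grant must appear in the user's grants.
    return required <= grants


analyst_grants = {
    ("USAGE", "CATALOG", "main"),
    ("USAGE", "SCHEMA", "main.sales"),
}
print(can_query(analyst_grants, "main", "sales", "orders"))  # False: SELECT is missing

analyst_grants.add(("SELECT", "TABLE", "main.sales.orders"))
print(can_query(analyst_grants, "main", "sales", "orders"))  # True
```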
A table can be managed or external.
Managed tables are the default way to create tables in Unity Catalog. These tables are stored in the root storage location you configure when you create a metastore. They use the Delta table format.
When a managed table is dropped, its underlying data is deleted from your cloud tenant within 30 days.
See Managed tables.
External tables are tables whose data is stored outside of the root storage location. Use external tables only when you require direct access to the data using other tools.
When you drop an external table, Unity Catalog does not delete the underlying data. You can manage privileges on external tables and use them in queries in the same way as managed tables.
External tables can use the following file formats:
See External tables.
Storage credentials and external locations
To manage access to the underlying cloud storage for an external table, Unity Catalog introduces the following object types:
- Storage credentials encapsulate a long-term cloud credential that provides access to cloud storage. For example, an Azure managed identity that can access an Azure Data Lake Storage Gen2 container.
- External locations contain a reference to a storage credential and a cloud storage path.
A view is a read-only object created from one or more tables and views in a metastore. It resides in the third layer of Unity Catalog’s three-level namespace. A view can be created from tables and other views in multiple schemas and catalogs. You can create dynamic views to enable row- and column-level permissions.
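To make the idea of a dynamic view concrete, here is a hedged sketch that builds a `CREATE VIEW` statement which shows a column only to members of one group. The view name, source table, `order_id` and `email` columns, and the `auditors` group are all hypothetical. The sketch assumes the Databricks SQL function `is_account_group_member` is available in your runtime; check this before relying on it.

```python
def redacted_view_sql(view: str, source: str, group: str) -> str:
    """Build a CREATE VIEW statement that masks the email column for non-members."""
    return (
        f"CREATE VIEW {view} AS "
        f"SELECT order_id, "
        f"CASE WHEN is_account_group_member('{group}') THEN email "
        f"ELSE 'REDACTED' END AS email "
        f"FROM {source}"
    )


# Hypothetical object and group names, used only for illustration.
statement = redacted_view_sql(
    "main.sales.orders_redacted", "main.sales.orders", "auditors"
)
print(statement)
```

Because the group check runs when the view is queried, the same view can return different column values to different users.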
Unity Catalog uses the identities in the Azure Databricks account to resolve users, service principals, and groups, and to enforce permissions.
To configure identities in the account, follow the instructions in Manage users, service principals, and groups. Refer to those users, service principals, and groups when you create access-control policies in Unity Catalog.
Unity Catalog users, service principals, and groups must also be added to workspaces to access Unity Catalog data in a notebook, a Databricks SQL query, Data Explorer, or a REST API command. The assignment of users, service principals, and groups to workspaces is called identity federation.
All workspaces that have a Unity Catalog metastore attached to them are enabled for identity federation.
Special considerations for groups
Any groups that already exist in the workspace are labeled Workspace local in the account console. These workspace-local groups cannot be used in Unity Catalog to define access policies. You must use account-level groups. If a workspace-local group is referenced in a command, that command will return an error that the group was not found. If you previously used workspace-local groups to manage access to notebooks and other artifacts, these permissions remain in effect.
See Manage groups.
The following admin roles are required for managing Unity Catalog:
- Account admins can manage identities, cloud resources, and the creation of workspaces and Unity Catalog metastores. Account admins can enable workspaces for Unity Catalog. They can grant both workspace and metastore admin permissions.
- Metastore admins can manage privileges and ownership for all securable objects within a metastore, such as who can create catalogs or query a table. The account admin who creates the Unity Catalog metastore becomes the initial metastore admin. The metastore admin can also choose to delegate this role to another user or group. We recommend assigning the metastore admin to a group, in which case any member of the group receives the privileges of the metastore admin. See (Recommended) Transfer ownership of your metastore to a group.
- Workspace admins can add users to an Azure Databricks workspace, assign them the workspace admin role, and manage access to objects and functionality in the workspace, such as the ability to create clusters and change job ownership.
Data permissions in Unity Catalog
In Unity Catalog, data is secure by default. Initially, users have no access to data in a metastore. Access can be granted by either a metastore admin, the owner of an object, or the owner of the catalog or schema that contains the object. Securable objects in Unity Catalog are hierarchical and privileges are inherited downward.
You can assign and revoke permissions using Data Explorer, SQL commands, or REST APIs.
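For the SQL route, the sketch below generates `GRANT` and `REVOKE` statements as plain strings. The privilege names follow this article (USAGE and SELECT); the helper names, object names, and the `data-analysts` group are hypothetical, and the privilege names available in your workspace may differ.

```python
def grant_sql(privilege: str, securable_type: str, name: str, principal: str) -> str:
    """Build a GRANT statement for one privilege on one securable object."""
    return f"GRANT {privilege} ON {securable_type} {name} TO `{principal}`"


def revoke_sql(privilege: str, securable_type: str, name: str, principal: str) -> str:
    """Build the matching REVOKE statement."""
    return f"REVOKE {privilege} ON {securable_type} {name} FROM `{principal}`"


# Hypothetical objects and group: the three grants needed to query one table.
statements = [
    grant_sql("USAGE", "CATALOG", "main", "data-analysts"),
    grant_sql("USAGE", "SCHEMA", "main.sales", "data-analysts"),
    grant_sql("SELECT", "TABLE", "main.sales.orders", "data-analysts"),
]
for statement in statements:
    print(statement)
```

In a notebook attached to a Unity Catalog-capable cluster, each string could be passed to `spark.sql(...)`; that step is left out here so the sketch runs anywhere.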
To access data in Unity Catalog, clusters must be configured with the correct access mode. Unity Catalog is secure by default. If a cluster is not configured with one of the Unity-Catalog-capable access modes (that is, shared or single user), the cluster can’t access data in Unity Catalog.
Data lineage for Unity Catalog
You can use Unity Catalog to capture runtime data lineage across queries in any language executed on an Azure Databricks cluster or SQL warehouse. Lineage is captured down to the column level, and includes notebooks, workflows and dashboards related to the query. To learn more, see Capture and view data lineage with Unity Catalog.
How do I set up Unity Catalog for my organization?
To set up Unity Catalog for your organization, you do the following:
- Configure a storage container and Azure managed identity that Unity Catalog can use to store and access data in your Azure account.
- Create a metastore for each region in which your organization operates, and attach workspaces to the metastore. Each workspace will have the same view of the data you manage in Unity Catalog.
- If you have a new account, add users, groups, and service principals to your Azure Databricks account.
Next, you create and grant access to catalogs, schemas, and tables.
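The creation step can be sketched as three SQL statements, one per level of the namespace. The helper below only builds the statement text. The catalog, schema, table, and column names are hypothetical, and running the statements assumes a workspace that already has a metastore attached and a user with the required privileges.

```python
from typing import List, Tuple


def bootstrap_statements(
    catalog: str, schema: str, table: str, columns: List[Tuple[str, str]]
) -> List[str]:
    """Return CREATE statements for a catalog, a schema, and a managed table."""
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog}",
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}",
        f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table} ({column_list})",
    ]


# Hypothetical names and columns, used only for illustration.
setup = bootstrap_statements(
    "main", "sales", "orders", [("order_id", "BIGINT"), ("email", "STRING")]
)
for statement in setup:
    print(statement)
```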
For complete setup instructions, see Get started using Unity Catalog.