What is Unity Catalog?

This article introduces Unity Catalog, a unified governance solution for data and AI assets on the Databricks lakehouse.

Overview of Unity Catalog

Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces.

Unity Catalog diagram

Key features of Unity Catalog include:

  • Define once, secure everywhere: Unity Catalog offers a single place to administer data access policies that apply across all workspaces.
  • Standards-compliant security model: Unity Catalog’s security model is based on standard ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, databases (also called schemas), tables, and views.
  • Built-in auditing and lineage: Unity Catalog automatically captures user-level audit logs that record access to your data. Unity Catalog also captures lineage data that tracks how data assets are created and used across all languages.
  • Data discovery: Unity Catalog lets you tag and document data assets, and provides a search interface to help data consumers find data.
  • System tables (Public Preview): Unity Catalog lets you easily access and query your account’s operational data, including audit logs, billable usage, and lineage.

How does Unity Catalog govern access to data and AI assets in cloud object storage?

Databricks recommends that you configure all access to cloud object storage using Unity Catalog. See Connect to cloud object storage using Unity Catalog.

Unity Catalog introduces storage credentials and external locations to manage the relationship between data in Azure Databricks and cloud object storage. These objects are described in Storage credentials and external locations.

Note

Lakehouse Federation provides integrations to data in other external systems. These objects are not backed by cloud object storage.

The Unity Catalog object model

In Unity Catalog, the hierarchy of primary data objects flows from metastore to table or volume:

  • Metastore: The top-level container for metadata. Each metastore exposes a three-level namespace (catalog.schema.table) that organizes your data.
  • Catalog: The first layer of the object hierarchy, used to organize your data assets.
  • Schema: Also known as databases, schemas are the second layer of the object hierarchy and contain tables and views.
  • Tables, views, and volumes: At the lowest level in the data object hierarchy are tables, views, and volumes. Volumes provide governance for non-tabular data.
  • Models: Although they are not, strictly speaking, data assets, registered models can also be managed in Unity Catalog and reside at the lowest level in the object hierarchy.

Unity Catalog object model diagram

This is a simplified view of securable Unity Catalog objects. For more details, see Securable objects in Unity Catalog.

You reference all data in Unity Catalog using a three-level namespace: catalog.schema.asset, where asset can be a table, view, volume, or model.
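For example, a query against a table and a file listing in a volume might look like the following. The catalog, schema, and object names here are hypothetical:

```sql
-- Query a table using its three-level name (all names are illustrative).
SELECT * FROM sales.transactions.orders;

-- Files in a volume are addressed by a path built from the same namespace.
LIST '/Volumes/sales/transactions/raw_files/';
```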

Metastores

A metastore is the top-level container of objects in Unity Catalog. It registers metadata about data and AI assets and the permissions that govern access to them. Azure Databricks account admins should create one metastore for each region in which they operate and assign them to Azure Databricks workspaces in the same region. For a workspace to use Unity Catalog, it must have a Unity Catalog metastore attached.

A metastore can optionally be configured with a managed storage location in an Azure Data Lake Storage Gen2 container or Cloudflare R2 bucket in your own cloud storage account. See Managed storage.

Note

This metastore is distinct from the Hive metastore included in Azure Databricks workspaces that have not been enabled for Unity Catalog. If your workspace includes a legacy Hive metastore, the data in that metastore will still be available alongside data defined in Unity Catalog, in a catalog named hive_metastore. Note that the hive_metastore catalog is not managed by Unity Catalog and does not benefit from the same feature set as catalogs defined in Unity Catalog.

See Create a Unity Catalog metastore.

Catalogs

A catalog is the first layer of Unity Catalog’s three-level namespace. It’s used to organize your data assets. Users can see all catalogs on which they have been assigned the USE CATALOG data permission.

Depending on how your workspace was created and enabled for Unity Catalog, your users may have default permissions on automatically provisioned catalogs, including either the main catalog or the workspace catalog (<workspace-name>). For more information, see Default user privileges.

See Create and manage catalogs.

Schemas

A schema (also called a database) is the second layer of Unity Catalog’s three-level namespace. A schema organizes tables and views. Users can see all schemas on which they have been assigned the USE SCHEMA permission, provided they also have the USE CATALOG permission on the schema’s parent catalog. To access or list a table or view in a schema, users must also have the SELECT permission on the table or view.

If your workspace was enabled for Unity Catalog manually, it includes a default schema named default in the main catalog that is accessible to all users in your workspace. If your workspace was enabled for Unity Catalog automatically and includes a <workspace-name> catalog, that catalog contains a schema named default that is accessible to all users in your workspace.

See Create and manage schemas (databases).

Tables

A table resides in the third layer of Unity Catalog’s three-level namespace. It contains rows of data. To create a table, users must have CREATE and USE SCHEMA permissions on the schema, and they must have the USE CATALOG permission on its parent catalog. To query a table, users must have the SELECT permission on the table, the USE SCHEMA permission on its parent schema, and the USE CATALOG permission on its parent catalog.
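The permission chain described above can be granted with standard SQL. A minimal sketch, using hypothetical catalog, schema, table, and group names:

```sql
-- Each level of the hierarchy needs its own grant before a user can query.
GRANT USE CATALOG ON CATALOG sales TO `analysts`;
GRANT USE SCHEMA ON SCHEMA sales.transactions TO `analysts`;
GRANT SELECT ON TABLE sales.transactions.orders TO `analysts`;
```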

A table can be managed or external.

Managed tables

Managed tables are the default way to create tables in Unity Catalog. Unity Catalog manages the lifecycle and file layout for these tables. You should not use tools outside of Azure Databricks to manipulate files in these tables directly. Managed tables always use the Delta table format.

For workspaces that were enabled for Unity Catalog manually, managed tables are stored in the root storage location that you configure when you create a metastore. You can optionally specify managed table storage locations at the catalog or schema levels, overriding the root storage location.

For workspaces that were enabled for Unity Catalog automatically, the metastore root storage location is optional, and managed tables are typically stored at the catalog or schema levels.

When a managed table is dropped, its underlying data is deleted from your cloud tenant within 30 days.

See Managed tables.

External tables

External tables are tables whose data lifecycle and file layout are not managed by Unity Catalog. Use external tables to register large amounts of existing data in Unity Catalog, or if you require direct access to the data using tools outside of Azure Databricks clusters or Databricks SQL warehouses.

When you drop an external table, Unity Catalog does not delete the underlying data. You can manage privileges on external tables and use them in queries in the same way as managed tables.

External tables can use the following file formats:

  • DELTA
  • CSV
  • JSON
  • AVRO
  • PARQUET
  • ORC
  • TEXT

See External tables.
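As a sketch, an external table can be registered over existing files using CREATE TABLE with a LOCATION clause. The names and storage path below are hypothetical, and the path must fall within an external location you can access:

```sql
-- Register existing Parquet files as an external table (illustrative names).
CREATE TABLE sales.transactions.orders_raw
USING PARQUET
LOCATION 'abfss://data@mystorageaccount.dfs.core.windows.net/orders/';
```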

Views

A view is a read-only object created from one or more tables and views in a metastore. It resides in the third layer of Unity Catalog’s three-level namespace. A view can be created from tables and other views in multiple schemas and catalogs. You can create dynamic views to enable row- and column-level permissions.

See Create a dynamic view.
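For example, a dynamic view can redact a column for users outside a given account-level group using the is_account_group_member function. The names below are hypothetical:

```sql
-- Members of `auditors` see the email column; everyone else sees a mask.
CREATE VIEW sales.transactions.orders_redacted AS
SELECT
  order_id,
  CASE WHEN is_account_group_member('auditors') THEN customer_email
       ELSE 'REDACTED' END AS customer_email
FROM sales.transactions.orders;
```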

Volumes

Important

This feature is in Public Preview.

A volume resides in the third layer of Unity Catalog’s three-level namespace. Volumes are siblings to tables, views, and other objects organized under a schema in Unity Catalog.

Volumes contain directories and files for data stored in any format. Volumes provide non-tabular access to data, meaning that files in volumes cannot be registered as tables.

  • To create a volume, users must have CREATE VOLUME and USE SCHEMA permissions on the schema, and they must have the USE CATALOG permission on its parent catalog.
  • To read files and directories stored inside a volume, users must have the READ VOLUME permission, the USE SCHEMA permission on its parent schema, and the USE CATALOG permission on its parent catalog.
  • To add, remove, or modify files and directories stored inside a volume, users must have WRITE VOLUME permission, the USE SCHEMA permission on its parent schema, and the USE CATALOG permission on its parent catalog.
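A sketch of creating a managed volume and granting read access on it, with hypothetical names:

```sql
-- Requires CREATE VOLUME and USE SCHEMA on the schema,
-- plus USE CATALOG on the parent catalog.
CREATE VOLUME sales.transactions.raw_files;

-- Allow a group to read files and directories in the volume.
GRANT READ VOLUME ON VOLUME sales.transactions.raw_files TO `analysts`;
```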

A volume can be managed or external.

Note

When you define a volume, cloud URI access to data under the volume path is governed by the permissions of the volume.

Managed volumes

Managed volumes are a convenient solution when you want to provision a governed location for working with non-tabular files.

Managed volumes store files in the Unity Catalog default storage location for the schema in which they’re contained. For workspaces that were enabled for Unity Catalog manually, managed volumes are stored in the root storage location that you configure when you create a metastore. You can optionally specify managed volume storage locations at the catalog or schema levels, overriding the root storage location. For workspaces that were enabled for Unity Catalog automatically, the metastore root storage location is optional, and managed volumes are typically stored at the catalog or schema levels.

The following precedence governs which location is used for a managed volume:

  • Schema location
  • Catalog location
  • Unity Catalog metastore root storage location

When you delete a managed volume, the files stored in this volume are also deleted from your cloud tenant within 30 days.

See What is a managed volume?.

External volumes

An external volume is registered to a Unity Catalog external location and provides access to existing files in cloud storage without requiring data migration. Users must have the CREATE EXTERNAL VOLUME permission on the external location to create an external volume.

External volumes support scenarios where files are produced by other systems and staged for access from within Azure Databricks using object storage or where tools outside Azure Databricks require direct file access.

Unity Catalog does not manage the lifecycle and layout of the files in external volumes. When you drop an external volume, Unity Catalog does not delete the underlying data.
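A sketch of creating an external volume; the names and storage path are hypothetical, and the path must be contained within an external location on which you have the CREATE EXTERNAL VOLUME permission:

```sql
-- Register existing files in cloud storage as an external volume.
CREATE EXTERNAL VOLUME sales.transactions.landing
LOCATION 'abfss://data@mystorageaccount.dfs.core.windows.net/landing/';
```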

See What is an external volume?.

Models

A model resides in the third layer of Unity Catalog’s three-level namespace. In this context, “model” refers to a machine learning model that is registered in the MLflow Model Registry. To create a model in Unity Catalog, users must have the CREATE MODEL privilege for the catalog or schema. The user must also have the USE CATALOG privilege on the parent catalog and USE SCHEMA on the parent schema.

Managed storage

You can store managed tables and managed volumes at any of these levels in the Unity Catalog object hierarchy: metastore, catalog, or schema. Storage at lower levels in the hierarchy overrides storage defined at higher levels.

When an account admin creates a metastore manually, they have the option to assign a storage location in an Azure Data Lake Storage Gen2 container or Cloudflare R2 bucket in your own cloud storage account to use as metastore-level storage for managed tables and volumes. If a metastore-level managed storage location has been assigned, then managed storage locations at the catalog and schema levels are optional. That said, metastore-level storage is optional, and Databricks recommends assigning managed storage at the catalog level for logical data isolation. See Data governance and data isolation building blocks.

Important

If your workspace was enabled for Unity Catalog automatically, the Unity Catalog metastore was created without metastore-level managed storage. You can opt to add metastore-level storage, but Databricks recommends assigning managed storage at the catalog and schema levels. For help deciding whether you need metastore-level storage, see (Optional) Create metastore-level storage and Data is physically separated in storage.

Managed storage has the following properties:

  • Managed tables and managed volumes store data and metadata files in managed storage.
  • Managed storage locations cannot overlap with external tables or external volumes.

Managed storage is declared differently at each level of the object hierarchy, and each level has a different relationship to external locations:

  • Metastore: Configured by an account admin during metastore creation, or added after creation if no storage was specified at that time. Metastore-level managed storage cannot overlap an external location.
  • Catalog: Specified during catalog creation using the MANAGED LOCATION keyword. The location must be contained within an external location.
  • Schema: Specified during schema creation using the MANAGED LOCATION keyword. The location must be contained within an external location.
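The MANAGED LOCATION keyword can be used as follows; the names and storage paths are hypothetical, and each path must be contained within an existing external location:

```sql
-- Catalog-level managed storage overrides the metastore root location.
CREATE CATALOG sales
MANAGED LOCATION 'abfss://data@mystorageaccount.dfs.core.windows.net/sales/';

-- Schema-level managed storage overrides the catalog location.
CREATE SCHEMA sales.transactions
MANAGED LOCATION 'abfss://data@mystorageaccount.dfs.core.windows.net/sales/transactions/';
```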

The following rules determine which managed storage location is used for a managed table’s or managed volume’s data and metadata:

  • If the containing schema has a managed location, the data is stored in the schema managed location.
  • If the containing schema does not have a managed location but the catalog has a managed location, the data is stored in the catalog managed location.
  • If neither the containing schema nor the containing catalog have a managed location, data is stored in the metastore managed location.

Storage credentials and external locations

To manage access to the underlying cloud storage for external tables, external volumes, and managed storage, Unity Catalog uses the following object types:

  • Storage credential: Encapsulates a long-term cloud credential that provides access to cloud storage.
  • External location: References a cloud storage path together with the storage credential that authorizes access to it.

See Connect to cloud object storage using Unity Catalog.

Identity management for Unity Catalog

Unity Catalog uses the identities in the Azure Databricks account to resolve users, service principals, and groups, and to enforce permissions.

To configure identities in the account, follow the instructions in Manage users, service principals, and groups. Refer to those users, service principals, and groups when you create access-control policies in Unity Catalog.

Unity Catalog users, service principals, and groups must also be added to workspaces to access Unity Catalog data in a notebook, a Databricks SQL query, Catalog Explorer, or a REST API command. The assignment of users, service principals, and groups to workspaces is called identity federation.

All workspaces that have a Unity Catalog metastore attached to them are enabled for identity federation.

Special considerations for groups

Any groups that already exist in the workspace are labeled Workspace local in the account console. These workspace-local groups cannot be used in Unity Catalog to define access policies. You must use account-level groups. If a workspace-local group is referenced in a command, that command will return an error that the group was not found. If you previously used workspace-local groups to manage access to notebooks and other artifacts, these permissions remain in effect.

See Manage groups.

Admin roles for Unity Catalog

Account admins, metastore admins, and workspace admins are all involved in managing Unity Catalog.

See Admin privileges in Unity Catalog.

Data permissions in Unity Catalog

In Unity Catalog, data is secure by default. Initially, users have no access to data in a metastore. Access can be granted by either a metastore admin, the owner of an object, or the owner of the catalog or schema that contains the object. Securable objects in Unity Catalog are hierarchical and privileges are inherited downward.

You can assign and revoke permissions using Catalog Explorer, SQL commands, or REST APIs.
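Because privileges are inherited downward, a grant at the schema level applies to current and future tables in that schema. A sketch with hypothetical names:

```sql
-- SELECT on the schema is inherited by every table and view it contains.
GRANT SELECT ON SCHEMA sales.transactions TO `analysts`;

-- Revoking at the same level removes the inherited access.
REVOKE SELECT ON SCHEMA sales.transactions FROM `analysts`;
```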

See Manage privileges in Unity Catalog.

Supported compute and cluster access modes for Unity Catalog

Unity Catalog is supported on clusters that run Databricks Runtime 11.3 LTS or above. Unity Catalog is supported by default on all SQL warehouse compute versions.

Clusters running on earlier versions of Databricks Runtime do not provide support for all Unity Catalog GA features and functionality.

To access data in Unity Catalog, clusters must be configured with the correct access mode. Unity Catalog is secure by default. If a cluster is not configured with one of the Unity-Catalog-capable access modes (that is, shared or assigned), the cluster can’t access data in Unity Catalog. See Access modes.

For detailed information about Unity Catalog functionality changes in each Databricks Runtime version, see the release notes.

Limitations for Unity Catalog vary by access mode and Databricks Runtime version. See Compute access mode limitations for Unity Catalog.

Data lineage for Unity Catalog

You can use Unity Catalog to capture runtime data lineage across queries in any language executed on an Azure Databricks cluster or SQL warehouse. Lineage is captured down to the column level, and includes notebooks, workflows and dashboards related to the query. To learn more, see Capture and view data lineage using Unity Catalog.

Lakehouse Federation and Unity Catalog

Lakehouse Federation is the query federation platform for Azure Databricks. The term query federation describes a collection of features that enable users and systems to run queries against multiple siloed data sources without needing to migrate all data to a unified system.

Azure Databricks uses Unity Catalog to manage query federation. You use Unity Catalog to configure read-only connections to popular external database systems and create foreign catalogs that mirror external databases. Unity Catalog’s data governance and data lineage tools ensure that data access is managed and audited for all federated queries made by the users in your Azure Databricks workspaces.
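As a sketch, a read-only connection and a foreign catalog might be set up as follows. The connection type, host, secret scope, and all names are hypothetical:

```sql
-- Create a connection to an external PostgreSQL database (illustrative values).
CREATE CONNECTION postgres_conn TYPE postgresql
OPTIONS (
  host 'db.example.com',
  port '5432',
  user 'reader',
  password secret('my_scope', 'postgres_pw')
);

-- Mirror one database from that system as a read-only foreign catalog.
CREATE FOREIGN CATALOG pg_sales USING CONNECTION postgres_conn
OPTIONS (database 'sales');
```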

See What is Lakehouse Federation.

How do I set up Unity Catalog for my organization?

To learn how to set up Unity Catalog, see Set up and manage Unity Catalog.

Supported regions

All regions support Unity Catalog. For details, see Azure Databricks regions.

Supported data file formats

Managed tables in Unity Catalog must use the Delta table format. External tables can also use the CSV, JSON, Avro, Parquet, ORC, and text formats. See External tables.

Unity Catalog limitations

Unity Catalog has the following limitations.

Note

If your cluster is running a Databricks Runtime version below 11.3 LTS, there may be additional limitations not listed here. Unity Catalog is supported on Databricks Runtime 11.3 LTS or above.

Unity Catalog limitations vary by Databricks Runtime and access mode. Structured Streaming workloads have additional limitations based on Databricks Runtime and access mode. See Compute access mode limitations for Unity Catalog.

  • Workloads in R do not support the use of dynamic views for row-level or column-level security.

  • In Databricks Runtime 13.1 and above, shallow clones are supported to create Unity Catalog managed tables from existing Unity Catalog managed tables. In Databricks Runtime 13.0 and below, there is no support for shallow clones in Unity Catalog. See Shallow clone for Unity Catalog tables.

  • Bucketing is not supported for Unity Catalog tables. Commands that try to create a bucketed table in Unity Catalog throw an exception.

  • Writing to the same path or Delta Lake table from workspaces in multiple regions can lead to unreliable performance if some clusters access Unity Catalog and others do not.

  • Custom partition schemes created using commands like ALTER TABLE ADD PARTITION are not supported for tables in Unity Catalog. Unity Catalog can access tables that use directory-style partitioning.

  • Overwrite mode for DataFrame write operations into Unity Catalog is supported only for Delta tables, not for other file formats. The user must have the CREATE privilege on the parent schema and must be the owner of the existing object or have the MODIFY privilege on the object.

  • In Databricks Runtime 13.2 and above, Python scalar UDFs are supported. In Databricks Runtime 13.1 and below, you cannot use Python UDFs, including UDAFs, UDTFs, and Pandas on Spark (applyInPandas and mapInPandas).

  • In Databricks Runtime 14.2 and above, Scala scalar UDFs are supported on shared clusters. In Databricks Runtime 14.1 and below, Scala UDFs are not supported on shared clusters.

  • Groups that were previously created in a workspace (that is, workspace-level groups) cannot be used in Unity Catalog GRANT statements. This is to ensure a consistent view of groups that can span across workspaces. To use groups in GRANT statements, create your groups at the account level and update any automation for principal or group management (such as SCIM, Okta and Microsoft Entra ID (formerly Azure Active Directory) connectors, and Terraform) to reference account endpoints instead of workspace endpoints. See Difference between account groups and workspace-local groups.

  • Standard Scala thread pools are not supported. Instead, use the special thread pools in org.apache.spark.util.ThreadUtils, for example, org.apache.spark.util.ThreadUtils.newDaemonFixedThreadPool. However, the following thread pools in ThreadUtils are not supported: ThreadUtils.newForkJoinPool and any ScheduledExecutorService thread pool.

  • Audit logging is supported for Unity Catalog events at the workspace level only. Events that take place at the account level without reference to a workspace, such as creating a metastore, are not logged.

The following limitations apply for all object names in Unity Catalog:

  • Object names cannot exceed 255 characters.
  • The following special characters are not allowed:
    • Period (.)
    • Space ( )
    • Forward slash (/)
    • All ASCII control characters (00-1F hex)
    • The DELETE character (7F hex)
  • Unity Catalog stores all object names as lowercase.
  • When referencing Unity Catalog names in SQL, you must use backticks to escape names that contain special characters such as hyphens (-).

Note

Column names can use special characters, but the name must be escaped with backticks in all SQL statements if special characters are used. Unity Catalog preserves column name casing, but queries against Unity Catalog tables are case-insensitive.
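For example, names containing special characters must be backtick-escaped in SQL; the names below are hypothetical:

```sql
-- Escape the hyphenated table name and the special-character column name.
SELECT `order-id`, `Customer Email`
FROM main.default.`order-events`;
```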

Additional limitations exist for models in Unity Catalog. See Limitations on Unity Catalog support.

Resource quotas

Unity Catalog enforces resource quotas on all securable objects. Limits respect the same hierarchical organization throughout Unity Catalog. If you expect to exceed these resource limits, contact your Azure Databricks account team.

Quota values below are expressed relative to the parent (or grandparent) object in Unity Catalog.

  • Tables per schema: 10,000
  • Tables per metastore: 100,000
  • Volumes per schema: 10,000
  • Functions per schema: 10,000
  • Registered models per schema: 1,000
  • Registered models per metastore: 5,000
  • Model versions per registered model: 10,000
  • Model versions per metastore: 100,000
  • Schemas per catalog: 10,000
  • Catalogs per metastore: 1,000
  • Connections per metastore: 1,000
  • Storage credentials per metastore: 200
  • External locations per metastore: 500

For Delta Sharing limits, see Resource quotas.