Get started using Unity Catalog

This article provides step-by-step instructions for setting up Unity Catalog for your organization. It describes how to enable your Azure Databricks account to use Unity Catalog and how to create your first tables in Unity Catalog.

Overview of Unity Catalog setup

This section provides a high-level overview of how to set up your Azure Databricks account to use Unity Catalog and create your first tables. For detailed step-by-step instructions, see the sections that follow this one.

To enable your Azure Databricks account to use Unity Catalog, you do the following:

  1. Configure a storage container and Azure managed identity that Unity Catalog can use to store and access data in your Azure account.

  2. Create a metastore for each region in which your organization operates. This metastore functions as the top-level container for all of your data in Unity Catalog.

    As the creator of the metastore, you are its owner and metastore admin.

  3. Attach workspaces to the metastore. Each workspace will have the same view of the data that you manage in Unity Catalog.

  4. Add users, groups, and service principals to your Azure Databricks account.

    For existing Azure Databricks accounts, these identities are already present.

  5. (Optional) Transfer your metastore admin role to a group.

To set up data access for your users, you do the following:

  1. In a workspace, create at least one compute resource: either a cluster or SQL warehouse.

    You will use this compute resource when you run queries and commands, including grant statements on data objects that are secured in Unity Catalog.

  2. Create at least one catalog.

    Catalogs hold the schemas (databases) that in turn hold the tables that your users work with.

  3. Create at least one schema.

  4. Create tables.

For each level in the data hierarchy (catalogs, schemas, tables), you grant privileges to users, groups, or service principals. You can also grant row- or column-level privileges using dynamic views.
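
For example, after you create a compute resource, a metastore admin could build out this hierarchy and grant privileges at each level with SQL commands like the following sketch. The catalog, schema, table, view, and group names here are hypothetical placeholders.

SQL

-- Create a catalog and a schema, then grant privileges at each level.
CREATE CATALOG IF NOT EXISTS sales;
CREATE SCHEMA IF NOT EXISTS sales.reporting;

GRANT USE CATALOG ON CATALOG sales TO `data-consumers`;
GRANT USE SCHEMA ON SCHEMA sales.reporting TO `data-consumers`;
GRANT SELECT ON TABLE sales.reporting.orders TO `data-consumers`;

-- A dynamic view that hides rows from users outside a privileged group.
CREATE VIEW sales.reporting.orders_emea AS
SELECT * FROM sales.reporting.orders
WHERE CASE
  WHEN is_account_group_member('admins') THEN TRUE
  ELSE region = 'EMEA'
END;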

Requirements

  • You must be an Azure Databricks account admin.

    The first Azure Databricks account admin must be an Azure Active Directory Global Administrator at the time that they first log in to the Azure Databricks account console. Upon first login, that user becomes an Azure Databricks account admin and no longer needs the Azure Active Directory Global Administrator role to access the Azure Databricks account. The first account admin can assign users in the Azure Active Directory tenant as additional account admins (who can themselves assign more account admins). Additional account admins do not require specific roles in Azure Active Directory.

  • Your Azure Databricks account must be on the Premium plan.

  • In your Azure tenant, you must have permission to create the resources described in the next section: a storage account to use with Azure Data Lake Storage Gen2, a storage container in that account, and an Azure Databricks access connector.

Configure and grant access to Azure storage for your metastore

In this step, you create a storage account and container for the metadata and tables that will be managed by the Unity Catalog metastore, create an Azure Databricks access connector that holds a system-assigned managed identity, and give that managed identity access to the storage container.

  1. Create a storage account for Azure Data Lake Storage Gen2.

    This storage account will contain metadata related to Unity Catalog metastores and their objects, as well as the data for managed tables in Unity Catalog. See Create a storage account to use with Azure Data Lake Storage Gen2. Make a note of the region where you created the storage account.

  2. Create a storage container that will hold your Unity Catalog metastore’s metadata and managed tables.

    You can create only one metastore per region. Databricks recommends using the same region for your metastore and storage container.

    This default storage location can be overridden at the catalog and schema levels (see the example after these steps).

    Make a note of the ADLS Gen2 URI for the container, which uses the following format:

    abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<metastore-name>
    

    In the steps that follow, replace <storage-container> with this URI.

  3. In Azure, create an Azure Databricks access connector that holds a managed identity and give it access to the storage container.

    See Use Azure managed identities in Unity Catalog to access storage.
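
The storage-location override mentioned in step 2 is specified when you create a catalog or schema. The following is a minimal sketch, assuming hypothetical container and storage account names and that the location has been made accessible to Unity Catalog:

SQL

-- Override the metastore's default storage location for one catalog.
CREATE CATALOG IF NOT EXISTS finance
MANAGED LOCATION 'abfss://finance-data@mystorageaccount.dfs.core.windows.net/managed';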

Create your first metastore and attach a workspace

To use Unity Catalog, you must create a metastore. A metastore is the top-level container for data in Unity Catalog. Each metastore exposes a three-level namespace (catalog.schema.table) by which data can be organized.

You create a metastore for each region in which your organization operates. You can link each of these regional metastores to any number of workspaces in that region.

Each linked workspace has the same view of the data in the metastore, and data access control can be managed across workspaces.

You can access data across metastores using Delta Sharing.
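
For example, once a workspace is attached to a metastore, you can browse the metastore and query any table you have access to by its fully qualified name. The table below is the one you create later in this article:

SQL

SHOW CATALOGS;
SELECT * FROM main.default.department;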

To create a metastore:

  1. Make sure that you have the path to the storage container and the resource ID of the Azure Databricks access connector that you created in the previous task.

  2. Log in to your workspace as an account admin.

  3. Click your username in the top bar of the Azure Databricks workspace and select Manage Account.

  4. Log in to the Azure Databricks account console.

  5. Click Data.

  6. Click Create Metastore.

  7. Enter values for the following fields:

    • Name for the metastore.

    • Region where the metastore will be deployed.

      This must be in the same region as the workspaces you want to use to access the data. Make sure that this matches the region of the storage container you created earlier.

    • ADLS Gen 2 path: Enter the path to the storage container that you will use as root storage for the metastore.

      The abfss:// prefix is added automatically.

    • Access Connector ID: Enter the Azure Databricks access connector’s resource ID in the format:

      /subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>
      
  8. Click Create.

  9. When prompted, select workspaces to link to the metastore.

    For more information about assigning workspaces to metastores, see Enable a workspace for Unity Catalog.

The user who creates a metastore is its owner, also called the metastore admin. The metastore admin can create top-level objects in the metastore such as catalogs and can manage access to tables and other objects. Databricks recommends that you reassign the metastore admin role to a group. See (Recommended) Transfer ownership of your metastore to a group.

Add users and groups

A Unity Catalog metastore can be shared across multiple Azure Databricks workspaces. Unity Catalog takes advantage of Azure Databricks account-level identity management to provide a consistent view of users, service principals, and groups across all workspaces. In this step, you create users and groups in the account console and then choose the workspaces these identities can access.

Note

  • If you have an existing account and workspaces, you probably already have existing users and groups in your account, so you can skip this step.
  • If you have a large number of users or groups in your account, or if you prefer to manage identities outside of Azure Databricks, you can sync users and groups from Azure AD.

Requirements

If you are adding identities to a new Azure Databricks account for the first time, you must have the Contributor role in the Azure Active Directory root management group, which is named Tenant root group by default.

Only the initial Azure Databricks account admin needs this role. Any Azure Active Directory Global Administrator can grant themselves this role on the root management group. If you do not have this role, grant it to yourself or ask an Azure Active Directory Global Administrator to grant it to you.

The initial account-level admin can add users or groups to the account and can designate other account-level admins by granting the Admin role to users.

Step-by-step

To add a user and group using the account console:

  1. Log in to the account console.
  2. Click User management.
  3. Add a user:
    1. Click Users.
    2. Click Add User.
    3. Enter a name and email address for the user.
    4. Click Send Invite.
  4. Add a group:
    1. Click Groups.
    2. Click Add Group.
    3. Enter a name for the group.
    4. Click Confirm.
    5. When prompted, add users to the group.
  5. Add a user or group to a workspace, where they can perform data science, data engineering, and data analysis tasks using the data managed by Unity Catalog:
    1. In the sidebar, click Workspaces.
    2. On the Permissions tab, click Add permissions.
    3. Search for and select the user or group, assign the permission level (workspace User or Admin), and click Save.
  6. To designate additional account-level admins:
    1. As an account admin, log in to the account console.
    2. Click User management.
    3. Find and click the username.
    4. On the Roles tab, turn on Account admin.

To get started, create a group called data-consumers. This group is used later in this walkthrough.
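
As an optional check once you have a compute resource, you can confirm the group from a notebook or the SQL editor with a SHOW GROUPS statement:

SQL

-- Verify that the group is visible to Unity Catalog.
SHOW GROUPS LIKE 'data-consumers';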

Create a cluster or SQL warehouse

Tables defined in Unity Catalog are protected by fine-grained access controls. To ensure that access controls are enforced, Unity Catalog requires compute resources to conform to a secure configuration. Unity Catalog is secure by default, meaning that non-conforming compute resources cannot access tables in Unity Catalog.

Azure Databricks provides two kinds of compute resources:

  • Clusters, which are used for workloads in the Data Science & Engineering and Databricks Machine Learning persona-based environments.
  • SQL warehouses, which are used for executing queries in Databricks SQL.

You can use either kind of compute resource to work with Unity Catalog, depending on which environment you are using.

Create a cluster

To create a cluster that can access Unity Catalog:

  1. Log in to your workspace as a workspace admin or user with permission to create clusters.
  2. Click Compute.
  3. Click Create compute.
    1. Enter a name for the cluster.

    2. Set the Access mode to Single user.

      Only Single user and Shared access modes support Unity Catalog. See What is cluster access mode?.

    3. Set Databricks runtime version to Runtime: 11.1 (Scala 2.12, Spark 3.3.0) or higher.

  4. Click Create Cluster.

For specific configuration options, see Create a cluster.

Create a SQL warehouse

SQL warehouses support Unity Catalog by default, and there is no special configuration required.

To create a SQL warehouse:

  1. Log in to your workspace as a workspace admin or user with permission to create SQL warehouses.
  2. From the persona switcher, select SQL.
  3. Click Create and select SQL Warehouse.

For specific configuration options, see Create a SQL warehouse.

Create your first table

In Unity Catalog, metastores contain catalogs that contain schemas (databases), and you always create a table in a schema.

You can refer to a table using three-level notation:

<catalog>.<schema>.<table>

A newly created metastore contains a catalog named main with an empty schema named default. In this example, you will create a table named department in the default schema in the main catalog.

To create a table, you must have the CREATE TABLE and USE SCHEMA permissions on the parent schema and the USE CATALOG permission on the parent catalog. Metastore admins have these permissions by default.

The main catalog and main.default schema are unique in that all users begin with the USE CATALOG privilege on the main catalog and the USE SCHEMA privilege on the main.default schema. If you are not a metastore admin, either a metastore admin or the owner of the schema can grant you the CREATE TABLE privilege on the main.default schema.

Follow these steps to create a table manually. Alternatively, you can import and run an example notebook that creates a catalog, schema, and table and manages permissions on each.

  1. Create a notebook and attach it to the cluster you created in Create a cluster or SQL warehouse.

    For the notebook language, select SQL, Python, R, or Scala, depending on the language you want to use.

  2. Grant permission to create tables on the default schema.

    To create tables, users require the CREATE TABLE and USE SCHEMA permissions on the schema in addition to the USE CATALOG permission on the catalog. All users receive the USE CATALOG privilege on the main catalog and the USE SCHEMA privilege on the main.default schema when a metastore is created.

    Metastore admins and the owner of the main.default schema can use the following command to grant the CREATE TABLE privilege to a user or group:

    SQL

    GRANT CREATE TABLE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`;
    

    Python

    spark.sql("GRANT CREATE TABLE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`")
    

    R

    library(SparkR)
    
    sql("GRANT CREATE TABLE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`")
    

    Scala

    spark.sql("GRANT CREATE TABLE ON SCHEMA <catalog-name>.<schema-name> TO `<EMAIL_ADDRESS>`")
    

    For example, to allow members of the group data-consumers to create tables in main.default:

    SQL

    GRANT CREATE TABLE ON SCHEMA main.default TO `data-consumers`;
    

    Python

    spark.sql("GRANT CREATE TABLE ON SCHEMA main.default TO `data-consumers`")
    

    R

    library(SparkR)
    
    sql("GRANT CREATE TABLE ON SCHEMA main.default TO `data-consumers`")
    

    Scala

    spark.sql("GRANT CREATE TABLE ON SCHEMA main.default TO `data-consumers`")
    

    Run the cell.

  3. Create a new table called department.

    Add a new cell to the notebook. Paste in the following code, which creates the table, defines its columns, and inserts five rows into it.

    SQL

    CREATE TABLE main.default.department
    (
      deptcode   INT,
      deptname  STRING,
      location  STRING
    );
    
    INSERT INTO main.default.department VALUES
      (10, 'FINANCE', 'EDINBURGH'),
      (20, 'SOFTWARE', 'PADDINGTON'),
      (30, 'SALES', 'MAIDSTONE'),
      (40, 'MARKETING', 'DARLINGTON'),
      (50, 'ADMIN', 'BIRMINGHAM');
    

    Python

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    schema = StructType([
      StructField("deptcode", IntegerType(), True),
      StructField("deptname", StringType(), True),
      StructField("location", StringType(), True)
    ])
    
    spark.catalog.createTable(
      tableName = "main.default.department",
      schema = schema
    )
    
    dfInsert = spark.createDataFrame(
      data = [
        (10, "FINANCE", "EDINBURGH"),
        (20, "SOFTWARE", "PADDINGTON"),
        (30, "SALES", "MAIDSTONE"),
        (40, "MARKETING", "DARLINGTON"),
        (50, "ADMIN", "BIRMINGHAM")
      ],
      schema = schema
    )
    
    dfInsert.write.saveAsTable(
      name = "main.default.department",
      mode = "append"
    )
    

    R

    library(SparkR)
    
    schema = structType(
      structField("deptcode", "integer", TRUE),
      structField("deptname", "string", TRUE),
      structField("location", "string", TRUE)
    )
    
    df = createDataFrame(
      data = list(),
      schema = schema
    )
    
    saveAsTable(
      df = df,
      tableName = "main.default.department"
    )
    
    data = list(
      list("deptcode" = 10L, "deptname" = "FINANCE", "location" = "EDINBURGH"),
      list("deptcode" = 20L, "deptname" = "SOFTWARE", "location" = "PADDINGTON"),
      list("deptcode" = 30L, "deptname" = "SALES", "location" = "MAIDSTONE"),
      list("deptcode" = 40L, "deptname" = "MARKETING", "location" = "DARLINGTON"),
      list("deptcode" = 50L, "deptname" = "ADMIN", "location" = "BIRMINGHAM")
    )
    
    dfInsert = createDataFrame(
      data = data,
      schema = schema
    )
    
    insertInto(
      x = dfInsert,
      tableName = "main.default.department"
    )
    

    Scala

    import spark.implicits._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.StructType
    
    val df = spark.createDataFrame(
      new java.util.ArrayList[Row](),
      new StructType()
        .add("deptcode", "int")
        .add("deptname", "string")
        .add("location", "string")
    )
    
    df.write
      .format("delta")
      .saveAsTable("main.default.department")
    
    val dfInsert = Seq(
      (10, "FINANCE", "EDINBURGH"),
      (20, "SOFTWARE", "PADDINGTON"),
      (30, "SALES", "MAIDSTONE"),
      (40, "MARKETING", "DARLINGTON"),
      (50, "ADMIN", "BIRMINGHAM")
    ).toDF("deptcode", "deptname", "location")
    
    dfInsert.write.insertInto("main.default.department")
    

    Run the cell.

  4. Query the table.

    Add a new cell to the notebook. Paste in the following code, then run the cell.

    SQL

    SELECT * FROM main.default.department;
    

    Python

    display(spark.table("main.default.department"))
    

    R

    display(tableToDF("main.default.department"))
    

    Scala

    display(spark.table("main.default.department"))
    
  5. Grant the ability to read and query the table to the data-consumers group that you created in Add users and groups.

    Add a new cell to the notebook and paste in the following code:

    SQL

    GRANT SELECT ON main.default.department TO `data-consumers`;
    

    Python

    spark.sql("GRANT SELECT ON main.default.department TO `data-consumers`")
    

    R

    sql("GRANT SELECT ON main.default.department TO `data-consumers`")
    

    Scala

    spark.sql("GRANT SELECT ON main.default.department TO `data-consumers`")
    

    Note

    To grant read access to all account-level users rather than only data-consumers, use the group name account users (see the example after this step).

    Run the cell.
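
    For example, to extend read access to every account-level user:

    SQL

    GRANT SELECT ON main.default.department TO `account users`;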

Shortcut: use an example notebook to create a catalog, schema, and table

You can use the following example notebook to create a catalog, schema, and table, as well as manage permissions on each.

Create and manage a Unity Catalog table with SQL

Get notebook

Create and manage a Unity Catalog table with Python

Get notebook

A key benefit of Unity Catalog is the ability to share a single metastore among multiple workspaces located in the same region. You can run different types of workloads against the same data without moving or copying data between workspaces. Each workspace can have only one Unity Catalog metastore assigned to it.

To learn how to link the metastore to additional workspaces, see Enable a workspace for Unity Catalog.

You can manage user access to Azure Databricks by setting up provisioning from Azure Active Directory. For complete instructions, see Sync users and groups from Azure Active Directory.

To transfer the metastore admin role to a group, see Assign a metastore admin.

(Optional) Install the Unity Catalog CLI

The Unity Catalog CLI is part of the Databricks CLI. See Databricks CLI setup & documentation. To use the Unity Catalog CLI, do the following:

  1. Set up the CLI.
  2. Set up authentication.
  3. Optionally, create one or more connection profiles to use with the CLI.
  4. Learn how to use the Databricks CLI in general.
  5. Begin using the Unity Catalog CLI.

Next steps