Read data shared using Databricks-to-Databricks Delta Sharing (for recipients) - Azure Databricks

Článok
11/14/2024

This article describes how to read data that has been shared with you using the Databricks-to-Databricks Delta Sharing protocol, in which Databricks manages a secure connection for data sharing. Unlike the Delta Sharing open sharing protocol, the Databricks-to-Databricks protocol does not require a credential file (token-based security).

Databricks-to-Databricks sharing requires that you, as a recipient, have access to a Databricks workspace that is enabled for Unity Catalog.

If you do not have a Databricks workspace that is enabled for Unity Catalog, then data must be shared with you using the Delta Sharing open sharing protocol, and this article doesn’t apply to you. See Read data shared using Delta Sharing open sharing (for recipients).

How do I make shared data available to my team?

To read data and notebooks that have been shared with you using the Databricks-to-Databricks protocol, you must be a user on a Databricks workspace that is enabled for Unity Catalog. A member of your team provides the data provider with a unique identifier for your Unity Catalog metastore, and the data provider uses that identifier to create a secure sharing connection with your organization. The shared data then becomes available for read access in your workspace, and any updates that the data provider makes to the shared tables, views, volumes, and partitions are reflected in your workspace in near real time.

Note

Updates to shared data tables, views, and volumes appear in your workspace in near real time. However, column changes (adding, renaming, deleting) may not appear in Catalog Explorer for up to one minute. Likewise, new shares and updates to shares (such as adding new tables to a share) are cached for one minute before they are available for you to view and query.

To read data that has been shared with you:

A user on your team finds the share—the container for the tables, views, volumes, and notebooks that have been shared with you—and uses that share to create a catalog—the top-level container for all data in Databricks Unity Catalog.
A user on your team grants or denies access to the catalog and the objects inside the catalog (schemas, tables, views, and volumes) to other members of your team.
You read the data in the tables, views, and volumes that you have been granted access to just like any other data asset in Databricks that you have read-only (SELECT or READ VOLUME) access to.
You preview and clone notebooks in the share, as long as you have the USE CATALOG privilege on the catalog.

Permissions required

To be able to list and view details about all providers and provider shares, you must be a metastore admin or have the USE PROVIDER privilege. Other users have access only to the providers and shares that they own.

To create a catalog from a provider share, you must be a metastore admin, a user who has both the CREATE_CATALOG and USE PROVIDER privileges for your Unity Catalog metastore, or a user who has both the CREATE_CATALOG privilege and ownership of the provider object.

The ability to grant read-only access to the schemas (databases), tables, views, and volumes in the catalog created from the share follows the typical Unity Catalog privilege hierarchy. The ability to view notebooks in the catalog created from the share requires the USE CATALOG privilege on the catalog. See Manage permissions for the schemas, tables, and volumes in a Delta Sharing catalog.

View providers and shares

To start reading the data that has been shared with you by a data provider, you need to know the name of the provider and share objects that are stored in your Unity Catalog metastore once the provider has shared data with you.

The provider object represents the Unity Catalog metastore, cloud platform, and region of the organization that shared the data with you.

The share object represents the tables, volumes, and views that the provider has shared with you.

View all providers who have shared data with you

To view a list of available data providers, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the SHOW PROVIDERS SQL command in an Azure Databricks notebook or the Databricks SQL query editor.

Permissions required: You must be a metastore admin or have the USE PROVIDER privilege. Other users have access only to the providers and provider shares that they own.

For details, see View providers.

View provider details

To view details about a provider, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the DESCRIBE PROVIDER SQL command in an Azure Databricks notebook or the Databricks SQL query editor.

Permissions required: You must be a metastore admin, have the USE PROVIDER privilege, or own the provider object.

For details, see View provider details.

View shares

To view the shares that a provider has shared with you, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or the SHOW SHARES IN PROVIDER SQL command in an Azure Databricks notebook or the Databricks SQL query editor.

Permissions required: You must be a metastore admin, have the USE PROVIDER privilege, or own the provider object.

For details, see View shares that a provider has shared with you.

Access data in a shared table or volume

To read data in a shared table or volume:

A privileged user must create a catalog from the share that contains the table or volume. This can be a metastore admin, a user who has both the CREATE_CATALOG and USE PROVIDER privileges for your Unity Catalog metastore, or a user who has both the CREATE_CATALOG privilege and ownership of the provider object.
That user or a user with the same privileges must grant you access to the shared table or volume.
You can access the table or volume just as you would any other data asset registered in your Unity Catalog metastore.

To make the data in a share accessible to your team, you must create a catalog from the share. To create a catalog from a share, you can use Catalog Explorer, the Databricks Unity Catalog CLI, or SQL commands in an Azure Databricks notebook or the Databricks SQL query editor.

Permissions required: A metastore admin, a user who has both the CREATE_CATALOG and USE PROVIDER privileges for your Unity Catalog metastore, or a user who has both the CREATE_CATALOG privilege and ownership of the provider object.

Note

If the share includes views, you must use a catalog name that is different than the name of the catalog that contains the view in the provider’s metastore.

Catalog Explorer

In your Azure Databricks workspace, click Catalog to open Catalog Explorer.
At the top of the Catalog pane, click the gear icon and select Delta Sharing.

Alternatively, from the Quick access page, click the Delta Sharing > button.
On the Shared with me tab, find and select the provider.
On the Shares tab, find the share and click Create catalog on the share row.
Enter a name for the catalog and optional comment.
Click Create.

Alternatively, when you open Catalog Explorer, you can click Create Catalog in the upper right to create a shared catalog. See Create catalogs.

SQL

Run the following command in a notebook or the Databricks SQL query editor.

CREATE CATALOG [IF NOT EXISTS] <catalog-name>
USING SHARE <provider-name>.<share-name>;

CLI

databricks catalogs create <catalog-name> /
--provider-name <provider-name> /
--share-name <share-name>

The catalog created from a share has a catalog type of Delta Sharing. You can view the type on the catalog details page in Catalog Explorer or by running the DESCRIBE CATALOG SQL command in a notebook or Databricks SQL query. All shared catalogs are listed under Catalog > Shared in the Catalog Explorer left pane.

A Delta Sharing catalog can be managed in the same way as regular catalogs on a Unity Catalog metastore. You can view, update, and delete a Delta Sharing catalog using Catalog Explorer, the Databricks CLI, and by using SHOW CATALOGS, DESCRIBE CATALOG, ALTER CATALOG, and DROP CATALOG SQL commands.

The 3-level namespace structure under a Delta Sharing catalog created from a share is the same as the one under a regular catalog on Unity Catalog: catalog.schema.table or catalog.schema.volume.

Table and volume data under a shared catalog is read-only, which means you can perform read operations like:

DESCRIBE, SHOW, and SELECT for tables.
DESCRIBE VOLUME, LIST <volume-path>, SELECT * FROM <format>.'<volume_path>', and COPY INTO for volumes.

Notebooks in a shared catalog can be previewed and cloned by any user with USE CATALOG on the catalog.

Models in a shared catalog can be read and loaded for inference by any user with the following privileges: EXECUTE privilege on the registered model, plus USE SCHEMA and USE CATALOG privileges on the schema and catalog containing the model.

By default, the catalog creator is the owner of all data objects under a Delta Sharing catalog and can manage permissions for any of them.

Privileges are inherited downward, although some workspaces may still be on the legacy security model that did not provide inheritance. See Inheritance model. Any user granted the SELECT privilege on the catalog will have the SELECT privilege on all of the schemas and tables in the catalog unless that privilege is revoked. Likewise, any user granted the READ VOLUME privilege on the catalog will have the READ VOLUME privilege on all of the volumes in the catalog unless that privilege is revoked. You cannot grant privileges that give write or update access to a Delta Sharing catalog or objects in a Delta Sharing catalog.

The catalog owner can delegate the ownership of data objects to other users or groups, thereby granting those users the ability to manage the object permissions and life cycles.

For detailed information about managing privileges on data objects using Unity Catalog, see Manage privileges in Unity Catalog.

Read data in a shared table

You can read data in a shared table using any of the tools available to you as an Azure Databricks user: Catalog Explorer, notebooks, SQL queries, the Databricks CLI, and Databricks REST APIs. You must have the SELECT privilege on the table.

Read data in a shared volume

You can read data in a shared volume using any of the tools available to you as an Azure Databricks user: Catalog Explorer, notebooks, SQL queries, the Databricks CLI, and Databricks REST APIs. You must have the READ VOLUME privilege on the volume.

Load a shared model for inference

For details on loading a shared model and using it for batch inference, see Load model version by alias for inference workloads.

Query a table’s history data

If history is shared along with the table, you can query the table data as of a version or timestamp. Requires Databricks Runtime 12.2 LTS or above.

For example:

SELECT * FROM vaccine.vaccine_us.vaccine_us_distribution VERSION AS OF 3;
SELECT * FROM vaccine.vaccine_us.vaccine_us_distribution TIMESTAMP AS OF "2023-01-01 00:00:00";

In addition, if the change data feed (CDF) is enabled with the table, you can query the CDF. Both version and timestamp are supported:

SELECT * FROM table_changes('vaccine.vaccine_us.vaccine_us_distribution', 0, 3);
SELECT * FROM table_changes('vaccine.vaccine_us.vaccine_us_distribution', "2023-01-01 00:00:00", "2022-02-01 00:00:00");

For more information about change data feed, see Use Delta Lake change data feed on Azure Databricks.

Query a table using Apache Spark Structured Streaming

If a table is shared with history, you can use it as the source for Spark Structured Streaming. Requires Databricks Runtime 12.2 LTS or above.

Supported options:

ignoreDeletes: Ignore transactions that delete data.
ignoreChanges: Re-process updates if files were rewritten in the source table due to a data changing operation such as UPDATE, MERGE INTO, DELETE (within partitions), or OVERWRITE. Unchanged rows can still be emitted. Therefore your downstream consumers should be able to handle duplicates. Deletes are not propagated downstream. ignoreChanges subsumes ignoreDeletes. Therefore, if you use ignoreChanges, your stream will not be disrupted by either deletions or updates to the source table.
startingVersion: The shared table version to start from. All table changes starting from this version (inclusive) will be read by the streaming source.
startingTimestamp: The timestamp to start from. All table changes committed at or after the timestamp (inclusive) will be read by the streaming source. Example: "2023-01-01 00:00:00.0"
maxFilesPerTrigger: The number of new files to be considered in every micro-batch.
maxBytesPerTrigger: The amount of data that gets processed in each micro-batch. This option sets a “soft max”, meaning that a batch processes approximately this amount of data and might process more than the limit in order to make the streaming query move forward in cases when the smallest input unit is larger than this limit.
readChangeFeed: Stream read the change data feed of the shared table.

Unsupported options:

Trigger.availableNow

Sample Structured Streaming queries

Scala

spark.readStream.format("deltaSharing")
.option("startingVersion", 0)
.option("ignoreChanges", true)
.option("maxFilesPerTrigger", 10)
.table("vaccine.vaccine_us.vaccine_us_distribution")

Python

spark.readStream.format("deltaSharing")\
.option("startingVersion", 0)\
.option("ignoreDeletes", true)\
.option("maxBytesPerTrigger", 10000)\
.table("vaccine.vaccine_us.vaccine_us_distribution")

If change data feed (CDF) is enabled with the table, you can stream read the CDF.

spark.readStream.format("deltaSharing")
.option("readChangeFeed", "true")
.table("vaccine.vaccine_us.vaccine_us_distribution")

Read tables with deletion vectors or column mapping enabled

Important

This feature is in Public Preview.

Deletion vectors are a storage optimization feature that your provider can enable on shared Delta tables. See What are deletion vectors?.

Azure Databricks also supports column mapping for Delta tables. See Rename and drop columns with Delta Lake column mapping.

If your provider shared a table with deletion vectors or column mapping enabled, you can perform batch reads on the table using a SQL warehouse or a cluster running Databricks Runtime 14.1 or above. CDF and streaming queries require Databricks Runtime 14.2 or above.

You can perform batch queries as-is, because they can automatically resolve responseFormat based on the table features of the shared table.

To read a change data feed (CDF) or to perform streaming queries on shared tables with deletion vectors or column mapping enabled, you must set the additional option responseFormat=delta.

The following examples show batch, CDF, and streaming queries:

import org.apache.spark.sql.SparkSession

// Batch query
spark.read.format("deltaSharing").table(<tableName>)

// CDF query
spark.read.format("deltaSharing")
  .option("readChangeFeed", "true")
  .option("responseFormat", "delta")
  .option("startingVersion", 1)
  .table(<tableName>)

// Streaming query
spark.readStream.format("deltaSharing").option("responseFormat", "delta").table(<tableName>)

Read shared views

Important

This feature is in Public Preview.

Reading shared views is the same as reading shared tables, with these exceptions:

View sharing restrictions:

You cannot share views that reference shared tables or shared views.

Naming requirements:

The catalog name that you use for the shared catalog that contains the view cannot be the same as any provider catalog that contains a table referenced by the view. For example, if the shared view is contained in your test catalog, and one of the provider’s tables referenced in that view is contained in the provider’s test catalog, the query will result in a namespace conflict error. See Create a catalog from a share.

History and streaming:

You cannot query history or use a view as a streaming source.

View support in open sharing:

The instructions in this article focus on reading shared data using Azure Databricks user interfaces, specifically Unity Catalog syntax and interfaces. You can also query shared views using Apache Spark, Python, and BI tools like Tableau and Power BI.

Costs:

There are two potential sources of cost for view sharing:

Compute cost, charged by Databricks.
Storage and network transfer (egress) cost, charged by the storage vendor.

Compute cost is based on the type of recipient compute resource:

Recipient compute	Who pays?	SKU
Databricks Serverless	Recipient	Serverless SKU used by the recipient
Databricks Classic	Recipient	Interactive Serverless
Open Delta Sharing connectors	Provider	Interactive Serverless

Read shared notebooks

To preview and clone shared notebook files, you can use Catalog Explorer.

Permissions required: Catalog owner or user with the USE CATALOG privilege on the catalog created from the share.

In your Azure Databricks workspace, click Catalog.
In the left pane, expand the Catalog menu, find and select the catalog created from the share.
On the Other assets tab, you’ll see any shared notebook files.
Click the name of a shared notebook file to preview it.
(Optional) Click the Clone button to import the shared notebook file to your workspace.
1. On the Clone to dialog, optionally enter a New name, then select the workspace folder you want to clone the notebook file to.
2. Click Clone.
3. Once the notebook is cloned, a dialog pops up to let you know that it successfully cloned. Click reveal in the notebook editor on the dialog to view it in the notebook editor.
See Introduction to Databricks notebooks.

Zdieľať cez