Monitor and manage Delta Sharing egress costs (for providers)
This article describes tools that you can use to monitor and manage cloud vendor egress costs when you share data and AI assets using Delta Sharing.
Unlike other data sharing platforms, Delta Sharing does not require data replication. This model has many advantages, but it means that your cloud vendor may charge data egress fees when you share data across clouds or regions. If you use Delta Sharing to share data and AI assets within a region, you incur no egress cost.
To monitor and manage egress charges, Databricks provides:
- Instructions for replicating data between regions to avoid egress fees.
- Support for Cloudflare R2 storage to avoid egress fees.
Replicate data to avoid egress costs
One approach to avoiding egress costs is for the provider to create and sync local replicas of shared data in regions that their recipients are using. Another approach is for recipients to clone the shared data to local regions for active querying, setting up syncs between the shared table and the local clone. This section discusses a number of replication patterns.
Use Delta deep clone for incremental replication
Providers can use DEEP CLONE to replicate Delta tables to external locations across the regions that they share to. Deep clones copy the source table data and metadata to the clone target. Deep clones also enable incremental updates by identifying new data in the source table and refreshing the target accordingly.
CREATE TABLE [IF NOT EXISTS] table_name DEEP CLONE source_table_name
[TBLPROPERTIES clause] [LOCATION path];
You can schedule a Databricks job to refresh target table data incrementally with recent updates in the shared table, using the following command:
CREATE OR REPLACE TABLE table_name DEEP CLONE source_table_name;
See Clone a table on Azure Databricks and Schedule and orchestrate workflows.
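To check that a scheduled refresh actually ran, one option is to inspect the history of the clone target. The following is a minimal sketch; the table name is hypothetical and stands in for your own clone target.

```sql
-- Hypothetical table name; replace with your clone target.
-- Each refresh should appear as a recent entry in the table history.
DESCRIBE HISTORY my_catalog.my_schema.my_table_clone LIMIT 5;
```

The operation and operationMetrics columns of the result indicate what each refresh did, which can help confirm that only new data is copied on each run.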
Enable change data feed (CDF) on shared tables for incremental replication
When a table is shared with its CDF, the recipient can access the changes and merge them into a local copy of the table, where users perform queries. In this scenario, recipient access to the data does not cross region boundaries, and egress is limited to refreshing a local copy. If the recipient is on Databricks, they can use a Databricks workflow job to propagate changes to a local replica.
To share a table with CDF, you must enable CDF on the table and share it WITH HISTORY.
For more information about using CDF, see Use Delta Lake change data feed on Azure Databricks and Add tables to a share.
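The steps above can be sketched as follows. All catalog, schema, table, and share names are hypothetical, and the example assumes that the share already exists and that the recipient is on Databricks and has mounted the share as a catalog named shared_catalog.

```sql
-- Provider: enable change data feed on the source table (hypothetical names).
ALTER TABLE my_catalog.my_schema.my_table
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Provider: add the table to an existing share, including its history.
ALTER SHARE my_share
  ADD TABLE my_catalog.my_schema.my_table WITH HISTORY;

-- Recipient: read the changes recorded from table version 1 onward.
-- The starting version here is illustrative only.
SELECT * FROM table_changes('shared_catalog.my_schema.my_table', 1);
```

Change data is recorded only from the point at which CDF is enabled, so the starting version that the recipient queries should not be earlier than that. The recipient can then merge the returned rows into a local copy, for example from a scheduled job.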
Use Cloudflare R2 replicas or migrate storage to R2
Cloudflare R2 object storage incurs no egress fees. Replicating or migrating data that you share to R2 enables you to share data using Delta Sharing without incurring egress fees. This section describes how to replicate data to an R2 location and enable incremental updates from source tables.
Requirements
- Databricks workspace enabled for Unity Catalog.
- Databricks Runtime 14.3 or above, or SQL warehouse 2024.15 or above.
- Cloudflare account. See https://dash.cloudflare.com/sign-up.
- Cloudflare R2 Admin role. See the Cloudflare roles documentation.
- CREATE STORAGE CREDENTIAL privilege on the Unity Catalog metastore attached to the workspace. Account admins and metastore admins have this privilege by default.
- CREATE EXTERNAL LOCATION privilege on both the metastore and the storage credential referenced in the external location. Metastore admins have this privilege by default.
- CREATE MANAGED STORAGE privilege on the external location.
- CREATE CATALOG privilege on the metastore. Metastore admins have this privilege by default.
Mount an R2 bucket as an external location in Azure Databricks
Create a Cloudflare R2 bucket.
Create a storage credential in Unity Catalog that gives access to the R2 bucket.
Use the storage credential to create an external location in Unity Catalog.
See Create an external location to connect cloud storage to Azure Databricks.
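As a rough sketch of the last step, the external location can be created in SQL once the storage credential exists. The names r2_location, r2_credential, and the data-engineers group are hypothetical; the bucket URL reuses the example path from this article.

```sql
-- Assumes a storage credential named r2_credential (hypothetical) already exists
-- and grants access to the R2 bucket.
CREATE EXTERNAL LOCATION IF NOT EXISTS r2_location
  URL 'r2://mybucket@my-account-id.r2.cloudflarestorage.com'
  WITH (STORAGE CREDENTIAL r2_credential)
  COMMENT 'R2 bucket used for Delta Sharing replicas';

-- Optionally allow a group (hypothetical) to use the location as managed storage.
GRANT CREATE MANAGED STORAGE ON EXTERNAL LOCATION r2_location TO `data-engineers`;
```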
Create a new catalog using the external location
Create a catalog that uses the new external location as its managed storage location.
See Create catalogs.
When you create the catalog, do the following:
Catalog Explorer
- Select a Standard catalog type.
- Under Storage location, select Select a storage location and enter the path to the R2 bucket you defined as an external location. For example,
r2://mybucket@my-account-id.r2.cloudflarestorage.com
SQL
Use the path to the R2 bucket you defined as an external location. For example:
CREATE CATALOG IF NOT EXISTS `my-r2-catalog`
MANAGED LOCATION 'r2://mybucket@my-account-id.r2.cloudflarestorage.com'
COMMENT 'Location for managed tables and volumes to share using Delta Sharing';
Clone the data you want to share to a table in the new catalog
Use DEEP CLONE to replicate tables in Azure Data Lake Storage Gen2 to the new catalog that uses R2 for managed storage. Deep clones copy the source table data and metadata to the clone target. Deep clones also enable incremental updates by identifying new data in the source table and refreshing the target accordingly.
CREATE TABLE IF NOT EXISTS new_catalog.schema1.new_table DEEP CLONE old_catalog.schema1.source_table
LOCATION 'r2://mybucket@my-account-id.r2.cloudflarestorage.com';
You can schedule a Databricks job to refresh target table data incrementally with recent updates in the source table, using the following command:
CREATE OR REPLACE TABLE new_catalog.schema1.new_table DEEP CLONE old_catalog.schema1.source_table;
See Clone a table on Azure Databricks and Schedule and orchestrate workflows.
Share the new table
When you create the share, add the tables that are in the new catalog, stored in R2. The process is the same as adding any table to a share.
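For illustration, sharing the cloned table might look like the following. The share name r2_share and the recipient name my_recipient are hypothetical, and the example assumes that the recipient has already been created.

```sql
-- Create a share for the R2-backed tables (hypothetical share name).
CREATE SHARE IF NOT EXISTS r2_share
  COMMENT 'Tables replicated to Cloudflare R2';

-- Add the cloned table from the catalog that uses R2 for managed storage.
ALTER SHARE r2_share ADD TABLE new_catalog.schema1.new_table;

-- Give an existing recipient (hypothetical name) read access to the share.
GRANT SELECT ON SHARE r2_share TO RECIPIENT my_recipient;
```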