Monitor and manage Delta Sharing egress costs (for providers)
This article describes tools that you can use to monitor and manage cloud vendor egress costs when you share data and AI assets using Delta Sharing.
Unlike other data sharing platforms, Delta Sharing does not require data replication. This model has many advantages, but it means that your cloud vendor may charge data egress fees when you share data across clouds or regions. If you use Delta Sharing to share data and AI assets within a region, you incur no egress cost.
Replicate data to avoid egress costs
One approach to avoiding egress costs is for the provider to create and sync local replicas of shared data in regions that their recipients are using. Another approach is for recipients to clone the shared data to local regions for active querying, setting up syncs between the shared table and the local clone. This section discusses a number of replication patterns.
Use Delta deep clone for incremental replication
Providers can use DEEP CLONE to replicate Delta tables to external locations across the regions that they share to. Deep clones copy the source table data and metadata to the clone target. Deep clones also enable incremental updates by identifying new data in the source table and refreshing the target accordingly.
SQL
CREATE TABLE [IF NOT EXISTS] table_name DEEP CLONE source_table_name
  [TBLPROPERTIES clause] [LOCATION path];
You can schedule a Databricks job to refresh target table data incrementally with recent updates in the shared table, using the following command:
SQL
CREATE OR REPLACE TABLE table_name DEEP CLONE source_table_name;
Enable change data feed (CDF) on shared tables for incremental replication
When a table is shared with its CDF, the recipient can access the changes and merge them into a local copy of the table, where users perform queries. In this scenario, recipient access to the data does not cross region boundaries, and egress is limited to refreshing a local copy. If the recipient is on Databricks, they can use a Databricks workflow job to propagate changes to a local replica.
To share a table with CDF, you must enable CDF on the table and share it WITH HISTORY.
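As a minimal sketch of this pattern, assume a provider table main.default.sales with columns id and amount, a share named my_share, and a recipient replica local_catalog.default.sales_replica (all hypothetical names). The provider-side setup and a recipient-side merge might look like this:

SQL
-- Provider: enable change data feed on the table, then share it with history.
ALTER TABLE main.default.sales
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
ALTER SHARE my_share ADD TABLE main.default.sales WITH HISTORY;

-- Recipient: merge changes since a known version into the local replica.
-- 5 is a placeholder starting version; a real job would track the
-- last-applied version between runs.
MERGE INTO local_catalog.default.sales_replica AS t
USING (
  SELECT id, amount, _change_type
  FROM table_changes('shared_catalog.default.sales', 5)
  WHERE _change_type != 'update_preimage'
) AS s
ON t.id = s.id
WHEN MATCHED AND s._change_type = 'delete' THEN DELETE
WHEN MATCHED AND s._change_type = 'update_postimage' THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED AND s._change_type = 'insert' THEN INSERT (id, amount) VALUES (s.id, s.amount);

This is a sketch, not a production job: if a key changes more than once in the version range, the change feed contains multiple rows per key and the job must deduplicate them (for example, keeping only the latest change per key) before merging.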
Use Cloudflare R2 replicas or migrate storage to R2
Cloudflare R2 object storage incurs no egress fees. Replicating or migrating data that you share to R2 enables you to share data using Delta Sharing without incurring egress fees. This section describes how to replicate data to an R2 location and enable incremental updates from source tables.
Requirements
Databricks workspace enabled for Unity Catalog.
Databricks Runtime 14.3 or above, or SQL warehouse 2024.15 or above.
CREATE STORAGE CREDENTIAL privilege on the Unity Catalog metastore attached to the workspace. Account admins and metastore admins have this privilege by default.
CREATE EXTERNAL LOCATION privilege on both the metastore and the storage credential referenced in the external location. Metastore admins have this privilege by default.
CREATE MANAGED STORAGE privilege on the external location.
CREATE CATALOG on the metastore. Metastore admins have this privilege by default.
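Assuming a metastore admin grants these privileges to a hypothetical user user@example.com, with a storage credential named r2_credential and an external location named r2_location (both placeholder names), the grants could be sketched as:

SQL
-- Hypothetical grants covering the requirements listed above.
GRANT CREATE STORAGE CREDENTIAL ON METASTORE TO `user@example.com`;
GRANT CREATE EXTERNAL LOCATION ON METASTORE TO `user@example.com`;
GRANT CREATE EXTERNAL LOCATION ON STORAGE CREDENTIAL r2_credential TO `user@example.com`;
GRANT CREATE MANAGED STORAGE ON EXTERNAL LOCATION r2_location TO `user@example.com`;
GRANT CREATE CATALOG ON METASTORE TO `user@example.com`;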
Limitations for Cloudflare R2
Providers can’t share R2 tables that use liquid clustering and V2 checkpoints.
Create a catalog that uses the R2 bucket as its managed storage location
In Catalog Explorer, when you create the catalog, under Storage location, select Select a storage location and enter the path to the R2 bucket that you defined as an external location. For example: r2://mybucket@my-account-id.r2.cloudflarestorage.com
Alternatively, create the catalog using SQL, specifying the path to the R2 bucket that you defined as an external location. For example:
SQL
CREATE CATALOG IF NOT EXISTS `my-r2-catalog`
  MANAGED LOCATION 'r2://mybucket@my-account-id.r2.cloudflarestorage.com'
  COMMENT 'Location for managed tables and volumes to share using Delta Sharing';
Clone the data you want to share to a table in the new catalog
Use DEEP CLONE to replicate tables in Azure Data Lake Storage Gen2 to the new catalog that uses R2 for managed storage. Deep clones copy the source table data and metadata to the clone target. Deep clones also enable incremental updates by identifying new data in the source table and refreshing the target accordingly.
SQL
CREATE TABLE IF NOT EXISTS new_catalog.schema1.new_table DEEP CLONE old_catalog.schema1.source_table
  LOCATION 'r2://mybucket@my-account-id.r2.cloudflarestorage.com';
You can schedule a Databricks job to refresh target table data incrementally with recent updates in the source table, using the following command:
SQL
CREATE OR REPLACE TABLE new_catalog.schema1.new_table DEEP CLONE old_catalog.schema1.source_table;