Batch Cross-Tenant Azure SQL Database Synchronization Using Azure Databricks

Question

Batch Cross-Tenant Azure SQL Database Synchronization Using Azure Databricks

Mallikarjun appani 6

We would like to confirm whether the following approaches are supported and recommended:

Timestamp/Watermark-based incremental loads using columns such as LastModifiedDate
ID-based incremental loads using increasing primary key values

Our planned approach is:

Read incremental data from the source Azure SQL Database
Process the data in Azure Databricks
Perform batch upsert/merge operations into the target Azure SQL Database across tenants

Could you please confirm:

Whether this architecture is supported for cross-tenant scenarios
Recommended authentication approach using Service Principals/Managed Identity
Best practices for secure connectivity (Private Endpoints/VNet Injection)
Any limitations or recommended patterns for scalable incremental synchronization

Additionally, we would like guidance on the preferred approach for handling updates and deletes efficiently in this architecture.We would like to confirm whether the following approaches are supported and recommended:

Timestamp/Watermark-based incremental loads using columns such as LastModifiedDate
ID-based incremental loads using increasing primary key values

Our planned approach is:

Read incremental data from the source Azure SQL Database
Process the data in Azure Databricks
Perform batch upsert/merge operations into the target Azure SQL Database across tenants

Could you please confirm:

Whether this architecture is supported for cross-tenant scenarios
Recommended authentication approach using Service Principals/Managed Identity
Best practices for secure connectivity (Private Endpoints/VNet Injection)
Any limitations or recommended patterns for scalable incremental synchronization

Additionally, we would like guidance on the preferred approach for handling updates and deletes efficiently in this architecture.

0 comments

2 answers

Your answer

Answer 1

kagiyama yutaka 2,910

I think this cross‑tenant setup is supported when u use an AAD service principal and private endpoints, and read only rows newer than a stored timestamp or ID before applying a small idempotent MERGE for updates and deletes. Keeping that watermark in a control table is the recommended pattern for scalable incremental sync across tenants.

0 comments

Answer 2

Timestamp/watermark-based and ID-based incremental patterns are supported and explicitly recommended in Microsoft data architectures, including cross-tenant scenarios, when implemented with proper change detection, watermarking, and MERGE patterns.

Key points from the provided context that apply to the scenario:

Supported incremental patterns
- Watermark-based incremental loading is a standard pattern for moving data between systems and layers:
  - Uses a monotonically increasing column such as LastModifiedDate or similar to track the last processed value.
  - Stores the last processed watermark in a control/metadata table and only processes rows greater than that value in subsequent runs.
  - This pattern is used both for source-to-bronze and bronze-to-silver transitions and is explicitly called out as a best practice.
- ID-based incremental loads using increasing primary key values are also supported as a watermark-based pattern when the integer ID is monotonically increasing and comparable.
- Watermarking is called out as “an important concept in incremental copy processes” and is combined with MERGE and control tables for safe restarts and operational resilience.
Cross-tenant architecture support
- Cross-tenant architectures that consolidate analytics and data processing in a target tenant are supported. The architecture for migrating workloads across security tenants explicitly includes:
  - An analytics platform (Microsoft Fabric in the context) that consolidates data pipelines and compute in the target tenant.
  - PaaS compute (such as Azure Databricks) recreated in the target tenant.
  - Data services (such as Azure SQL and Azure Cosmos DB) backed up/restored or synchronized across tenants.
- This validates the general pattern of:
  1. Reading from a source data service in one tenant.
  2. Processing in a centralized compute platform (Databricks/Fabric) in another tenant.
  3. Writing into a target data service in the target tenant.
Authentication and identity patterns
- For operational data stores and incremental ingestion, best practices emphasize:
  - Use of service identities (service principals or managed identities) with clear control tables and idempotent design.
  - Git-based CI/CD and DevOps practices for managing pipelines and SQL changes.
- While the context does not spell out a specific cross-tenant auth configuration, the pattern assumes:
  - A non-interactive identity (service principal or managed identity) that has been granted access to the source and target databases.
  - Use of this identity by the data pipelines / compute engine to perform incremental reads and MERGE operations.
Secure connectivity and network patterns
- The cross-tenant migration architecture includes:
  - PaaS data services (Azure SQL, Azure Cosmos DB) and PaaS compute (Azure Databricks, Azure Machine Learning) as part of the solution.
  - Azure Key Vault for secure secret storage.
  - Azure Event Hubs and other services for real-time or streaming scenarios.
- This implies the recommended secure connectivity patterns:
  - Use of private connectivity (for example, Private Endpoints) to data services.
  - Use of secure secret storage (Key Vault) for connection strings and credentials.
  - Network isolation and controlled access paths between compute and data services.
Patterns and limitations for scalable incremental synchronization
- Best practices for an operational data store and incremental ingestion include:
  - Batch and incremental loads orchestrated through pipelines with CDC-enabled connectors or watermarking.
  - Watermarking and retry logic to ensure consistency.
  - Idempotent design combining change detection, watermarking, T-SQL MERGE, and control tables.
  - Performing heavy transformations upstream (for example, in Spark/Databricks) and reserving SQL for final MERGE operations.
- For large-scale incremental processing in medallion architectures:
  - Change Data Feed (CDF) is used between layers to process only inserts/updates/deletes since the last commit version.
  - Commit version tracking tables (for example, ChangeDataFeedState) are used to track the last processed version.
  - Idempotent MERGE patterns are used to apply changes to target tables.
- These patterns translate directly to scalable cross-tenant synchronization:
  - Maintain per-table watermark/commit state in a control table.
  - Process only deltas in each batch.
  - Use MERGE with insert/update/delete logic to keep the target in sync.
Handling updates and deletes efficiently
- For updates:
  - Use watermark-based or CDC-based detection of changed rows.
  - Apply changes via idempotent MERGE operations in the target SQL database.
- For deletes:
  - When using CDC or CDF, delete operations are explicitly captured and can be applied downstream.
  - When using pure watermark-based incremental loads from a source that does not expose deletes, deletes are not detected by watermark alone; in such cases:
    - Either rely on CDC/Change Data Feed where available, or
    - Use periodic full compare-and-sync for specific dimensions or tables where deletes must be reconciled.
- The documented patterns for Dynamics 365 and Delta CDF show:
  - Full compare-and-sync for certain dimension tables to handle deletes.
  - MERGE operations with INSERT/UPDATE/DELETE logic when change feeds include delete events.

Summary of how this maps to the proposed architecture:

Timestamp/Watermark-based incremental loads using LastModifiedDate and similar columns are supported and recommended.
ID-based incremental loads using increasing primary key values are supported as a watermark pattern when the ID is monotonically increasing and comparable.
Cross-tenant architecture with Databricks (or similar compute) reading from a source Azure SQL Database and writing to a target Azure SQL Database is supported and aligns with the cross-tenant migration architecture.
Authentication should use non-interactive identities (service principals or managed identities) with appropriate permissions on both source and target databases, combined with secure secret storage (for example, Key Vault).
Secure connectivity should use private connectivity to data services and controlled network paths between compute and data services, as in the cross-tenant architecture.
Scalable incremental synchronization should follow watermarking/CDC, control tables for last processed values, idempotent MERGE patterns, and upstream transformations in Databricks.
Updates and deletes should be handled via CDC/CDF where available, or via MERGE with delete logic and, where necessary, periodic full compare-and-sync for specific tables.

References:

Share via

Batch Cross-Tenant Azure SQL Database Synchronization Using Azure Databricks

2 answers

Your answer