Recommended approach to update reconciliation log tables during DB2-to-Azure SQL Hyperscale migration using ADF + Databricks

Janice Chi 160 Reputation points
2025-07-04T08:12:46.2533333+00:00

We are migrating data from IBM DB2 to Azure SQL Hyperscale using a combination of Azure Data Factory (ADF) and Azure Databricks. As part of our reconciliation framework, we calculate row counts and hash values within Databricks after each major stage (Extract, Transform, Load).
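For context, the per-stage metrics are computed along these lines in our notebooks (a simplified sketch; the xxhash64-plus-sum approach and the helper name below are illustrative rather than our exact logic):

```python
from pyspark.sql import functions as F

def reconciliation_metrics(df, columns):
    """Return the row count and an order-independent aggregate hash for one stage."""
    # Hash each row (cast to string so types/nulls hash consistently), then sum the
    # hashes as decimal so the aggregate is independent of row order and partitioning.
    hashed = df.withColumn(
        "_row_hash", F.xxhash64(*[F.col(c).cast("string") for c in columns])
    )
    result = hashed.agg(
        F.count(F.lit(1)).alias("row_count"),
        F.sum(F.col("_row_hash").cast("decimal(38,0)")).alias("hash_total"),
    ).first()
    return {"row_count": result["row_count"], "hash_total": int(result["hash_total"] or 0)}
```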

Environment:

• Source: IBM DB2 (on-prem)
• Destination: Azure SQL Hyperscale
• Processing: Azure Databricks (PySpark)
• Orchestration: Azure Data Factory (V2)

We also maintain a dedicated reconciliation log table in Azure SQL Hyperscale to store these metrics for validation and audit purposes.

We are considering two approaches to update this log table:

Option A: Write directly to the log table from Databricks using JDBC after each operation (a rough notebook-side sketch follows Option B below).

Option B: Pass the computed values to ADF and have ADF call a stored procedure in Hyperscale to update the log table.
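For concreteness, assuming metrics is the dictionary returned by the sketch above, Option A would look roughly like this in the notebook (the secret scope, connection details, table name, and column layout are placeholders, not our actual schema):

```python
import datetime

# Hypothetical Option A: append one metrics row to the Hyperscale log table over JDBC.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
sql_user = dbutils.secrets.get(scope="recon", key="sql-user")
sql_password = dbutils.secrets.get(scope="recon", key="sql-password")

log_row = [(
    "run-2025-07-04-001",                     # batch / pipeline run identifier
    "Transform",                              # stage: Extract, Transform, or Load
    metrics["row_count"],
    str(metrics["hash_total"]),               # kept as string to avoid bigint overflow
    datetime.datetime.utcnow().isoformat(),
)]

(spark.createDataFrame(
        log_row,
        schema="run_id string, stage string, row_count long, hash_total string, logged_at string",
    )
    .write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.recon_log")       # hypothetical log table
    .option("user", sql_user)
    .option("password", sql_password)
    .mode("append")
    .save())
```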


Question: Which of the above approaches is recommended by Microsoft for better robustness, operational maintainability, and failure recovery? Are there any known best practices or limitations in using stored procedures via ADF for logging reconciliation metadata?


Azure Data Factory

1 answer

  1. Venkat Reddy Navari 3,625 Reputation points Microsoft External Staff Moderator
    2025-07-04T09:52:10.8066667+00:00

    Hi Janice Chi, great to see you're implementing a detailed reconciliation framework for your DB2 to Azure SQL Hyperscale migration. Both Option A (Databricks JDBC write) and Option B (ADF stored procedure call) are technically viable, but here's a breakdown based on robustness, maintainability, and failure recovery, aligned with recommended Microsoft practices:

    Recommended Approach: Option B – Use ADF to call a stored procedure

    Why Option B is preferred:

    • Operational Clarity & Separation of Concerns: Let Databricks focus purely on data transformation and reconciliation metric generation, and delegate logging/persistence to ADF. This simplifies error tracing and decouples concerns (a minimal hand-off sketch follows this list).
    • Better Failure Handling: ADF pipelines provide built-in retry policies, failure scopes, alerts, and granular activity-level monitoring when calling stored procedures.
    • Centralized Logging Logic: Logic inside stored procedures is versioned, centralized, and easier to audit or update without modifying code in Databricks notebooks.
    • Compliance and Audit-Readiness: Easier to enforce transactional boundaries (e.g., via BEGIN TRANSACTION) inside the SP for consistent audit logs.
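
    On the Databricks side, the hand-off for Option B can be as small as returning the metrics from the notebook; ADF can then read them from the Notebook activity output (an expression along the lines of @activity('<notebook activity>').output.runOutput) and map them onto the stored procedure's parameters. A minimal sketch, assuming metrics holds the computed values and the identifiers shown are placeholders:

    ```python
    import json

    # At the end of the reconciliation step, hand the metrics back to the calling pipeline.
    # dbutils.notebook.exit() surfaces this string to ADF as the Notebook activity's
    # runOutput, which the pipeline can pass to the stored procedure call.
    dbutils.notebook.exit(json.dumps({
        "run_id": "run-2025-07-04-001",
        "stage": "Load",
        "row_count": metrics["row_count"],
        "hash_total": str(metrics["hash_total"]),
    }))
    ```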

    When Option A may still be valid:

    • Extremely low-latency logging is needed, or you're outside ADF's orchestration path (e.g., Databricks jobs running independently).
    • The metadata write is complex or conditional and depends on in-memory logic only available in Databricks.

    Best practices if using ADF + SP:

    • Use the Stored Procedure activity in ADF with clear parameters for stage, row counts, hash, timestamp, etc.
    • Ensure your stored procedure handles idempotency (i.e., no duplicate inserts on retries); see the call sketch after this list.
    • Log failures and outputs of the stored procedure execution for traceability.
    • Use custom status codes or additional logging tables for enhanced observability if needed.
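
    To make the idempotency point concrete: whichever caller invokes the procedure (the ADF Stored Procedure activity in Option B, or a notebook in Option A), a retry must be safe to repeat. The sketch below assumes a hypothetical dbo.usp_log_recon_metrics procedure that upserts on (run_id, stage), so re-running it with the same parameters updates the existing row rather than inserting a duplicate; server, credentials, and object names are placeholders:

    ```python
    import pyodbc

    # Hypothetical idempotent logging call: the procedure is assumed to upsert on
    # (run_id, stage), so an ADF retry with the same parameters cannot create duplicates.
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=<server>.database.windows.net;Database=<db>;"
        "Uid=<user>;Pwd=<password>;Encrypt=yes;"
    )
    try:
        cursor = conn.cursor()
        cursor.execute(
            "EXEC dbo.usp_log_recon_metrics @run_id = ?, @stage = ?, @row_count = ?, @hash_total = ?",
            ("run-2025-07-04-001", "Load", 1_250_000, "987654321"),
        )
        conn.commit()
    finally:
        conn.close()
    ```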

    Docs reference (for the ADF Stored Procedure activity): Azure Data Factory – Stored Procedure activity


    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the answer if the information provided is helpful. This can assist other community members in resolving similar issues.

