Designing a partition-based ingestion strategy

Janice Chi 140 Reputation points
2025-06-10T14:07:07.7833333+00:00

We are currently designing a partition-based ingestion strategy from IBM DB2 (on-prem) to ADLS Gen2 using Azure Data Factory (ADF). Since DB2 is a database and does not have a file system, we cannot use ADF’s Get Metadata activity to estimate partition sizes like we typically do with files in blob or ADLS storage.

Our plan is to categorize data partitions as SMALL, MEDIUM, or LARGE based on estimated row counts or data volume, and then route each category to a different ADF Copy Activity optimized with distinct DIU settings and degree of parallelism.

Questions:

What is Microsoft’s recommended approach for dynamically estimating the size of each DB2 partition (e.g., row count or data volume) at runtime within ADF pipelines?

Can we use a Lookup + Script/Stored Proc + Set Variable strategy to update the control table with the size category and then use a switch activity to trigger the appropriate copy activity?

Are there any built-in ADF activities or recommended patterns for classifying database partitions by size before ingestion, given that Get Metadata doesn’t work for structured sources?

Any suggestions for tuning copy performance per size category (e.g., SMALL=2 DIUs, MEDIUM=8, LARGE=16+)?

Azure Data Factory

1 answer

  1. J N S S Kasyap 3,625 Reputation points Microsoft External Staff Moderator
    2025-06-10T15:17:16.2166667+00:00

    Hi @Janice Chi

    What is Microsoft’s recommended approach for dynamically estimating the size of each DB2 partition (e.g., row count or data volume) at runtime within ADF pipelines?

    Since ADF’s Get Metadata activity doesn't expose size or row-count metadata for relational sources like IBM DB2, Microsoft recommends using SQL-based metadata queries as part of a metadata-driven ingestion framework.

    1. Use a Lookup or Stored Procedure activity in ADF to run a partition-specific SQL query that counts rows at runtime:
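    -- @Start and @End stand for the partition boundary values; in ADF they would
    -- typically be supplied from pipeline parameters via dynamic content.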
    SELECT COUNT(*) AS row_count
    FROM schema.table
    WHERE partition_column BETWEEN @Start AND @End;
    
    

    This returns the partition's row count at runtime, which you can use as a proxy for its data volume.

    2. Write results to a control table

    Store partition_id, row_count, and size_category (e.g., SMALL, MEDIUM, LARGE) in a metadata/control table. This enables classification and downstream orchestration. 
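    A minimal control-table sketch (the table and column names here are illustrative assumptions, not a required schema) could look like this:

    -- Hypothetical control table for partition metadata
    CREATE TABLE ingestion_control (
        partition_id     VARCHAR(50)  NOT NULL PRIMARY KEY,
        partition_start  VARCHAR(50),      -- lower bound of partition_column
        partition_end    VARCHAR(50),      -- upper bound of partition_column
        row_count        BIGINT,           -- populated by the Lookup/Stored Proc step
        size_category    VARCHAR(10),      -- SMALL | MEDIUM | LARGE
        last_updated     TIMESTAMP
    );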

    3. Use ADF control flow to act on size

    • Use Set Variable to assign size category based on row count. 
    • Use Switch activity to trigger Copy activities with different DIU/parallelism settings. 

    Microsoft recommends using SQL queries via Lookup or Stored Proc to estimate row count per partition, then storing and acting on that metadata within ADF control flow logic. 
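    For example, the stored procedure could persist the row count and derive the category in a single statement; the parameter placeholders (@partition_id, @row_count) and the thresholds below are illustrative assumptions:

    -- Hypothetical classification step; thresholds are examples only.
    UPDATE ingestion_control
    SET row_count     = @row_count,
        size_category = CASE
                          WHEN @row_count > 10000000 THEN 'LARGE'
                          WHEN @row_count > 1000000  THEN 'MEDIUM'
                          ELSE 'SMALL'
                        END,
        last_updated  = CURRENT_TIMESTAMP
    WHERE partition_id = @partition_id;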

    Can we use a Lookup + Script/Stored Proc + Set Variable strategy to update the control table with the size category and then use a switch activity to trigger the appropriate copy activity?

    Yes, this is a recommended pattern and is commonly used in metadata-driven pipelines. 

    Pattern: 

    Lookup/Stored Proc Activity → Run COUNT(*) or size estimation logic. 

    Set Variable Activity → Assign size category (SMALL, MEDIUM, LARGE) based on thresholds. 

    Switch Activity → Route flow to different Copy activities based on size. 

    This gives you dynamic, runtime branching. 
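    For instance, the Lookup that feeds the Switch could simply read back the stored category (assuming the control table sketched above):

    -- Returns the value the Switch activity branches on
    SELECT size_category
    FROM ingestion_control
    WHERE partition_id = @partition_id;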

    Are there any built-in ADF activities or recommended patterns for classifying database partitions by size before ingestion, given that Get Metadata doesn’t work for structured sources?

     

    There is no built-in ADF activity that classifies database partitions by size the way Get Metadata does for file-based sources. However:

    Recommended approach: 

    • Implement partitioning logic in a control table. 
    • Use custom SQL logic + Lookup activity to dynamically classify partitions (see the sketch after this list). 
    • Rely on metadata-driven pipeline design for scalability and maintainability. 
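    If you prefer to classify on the fly rather than reading back a stored category, the Lookup's query can apply the thresholds directly (again, table name and thresholds are illustrative):

    SELECT partition_id,
           row_count,
           CASE
             WHEN row_count > 10000000 THEN 'LARGE'
             WHEN row_count > 1000000  THEN 'MEDIUM'
             ELSE 'SMALL'
           END AS size_category
    FROM ingestion_control;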

    Suggestions for tuning copy performance per partition size 

    Any suggestions for tuning copy performance per size category (e.g., SMALL=2 DIUs, MEDIUM=8, LARGE=16+)?

    Partition Size        Recommended DIUs   Parallelism   Tips
    SMALL  (<1M rows)     2                  2             Inline copy OK
    MEDIUM (1–10M rows)   8                  4–6           Enable staged copy
    LARGE  (>10M rows)    16+                8–16          Use high concurrency + batch reads

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

     

