Designing a partition-based ingestion strategy

Janice Chi 140 Reputation points
2025-06-10T14:07:07.7833333+00:00

We are currently designing a partition-based ingestion strategy from IBM DB2 (on-prem) to ADLS Gen2 using Azure Data Factory (ADF). Since DB2 is a database and does not have a file system, we cannot use ADF’s Get Metadata activity to estimate partition sizes like we typically do with files in blob or ADLS storage.

Our plan is to categorize data partitions as SMALL, MEDIUM, or LARGE based on estimated row counts or data volume, and then route each category to a different ADF Copy Activity optimized with distinct DIU settings and degree of parallelism.

Questions:

What is Microsoft’s recommended approach for dynamically estimating the size of each DB2 partition (e.g., row count or data volume) at runtime within ADF pipelines?

Can we use a Lookup + Script/Stored Proc + Set Variable strategy to update the control table with the size category and then use a switch activity to trigger the appropriate copy activity?

Are there any built-in ADF activities or recommended patterns for classifying database partitions by size before ingestion, given that Get Metadata doesn’t work for structured sources?

Any suggestions for tuning copy performance per size category (e.g., SMALL=2 DIUs, MEDIUM=8, LARGE=16+)?

Azure Data Factory

1 answer

  1. J N S S Kasyap 3,625 Reputation points Microsoft External Staff Moderator
    2025-06-10T15:17:16.2166667+00:00

    Hi @Janice Chi

    What is Microsoft’s recommended approach for dynamically estimating the size of each DB2 partition (e.g., row count or data volume) at runtime within ADF pipelines?

    Since ADF’s Get Metadata activity doesn't expose size or row-count metadata for relational sources like IBM DB2, Microsoft recommends using SQL-based metadata queries as part of a metadata-driven ingestion framework.

    1. Use a Lookup or Stored Procedure activity in ADF to run a partition-specific SQL query that counts rows at runtime:
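    -- @Start and @End stand for the partition boundary values; in ADF they would
    -- typically be supplied from pipeline parameters via dynamic content.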
    SELECT COUNT(*) AS row_count
    FROM schema.table
    WHERE partition_column BETWEEN @Start AND @End;
    
    

    This returns the partition's row count at runtime, which you can use as a proxy for its data volume.

    2. Write results to a control table

    Store partition_id, row_count, and size_category (e.g., SMALL, MEDIUM, LARGE) in a metadata/control table. This enables classification and downstream orchestration. 
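    A minimal control-table sketch (the table and column names here are illustrative assumptions, not a required schema) could look like this:

    -- Hypothetical control table for partition metadata
    CREATE TABLE ingestion_control (
        partition_id     VARCHAR(50)  NOT NULL PRIMARY KEY,
        partition_start  VARCHAR(50),      -- lower bound of partition_column
        partition_end    VARCHAR(50),      -- upper bound of partition_column
        row_count        BIGINT,           -- populated by the Lookup/Stored Proc step
        size_category    VARCHAR(10),      -- SMALL | MEDIUM | LARGE
        last_updated     TIMESTAMP
    );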

    3. Use ADF control flow to act on size

    • Use Set Variable to assign size category based on row count. 
    • Use Switch activity to trigger Copy activities with different DIU/parallelism settings. 

    Microsoft recommends using SQL queries via Lookup or Stored Proc to estimate row count per partition, then storing and acting on that metadata within ADF control flow logic. 
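    For example, the stored procedure could persist the row count and derive the category in a single statement; the parameter placeholders (@partition_id, @row_count) and the thresholds below are illustrative assumptions:

    -- Hypothetical classification step; thresholds are examples only.
    UPDATE ingestion_control
    SET row_count     = @row_count,
        size_category = CASE
                          WHEN @row_count > 10000000 THEN 'LARGE'
                          WHEN @row_count > 1000000  THEN 'MEDIUM'
                          ELSE 'SMALL'
                        END,
        last_updated  = CURRENT_TIMESTAMP
    WHERE partition_id = @partition_id;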

    Can we use a Lookup + Script/Stored Proc + Set Variable strategy to update the control table with the size category and then use a switch activity to trigger the appropriate copy activity?

    Yes, this is a recommended pattern and is commonly used in metadata-driven pipelines. 

    Pattern: 

    Lookup/Stored Proc Activity → Run COUNT(*) or size estimation logic. 

    Set Variable Activity → Assign size category (SMALL, MEDIUM, LARGE) based on thresholds. 

    Switch Activity → Route flow to different Copy activities based on size. 

    This gives you dynamic, runtime branching. 
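    For instance, the Lookup that feeds the Switch could simply read back the stored category (assuming the control table sketched above):

    -- Returns the value the Switch activity branches on
    SELECT size_category
    FROM ingestion_control
    WHERE partition_id = @partition_id;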

    Are there any built-in ADF activities or recommended patterns for classifying database partitions by size before ingestion, given that Get Metadata doesn’t work for structured sources?

     

    There is no built-in ADF activity that classifies database partitions by size the way Get Metadata does for file-based sources. However:

    Recommended approach: 

    • Implement partitioning logic in a control table. 
    • Use custom SQL logic + Lookup activity to dynamically classify partitions (see the sketch after this list). 
    • Rely on metadata-driven pipeline design for scalability and maintainability. 
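    If you prefer to classify on the fly rather than reading back a stored category, the Lookup's query can apply the thresholds directly (again, table name and thresholds are illustrative):

    SELECT partition_id,
           row_count,
           CASE
             WHEN row_count > 10000000 THEN 'LARGE'
             WHEN row_count > 1000000  THEN 'MEDIUM'
             ELSE 'SMALL'
           END AS size_category
    FROM ingestion_control;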

    Suggestions for tuning copy performance per partition size 

    Any suggestions for tuning copy performance per size category (e.g., SMALL=2 DIUs, MEDIUM=8, LARGE=16+)?

    Partition Size        Recommended DIUs   Parallelism   Tips
    SMALL  (<1M rows)     2                  2             Inline copy OK
    MEDIUM (1–10M rows)   8                  4–6           Enable staged copy
    LARGE  (>10M rows)    16+                8–16          Use high concurrency + batch reads

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

     

