Azure Batch Service - splitting or partitioning resource files

etl2016 1 Reputation point
2020-07-06T00:17:37.02+00:00

hi,

Is there a way to split or partition Azure Batch Service's Resource Files (input files) so that the burden of input file processing is load-balanced and distributed among all available compute nodes? For instance, if I have an input file with 100k rows, and my Azure Batch pool has 4 compute nodes, is it possible to split this into 25k rows for each node?

If this partition option is available, how do I set this feature up? Should it be done programmatically, or through the Azure Batch Service settings?

thank you

Azure Batch
An Azure service that provides cloud-scale job scheduling and compute management.

1 answer

  1. prmanhas-MSFT 17,906 Reputation points Microsoft Employee
    2020-07-06T09:27:56.947+00:00

    @etl2016-6749 Thank you for your question!!!

    There are a few ways to partition the job.

    1. The partitioning is handled on the application side. The application divides the file into 4 jobs, keeps track of the progress of all 4 jobs, and then performs the reduce/aggregation operation on their results.
      This is easier to handle on the application side but difficult to scale when multiple such jobs are running simultaneously.
    2. You can also perform the partitioning operation in the Batch service (as a custom job). This job can then spawn 4 jobs and 1 reduce/aggregation job, which is scheduled after the 4 jobs complete. (Dependent jobs are supported in the Batch service.)
      This is easier to scale but still requires handling on the application side, because the initial job will be complete while the sub-jobs are still running.
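
Both options rest on the same partition-then-aggregate pattern, which can be sketched in plain Python. This is only an illustration of the splitting logic, not Azure Batch SDK code: `partition_rows`, `process_chunk`, and `aggregate` are hypothetical names, and the per-chunk work (counting rows) is a stand-in. In an actual Batch job, each chunk would presumably be written to its own file and attached to its own task as a resource file, which is not shown here.

```python
from typing import List, Sequence


def partition_rows(rows: Sequence[str], node_count: int) -> List[Sequence[str]]:
    """Split rows into node_count contiguous chunks whose sizes differ by at most one."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    base, extra = divmod(len(rows), node_count)
    chunks = []
    start = 0
    for i in range(node_count):
        # The first `extra` chunks take one additional row each.
        size = base + (1 if i < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


def process_chunk(chunk: Sequence[str]) -> int:
    """Hypothetical per-task work: here it only counts the rows in the chunk."""
    return len(chunk)


def aggregate(partials: Sequence[int]) -> int:
    """Hypothetical reduce step: combines the per-chunk results."""
    return sum(partials)


# Assumed input from the question: 100k rows and 4 compute nodes.
rows = [f"row-{i}" for i in range(100_000)]
chunks = partition_rows(rows, 4)
partials = [process_chunk(chunk) for chunk in chunks]  # would run as 4 separate tasks
total = aggregate(partials)                            # would run after all 4 finish
```

With these inputs each of the 4 chunks holds 25,000 rows. Contiguous chunks keep the row order intact, which matters if the aggregation step depends on ordering; if it does not, any even split would do.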

    Hope this helps.

    Please 'Accept as answer' if the provided information is helpful, so that it can help others in the community looking for help on similar topics.
