How to solve Azure Batch node out of memory?

aot 56 Reputation points
2023-01-17T10:16:34.6933333+00:00

We're using Azure Batch to run a large number of custom jobs (Python scripts) through our Azure Data Factory. We are encountering an issue where our pool node (VM image: dsvm-win-2019) will process our datasets, but eventually crashes with FailureExitCode: 1. Looking at the stderr.txt from the job, we see the following error:

ImportError: DLL load failed while importing <module>: Not enough memory resources are available to process this command.

This suggests to us that the VM is somehow out of memory. Each custom job we launch creates its own task, and in our Data Factory we have Retention Time In Days set to 1. Our Batch custom jobs run for a couple of days before the VM crashes, and task cleanup appears to be happening.

As a start task, we run a pip install of a few required Python modules.
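For context, a start task of this shape is configured on the pool itself. A minimal sketch of the relevant pool-configuration fragment (the module names are placeholders, not our actual dependencies):

```json
"startTask": {
    "commandLine": "cmd /c pip install <module1> <module2>",
    "waitForSuccess": true
}
```

With `waitForSuccess` set to true, no tasks are scheduled on a node until the pip install has completed successfully.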

We're a bit at a loss for how to troubleshoot this. Our current workaround is to spin up a new pool when the crash happens and restart the pipeline, but we'd prefer not to run to failure.

We are puzzled as to how we get into this situation in the first place.

How can we avoid the out of memory issue for our VMs?


2 answers

  1. PRADEEPCHEEKATLA-MSFT 77,081 Reputation points Microsoft Employee
    2023-01-18T08:49:57.9866667+00:00

    Hello @aot,

    Thanks for the question and using MS Q&A platform.

    This error message indicates that the Azure Batch task in your Azure Data Factory pipeline is running out of memory. This can happen if the task processes a large amount of data, or if other tasks running concurrently on the same node are also using a lot of memory.

    There are a few things you can try to resolve this issue:

    1. Use a larger VM size for your Batch pool so each node has more memory available.
    2. Scale out the number of nodes in your Azure Batch pool to spread tasks across more total memory.
    3. Optimize your code to reduce the amount of memory it uses.
    4. Schedule your pipeline at off-peak hours to reduce contention for resources.
    5. Monitor the pipeline closely and adjust resources accordingly.
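For point 3, a common memory fix in Python data scripts is to stream the dataset in bounded chunks instead of loading it all at once. A minimal sketch (the helper name and chunk size are illustrative, not part of any Batch API):

```python
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items, so the full
    dataset never has to sit in memory at once."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Example: summing squares of a stream, 3 items at a time, with
# bounded memory; the result matches summing all at once.
total = sum(sum(x * x for x in batch) for batch in chunks(range(10), 3))
```

The same pattern applies when reading large files: iterate over the file object (or use a reader's chunked mode) rather than calling `read()` or loading the whole dataset into one in-memory structure.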

    Keep in mind that this error might also be caused by other factors, such as network issues, configuration errors, or insufficient storage, so it's worth going through the pipeline and checking that everything is set up correctly.

    Hope this helps. Do let us know if you have any further queries.


    Please don’t forget to Accept Answer wherever the information provided helps you, this can be beneficial to other community members.


  2. aot 56 Reputation points
    2023-02-10T09:59:26.14+00:00

    Hello @PRADEEPCHEEKATLA-MSFT

    After being in touch with MS Support, it was realised that the core of the problem was that the node VM ran out of virtual memory. The solution in our case was to use the Azure Batch autoscaling feature to do a regular scale-in of the pool to 0 nodes, then scale out again as tasks were submitted.

    Our autoscale settings (a fragment of the pool configuration) were:

    "resizeTimeout": "PT15M",
        "currentDedicatedNodes": 0,
        "currentLowPriorityNodes": 0,
        "targetDedicatedNodes": 0,
        "targetLowPriorityNodes": 0,
        "enableAutoScale": true,
        "autoScaleFormula": "startingNumberOfVMs = 1;
    maxNumberofVMs = 25;
    pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
    pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs  :avg($PendingTasks.GetSample(180 * TimeInterval_Second));
    $TargetDedicatedNodes=min(maxNumberofVMs, pendingTaskSamples);
    $NodeDeallocationOption = taskcompletion;",
        "autoScaleEvaluationInterval": "PT15M",