How to fix "ERROR: insufficient shared memory (shm)"?

Scheuplein, Joshua 5 Reputation points
2024-09-23T14:15:17.0966667+00:00

Hello,

I am facing the following error during a deep learning training job using PyTorch on 4 NVIDIA A100 GPUs with 80 GB of memory each, on a single compute node of type Standard_NC96ads_A100_v4:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/azureml-envs/azureml_39aa263cfb5a513ce373f7d1f5d6a8a2/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1131, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/azureml-envs/azureml_39aa263cfb5a513ce373f7d1f5d6a8a2/lib/python3.9/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/azureml-envs/azureml_39aa263cfb5a513ce373f7d1f5d6a8a2/lib/python3.9/threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)
  File "/azureml-envs/azureml_39aa263cfb5a513ce373f7d1f5d6a8a2/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 67, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 363) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

The training works as expected in the beginning. However, if the job gets interrupted due to the low priority of the compute node, the script fails at the start of the first epoch after the training resumes on a newly assigned compute node. Previously, I successfully tested both my training script and the job-resuming process using only 2 GPUs. Can you help explain this behavior and what exactly causes the out-of-memory error with respect to shared memory?

I will now try to use the following command in my job script to increase the shared-memory limit for the run:

from azure.ai.ml import command

job = command(
    # Some Arguments
    # ...
)

# Raise the shared-memory (/dev/shm) size of the job container.
job.set_resources(shm_size="256g")

returned_job = ml_client.jobs.create_or_update(job)

See: https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities.command?view=azure-python#azure-ai-ml-entities-command-set-resources
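
For background, each PyTorch DataLoader worker hands its batches to the main process through shared memory (/dev/shm), so more GPUs usually means more worker processes and more shared memory in flight; a /dev/shm that is too small on the newly assigned node then triggers exactly this bus error. Independent of the Azure-side fix above, here is a minimal PyTorch-side sketch of the usual mitigation; the TensorDataset is only a stand-in for a real dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; replace with the real training dataset.
dataset = TensorDataset(torch.randn(1024, 3, 64, 64),
                        torch.randint(0, 10, (1024,)))

# Fewer workers keep fewer batches in shared memory at once;
# num_workers=0 avoids /dev/shm entirely (loading then runs in the
# main process, at the cost of throughput).
loader = DataLoader(dataset, batch_size=32, num_workers=2, pin_memory=True)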

Is this the intended way to solve this issue?

Best regards!

Azure Machine Learning
An Azure machine learning service for building and deploying models.

1 answer

  1. Scheuplein, Joshua 5 Reputation points
    2024-09-26T15:23:43.5466667+00:00

    Yes, setting shm_size="32g" via set_resources() on the command() job solved the issue in my case.
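
    For anyone who wants to verify the new limit from inside the running job, a quick sketch using only the standard library (/dev/shm is the tmpfs whose size shm_size controls inside the container):

    import shutil

    # Report the size of /dev/shm as seen inside the job container.
    total, used, free = shutil.disk_usage("/dev/shm")
    print(f"/dev/shm: {total / 2**30:.1f} GiB total, {free / 2**30:.1f} GiB free")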

    1 person found this answer helpful.
